{"data":{"kind":"file","path":"README.md","version_id":"v9hh7bvdgjovxm6zti744xjq","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":9432,"modified_at":"2025-11-14T13:25:18.614000","content_hash":"c21174b783eb402f41c8919edb43321cf450bc36b8dbd5dd05368fb6d608389d"},"entries":[],"content":"# temporal-bench: Evaluating Temporal Reasoning in Language Models\n\n### Overview\n- **Environment ID**: `temporal-bench`\n- **Short description**: An evaluation environment that tests LLMs on their ability to reason about time sequences, durations, and temporal relationships.\n- **Tags**: eval, reasoning, single-turn, temporal, temporal-bench\n\n### Find this environment on Prime Intellect\nThis environment is available in the [Prime Intellect environments repository](https://app.primeintellect.ai/dashboard/environments/runes/temporal-bench). Prime Intellect is building a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI research and evaluation.\n\n### Citation\nIf you use this environment, please cite:\n- TRAM Paper: [TRAM: Benchmarking Temporal Reasoning for Large Language Models](https://arxiv.org/abs/2310.00835) by Yuqing Wang and Yun Zhao.\n- Hugging Face dataset: Warrieryes/TRAM-Temporal\n\n### Datasets\n- **Primary dataset(s)**: TRAM (Temporal Reasoning About Events) dataset, loaded via the Hugging Face `datasets` library\n- **Source links**: [TRAM-Temporal dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal)\n- **Split sizes**: The test split has 3,800 examples and the train split has 59,691 examples\n\n### Task Types\nThe TRAM dataset contains 38 types of temporal reasoning tasks, organized into thematic categories. 
You can filter by specific task type or by category:\n\n#### Category Table\n| Category | Task Types Available | Description |\n|----------|----------------------|-------------|\n| `ambiguity_resolution_*` | `ambiguity_resolution_interpretation`, `ambiguity_resolution_shift_calendar`, `ambiguity_resolution_shift_lt`, `ambiguity_resolution_shift_mt`, `ambiguity_resolution_shift_st` | Resolving ambiguous temporal expressions and references |\n| `arithmetic_*` | `arithmetic_application`, `arithmetic_date_computation`, `arithmetic_hour_adjustment(12h)`, `arithmetic_hour_adjustment (24h)`, `arithmetic_month_shift`, `arithmetic_time_computation`, `arithmetic_time_zone_conversion`, `arithmetic_week_identification`, `arithmetic_year_shift` | Tasks involving mathematical computations with temporal concepts |\n| `causality_*` | `causality_cause`, `causality_effect` | Identifying cause-and-effect relationships with temporal aspects |\n| `duration_*` | `duration_analogy_inference`, `duration_commonsense`, `duration_computation`, `duration_direct_comparison`, `duration_facts`, `duration_multi-step_comparison`, `duration_reading_comprehension` | Reasoning about lengths of time and duration concepts |\n| `frequency_*` | `frequency_application`, `frequency_commonsense`, `frequency_comparison`, `frequency_computation`, `frequency_facts`, `frequency_reading_comprehension` | Reasoning about how often events occur over time |\n| `nli` | `nli` | Natural Language Inference with temporal aspects |\n| `ordering_*` | `ordering_commonsense`, `ordering_facts` | Understanding sequence and temporal order of events |\n| `relation` | `relation` | Reasoning about temporal relations |\n| `storytelling` | `storytelling` | Temporal aspects in narrative contexts |\n| `typical_time_*` | `typical_time_comparsion`, `typical_time_commonsense`, `typical_time_facts`, `typical_time_reading_comprehension` | Understanding typical time durations and expectations |\n\n#### Individual Task Types\n\n| Task Type | 
Description | Count (Train) |\n|-----------|-------------|---------------|\n| ambiguity_resolution_interpretation | Resolving ambiguous temporal expressions through interpretation | 290 |\n| ambiguity_resolution_shift_calendar | Resolving temporal ambiguities with calendar shifts | 195 |\n| ambiguity_resolution_shift_lt | Resolving long-term temporal ambiguities | 495 |\n| ambiguity_resolution_shift_mt | Resolving medium-term temporal ambiguities | 1,249 |\n| ambiguity_resolution_shift_st | Resolving short-term temporal ambiguities | 895 |\n| arithmetic_application | Applying arithmetic to temporal problems | 1,937 |\n| arithmetic_date_computation | Computing dates using arithmetic | 5,895 |\n| arithmetic_hour_adjustment(12h) | 12-hour format hour adjustments | 1,395 |\n| arithmetic_hour_adjustment (24h) | 24-hour format hour adjustments | 1,395 |\n| arithmetic_month_shift | Shifting months using arithmetic | 35 |\n| arithmetic_time_computation | Computing time using arithmetic | 875 |\n| arithmetic_time_zone_conversion | Converting between time zones | 395 |\n| arithmetic_week_identification | Identifying weeks using arithmetic | 1,392 |\n| arithmetic_year_shift | Shifting years using arithmetic | 1,365 |\n| causality_cause | Identifying temporal causes | 195 |\n| causality_effect | Identifying temporal effects | 195 |\n| duration_analogy_inference | Duration inference using analogies | 695 |\n| duration_commonsense | Commonsense reasoning about durations | 210 |\n| duration_computation | Computing durations using arithmetic | 1,395 |\n| duration_direct_comparison | Direct comparisons of durations | 1,895 |\n| duration_facts | Factual knowledge about durations | 30 |\n| duration_multi-step_comparison | Multi-step duration comparisons | 1,395 |\n| duration_reading_comprehension | Reading comprehension with duration focus | 877 |\n| frequency_application | Applying frequency concepts to temporal problems | 1,895 |\n| frequency_commonsense | Commonsense reasoning 
about frequencies | 190 |\n| frequency_comparison | Comparing frequencies | 695 |\n| frequency_computation | Computing frequencies | 1,095 |\n| frequency_facts | Factual knowledge about frequencies | 43 |\n| frequency_reading_comprehension | Reading comprehension with frequency focus | 110 |\n| nli | Natural Language Inference with temporal aspects | 5,000 |\n| ordering_commonsense | Commonsense reasoning about temporal order | 2,357 |\n| ordering_facts | Factual knowledge about temporal order | 5,000 |\n| relation | Reasoning about temporal relations | 5,000 |\n| storytelling | Temporal aspects in storytelling | 5,000 |\n| typical_time_comparsion | Comparing typical time durations | 496 |\n| typical_time_commonsense | Commonsense about typical time durations | 113 |\n| typical_time_facts | Factual knowledge about typical time durations | 3,002 |\n| typical_time_reading_comprehension | Reading comprehension with typical time focus | 5,000 |\n\n**Note:** The test split contains 3,800 examples distributed across these task types.\n\nTo select specific task types, use the `task_type` environment argument as described below.\n\n### Task\n- **Type**: single-turn\n- **Parser**: Custom `TemporalBenchParser` that extracts the final lettered answer (e.g., 'B') or formatted answer from the model's output using regex.\n- **Rubric overview**: The reward is calculated by an exact match reward function, which returns 1.0 for correct answers and 0.0 for incorrect ones.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval temporal-bench\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval temporal-bench -m gpt-4.1-mini -n 20 -r 3 -t 8192 -T 0.7 -a '{\"dataset_split\":\"test\"}'\n```\n\nFor specific task filtering:\n\n```bash\n# Evaluate on specific task type\nuv run vf-eval temporal-bench -m gpt-4.1-mini -n 10 -a '{\"task_type\":\"arithmetic_date_computation\", \"dataset_split\":\"test\"}'\n\n# Evaluate on category of tasks (using 
wildcards)\nuv run vf-eval temporal-bench -m gpt-4.1-mini -n 10 -a '{\"task_type\":\"causality_*\", \"dataset_split\":\"test\"}'\n\n# Evaluate on all arithmetic-related tasks\nuv run vf-eval temporal-bench -m gpt-4.1-mini -n 10 -a '{\"task_type\":\"arithmetic_*\", \"dataset_split\":\"test\"}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The `task_type` parameter supports both specific task types and category wildcards (e.g., `causality_*`, `arithmetic_*`).\n\n### Environment Arguments\nThe environment supports the following arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_split` | str | `\"test\"` | Dataset split to use (`train` or `test`) |\n| `task_type` | str | `None` | Filter to a specific task type (e.g., 'ambiguity_resolution_interpretation', 'arithmetic_date_computation', 'ordering_commonsense') or to a category (e.g., 'causality_*', 'arithmetic_*', 'ambiguity_resolution_*'). Use a '*' suffix to filter by category prefix. If `None`, all task types are used. |\n| `system_prompt` | str | `\"Evaluate the sequence, duration, or timing of events in the given context. For multiple choice questions, respond with only the letter of the correct answer choice (such as 'A', 'B', 'C', etc.) with no additional text.\"` | System prompt to guide model behavior |\n\n### Using the Environment for Evaluation\nTo use this environment in your evaluations:\n1. Install the environment with: `prime env install runes/temporal-bench`\n2. Run evaluations with: `uv run vf-eval temporal-bench -m [your-model] -n [number-of-examples]`\n3. For specific task types: `uv run vf-eval temporal-bench -m [your-model] -n [number] -a '{\"task_type\":\"arithmetic_date_computation\"}'`\n4. 
For more details about the environment, visit: [https://app.primeintellect.ai/dashboard/environments/runes/temporal-bench](https://app.primeintellect.ai/dashboard/environments/runes/temporal-bench)\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward, 1.0 if the chosen answer is correct, else 0.0 |","encoding":"utf-8","truncated":false,"total_bytes":9432},"status":null}