{"data":{"kind":"file","path":"README.md","version_id":"jrm7f4vzaotg8eim7qimsy84","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3325,"modified_at":"2025-08-30T20:41:41.788000","content_hash":"d828afb273e5d77f87fb4b3bf12807bae4a2f01c21e840c074ad3f97e4fd7914"},"entries":[],"content":"# futurex-past\n\n### Overview\n- **Environment ID**: `futurex-past`\n- **Short description**: Single-turn historical QA over FutureX-Past Parquet shards with optional MCQ handling and exact-match scoring.\n- **Tags**: futurex, qa, single-turn, train, eval\n\n### Datasets\n- **Primary dataset(s)**: FutureX-Past (historical QA; some items have `options` for MCQ and a `level` field).\n- **Source**: Hugging Face Hub: `futurex-ai/Futurex-Past` (default). Local Parquet override supported via `data_dir`.\n- **Split sizes**: No fixed split; data is shuffled deterministically and optionally truncated via `num_train_examples`/`num_eval_examples`.\n\n### Task\n- **Type**: single-turn\n- **Parser**: XMLParser with `<think>` (optional) and `<answer>` fields; `use_think` toggles requiring `<think>`.\n- **Rubric overview**: Exact-match reward (supports MCQ label/text equivalence and normalization) + format reward for valid XML fields.\n\n### Quickstart\nRun an evaluation (defaults to HF dataset):\n\n```bash\nuv run vf-eval futurex-past\n```\n\nAlternatively use a local Parquet directory (override HF):\n\n```bash\nuv run vf-eval futurex-past -a '{\"data_dir\": \"/path/to/Futurex-Past/data\"}'\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval futurex-past \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\n        \"use_think\": true,\n        \"mcq_only\": null,\n        \"filter_levels\": [1,2,3],\n        \"num_train_examples\": 200,\n        \"num_eval_examples\": 100\n      }'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as JSON.\n- Data source priority: `data_dir` (local Parquet) → `dataset_name` (HF repo, default: `futurex-ai/Futurex-Past`).\n\n### Sanity Check Script\nRun a quick local check without a model by scoring gold answers (defaults to HF dataset):\n\n```bash\npython environments/futurex_past/scripts/sanity_check.py --num-train 10 --num-eval 3 --use-think true\n```\n\nThis prints a small JSON with rewards and metrics for a few examples.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `data_dir` | str | `null` | Directory containing FutureX-Past Parquet shards. |\n| `dataset_name` | str | `\"futurex-ai/Futurex-Past\"` | HF dataset repo to load when `data_dir` is not provided. |\n| `dataset_split` | str | `\"train\"` | HF dataset split name. |\n| `use_think` | bool | `true` | Require `<think>` + `<answer>`; if false, only `<answer>`. |\n| `num_train_examples` | int | `-1` | Limit train set size; `-1` means all. |\n| `num_eval_examples` | int | `-1` | Limit eval size; mirrors train when `-1`. |\n| `filter_levels` | list[int] | `null` | Keep rows with `level` in this list. |\n| `mcq_only` | bool or null | `null` | `true`: only MCQ; `false`: only non-MCQ; `null`: all. |\n| `lowercase_compare` | bool | `false` | Lowercase both sides before exact match. |\n| `strip_whitespace_compare` | bool | `true` | Strip whitespace before comparison. |\n| `system_prompt` | str | `null` | Custom system prompt override. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | 1.0 for exact match (after optional normalization/MCQ label mapping), else 0.0. |\n| `format_reward` | Bonus for providing required XML fields per parser. |\n\n### Evaluation Reports\n\nReports generated by CI or local runs will auto-render here when available.\n","encoding":"utf-8","truncated":false,"total_bytes":3325},"status":null}