{"data":{"kind":"file","path":"README.md","version_id":"zdc5h9peslm2n4x6ass6jru1","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2913,"modified_at":"2025-10-28T19:18:48.911000","content_hash":"b0c7830166cdfaccbc97dc0801cf53d2885b5ac652a62e369ab1cfaec48242cd"},"entries":[],"content":"# nanochatAquaRat Environment\n\nPrime Intellect verifier environment that mirrors the nanochat AQuA-RAT reinforcement-learning task: single-turn algebra questions with multiple-choice answers (letters A–E) scored by categorical accuracy.\n\n## Overview\n- **Hub ID**: `harleycooper/nanochatAquaRat`\n- **Task type**: Single-turn chat\n- **Parser**: `verifiers.Parser` with a custom A–E letter extractor\n- **Rubric**: Exact-match reward (weight 1.0) plus valid-letter format bonus (weight 0.1)\n\n## Dataset\n- **Source**: [deepmind/aqua_rat](https://huggingface.co/datasets/deepmind/aqua_rat)\n- **Content**: ~97k algebra word problems, five answer options, human rationale, gold letter.\n- **Default splits**: `train` for rollouts, `validation` for evaluation (configurable).\n- **Metadata**: question stem, options, and optional rationale retained per example.\n\nBy default the loader streams from Hugging Face. 
For offline use, pass `data_dir=/path/to/aqua` where that directory contains `train.jsonl`, `validation.jsonl`, and `test.jsonl` generated via `scripts/prepare_aqua.py` in the base repository.\n\n## Quickstart\nEvaluate a model on the validation set:\n\n```bash\nuv run vf-eval harleycooper/nanochatAquaRat -m gpt-4o-mini -n 25\n```\n\nKick off GRPO training (LoRA-friendly defaults shown):\n\n```bash\nuv run vf-rl @ configs/rl/nanochat.toml\n```\n\nExample `configs/rl/nanochat.toml` excerpt:\n\n```toml\nmodel = \"Qwen/Qwen2.5-7B-Instruct\"\n\n[env]\nid = \"harleycooper/nanochatAquaRat\"\n\n[env.args]\nnum_train_examples = 2000\nnum_eval_examples = 254\nseed = 42\n\n[trainer.args]\nlearning_rate = 2e-5\nrollouts_per_example = 8\nmax_steps = 400\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str | Algebra tutoring instruction | Prepended system message |\n| `train_split` | str | `\"train\"` | Dataset split used for rollouts |\n| `eval_split` | str\\|null | `\"validation\"` | Split for evaluation (`null` reuses train) |\n| `num_train_examples` | int | `-1` | Cap on rollout examples after shuffling |\n| `num_eval_examples` | int | `-1` | Cap on evaluation examples |\n| `seed` | int\\|null | `42` | Deterministic shuffle seed for the train split |\n| `include_rationale_metadata` | bool | `true` | Include human rationale text in metadata |\n| `data_dir` | str\\|null | `null` | Local directory containing JSON/JSONL splits |\n| `cache_dir` | str\\|null | `null` | Hugging Face cache override |\n\nPass overrides with `vf-eval ... 
--env-args '{\"num_train_examples\": 5000}'`.\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted reward (exact-match + format bonus) |\n| `exact_match_reward` | Raw exact-match signal prior to weighting |\n| `format_reward` | Bonus for emitting a valid letter token |\n\n`reward` aligns with the `rl/acc` tracking used in the nanochat RL scripts, so you can compare outcomes across training setups.\n","encoding":"utf-8","truncated":false,"total_bytes":2913},"status":null}