{"data":{"kind":"file","path":"README.md","version_id":"hekvoutcyvz1a602bo5c4qgn","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3598,"modified_at":"2026-01-21T02:18:02.641000","content_hash":"9e02fda69f37ef5aa4719d4f50b0abfc4b9aeb7eb8bdcf737b5600f083efedaf"},"entries":[],"content":"# gsm8k-multireward\n\n### Overview\n- **Environment ID**: `gsm8k-multireward`\n- **Short description**: GSM8K with multi-reward support (correctness + gated length)\n- **Tags**: math, single-turn, think, boxed-answer, multi-reward\n\n### Datasets\n- **Primary dataset(s)**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)\n- **Source links**: [HuggingFace](https://huggingface.co/datasets/openai/gsm8k)\n- **Split sizes**: train (7,473), test (1,319)\n\n### Task\n- **Type**: single-turn\n- **Parser**: MaybeThinkParser (handles `<think>...</think>` blocks, extracts `\\boxed{}` answers)\n- **Rubric overview**:\n  - `correct_answer`: Binary (0/1) from math-verify rule-based grading\n  - `length_reward`: Binary (0/1) **GATED on correctness** - only fires when correct AND under threshold\n\n### Gated Length Reward\n\nThe length reward is **gated on correctness** to prevent reward hacking:\n- If answer is **wrong**: length_reward = 0.0 (regardless of length)\n- If answer is **correct AND under threshold**: length_reward = 1.0\n- If answer is **correct but over threshold**: length_reward = 0.0\n\nThis ensures the model can't game the system by generating short incorrect answers.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval gsm8k-multireward\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval gsm8k-multireward \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"length_threshold_tokens\": 1024}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `dataset_name` | str | `\"openai/gsm8k\"` | HuggingFace dataset name |\n| 
`dataset_subset` | str | `\"main\"` | Dataset subset/config |\n| `dataset_split` | str | `\"train\"` | Dataset split |\n| `system_prompt` | str \\| None | `None` | Optional system prompt |\n| `length_threshold_tokens` | int \\| None | `1024` | Token threshold for gated length reward (None to disable) |\n| `length_reward_weight` | float | `1.0` | Weight for length reward in final score |\n| `chars_per_token` | float | `4.0` | Estimated characters per token |\n| `math_verify_max_workers` | int | `10` | Thread pool size for math verification |\n| `math_verify_timeout` | float | `5.0` | Timeout for math verification (seconds) |\n\n### Usage for GRPO vs GDPO Comparison\n\n**GDPO (per-reward normalization):**\n```toml\n[[orchestrator.env]]\nid = \"gsm8k-multireward\"\nargs = { length_threshold_tokens = 1024 }\nreward_keys = [\"correct_answer\", \"length_reward\"]\nreward_weights = [1.0, 1.0]\n```\n\n**GRPO (single aggregated reward):**\n```toml\n[[orchestrator.env]]\nid = \"gsm8k-multireward\"\nargs = { length_threshold_tokens = 1024 }\n# No reward_keys = uses single aggregated reward\n```\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of all reward functions |\n| `correct_answer` | Binary correctness from math verification (0 or 1) |\n| `length_reward` | Gated binary length reward (1 if correct AND under threshold) |\n\n### Example Output\n\nAfter scoring, `state[\"metrics\"]` contains the individual rewards:\n\n```python\n# Correct and concise\nstate[\"metrics\"] = {\"correct_answer\": 1.0, \"length_reward\": 1.0}\nstate[\"reward\"] = 2.0\n\n# Correct but verbose\nstate[\"metrics\"] = {\"correct_answer\": 1.0, \"length_reward\": 0.0}\nstate[\"reward\"] = 1.0\n\n# Incorrect (length reward gated off)\nstate[\"metrics\"] = {\"correct_answer\": 0.0, \"length_reward\": 0.0}\nstate[\"reward\"] = 0.0\n```\n\n### References\n\n- [GSM8K Paper](https://arxiv.org/abs/2110.14168) - Training Verifiers to Solve Math Word Problems\n- 
GDPO - Group Decoupled Policy Optimization\n- [GRPO](https://arxiv.org/abs/2402.03300) - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models\n","encoding":"utf-8","truncated":false,"total_bytes":3598},"status":null}