{"data":{"kind":"file","path":"README.md","version_id":"ikn1zxy7fnp9ejn2dm6wgimy","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4546,"modified_at":"2026-04-13T17:37:21.943000","content_hash":"db6e0216eeb311deb4969e0946ca1c3e3babe2f9f496f56653b4b87d975e8572"},"entries":[],"content":"# hendrycks-math\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/hendrycks_math\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\n### Overview\n- **Environment ID**: `hendrycks-math`\n- **Short description**: Hendrycks MATH evaluation matching OLMES `minerva_math::tulu` methodology, used to evaluate OTA Tulu-3 specialist SFT checkpoints in prime-rl.\n- **Tags**: math, eval, single-turn\n\n### Provenance\n- Reproduces the OLMES `tulu_3_dev` task suite's `minerva_math::tulu` configuration.\n- Integrated into prime-rl for evaluating OTA Tulu-3 specialist SFT checkpoints (math-reasoning, coding, general, knowledge-recall, precise-if) alongside the `tulu3_ref` baseline.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **34.0%** | 31.5% | +2.5pp |\n\n- **Metric**: boxed-answer match (MathRubric, slightly lenient)\n- **Sampling**: greedy, n=5000 across 7 subjects\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. 
Once the rubric leniency described below (roughly +2 to +3pp) is accounted for, a score within ±2pp of the paper indicates the env is mirroring OLMES correctly.\n\n### Datasets\n- **Primary dataset(s)**: `EleutherAI/hendrycks_math` (all 7 subject configs)\n- **Source links**: [HF](https://huggingface.co/datasets/EleutherAI/hendrycks_math)\n- **Split sizes**: 5000 test problems across 7 subjects: `algebra`, `counting_and_probability`, `geometry`, `intermediate_algebra`, `number_theory`, `prealgebra`, `precalculus`\n\n### Task\n- **Type**: single-turn, math, eval\n- **Parser**: `MaybeThinkParser(extract_boxed_answer)` — extracts the final `\\boxed{...}` answer, tolerating an optional `<think>...</think>` prefix\n- **Rubric overview**: `MathRubric` — awards `correct` when the extracted answer matches the gold answer via math-verify (sympy equivalence with fraction/surface-form normalization)\n\n### OLMES methodology faithfulness\nThis environment mirrors the prompt construction and generation settings of OLMES `minerva_math::tulu` exactly (scoring differs; see the leniency caveat below):\n- **4-shot Minerva fewshot** drawn from OLMES `Minerva:MATH` (shared `olmes_fewshot.json`)\n- `fewshot_as_multiturn=True` — each shot is a separate `user`/`assistant` turn in the chat prompt\n- Query format: `\"Problem:\\n{problem}\\n\\nSolution:\"` (Minerva CoT style)\n- **No `assistant_prefix`** — OLMES `minerva_math::tulu` explicitly sets this to `None`, so the model generates freely from an empty assistant turn\n- **Empty `stop_sequences`** — relies on the chat template's natural stop tokens (e.g. `<|im_end|>`)\n- Assistant fewshot target is `\" \" + solution` (matches OLMES `doc_to_target`)\n- Gold answer is extracted from `\\boxed{...}` in each example's reference solution\n\n### MathRubric leniency caveat\nOLMES uses strict string-match against the extracted boxed answer. `MathRubric` here uses `math-verify` sympy equivalence, which normalizes fractions, equivalent surface forms, and units. In practice this adds roughly **+2 to +3 percentage points** above what the same completions would score under OLMES.
Keep this in mind when comparing against published OLMES numbers (e.g. the Tulu-3 paper).\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run hendrycks-math\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str or `None` | `None` | System prompt shown to the model |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `correct` | 1.0 if the extracted `\\boxed{...}` answer is equivalent to the gold answer (via `MathRubric` / `math-verify`), else 0.0 |\n\n### Reference Results\n\nOTA Tulu-3 SFT checkpoints evaluated with this environment:\n\n| Model | Accuracy | Notes |\n| ----- | -------- | ----- |\n| `tulu3_ref` | 34.0% | Paper reports 31.5% under OLMES; Δ+2.5pp from MathRubric leniency |\n| `math_reasoning` specialist | **40.9%** | Best across all 6 models — specialization works on math |\n| `coding` specialist | 23.6% | |\n| `general` specialist | 26.0% | |\n| `knowledge_recall` specialist | 23.1% | Partial (4141/5000 problems completed; rest timed out) |\n| `precise_if` specialist | 24.3% | |\n","encoding":"utf-8","truncated":false,"total_bytes":4546},"status":null}