{"data":{"kind":"file","path":"README.md","version_id":"yu383ss08vhvdu8m2kjruc24","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4697,"modified_at":"2026-04-13T17:37:21.938000","content_hash":"22e8b7ccbdd89f2cad08f2fe83bfde17f50ae6518172de78d6fda1276fcd1cf7"},"entries":[],"content":"# gsm8k-olmes\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/gsm8k_olmes\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\n### Overview\n- **Environment ID**: `gsm8k-olmes`\n- **Short description**: GSM8K evaluation matching the OLMES `tulu_3_dev_no_safety` task suite methodology exactly.\n- **Tags**: math, eval, single-turn\n\n### Provenance\nThis environment exists to evaluate the [OTA Tulu-3](https://allenai.org/papers/tulu-3) specialist SFT checkpoints inside the prime-rl framework with results directly comparable to the numbers reported in the Tulu-3 paper. The original Tulu-3 paper uses [OLMES](https://github.com/allenai/olmes) with the `tulu_3_dev_no_safety` task suite for in-development evaluation; this environment is a faithful port of the GSM8K task from that suite into the verifiers API so that specialist merges and ablations can be scored consistently against the published baselines.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **76.4%** | 76.2% | +0.2pp |\n\n- **Metric**: exact_match (last-number extraction)\n- **Sampling**: greedy, n=1319\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. Numbers within ±2pp of the paper indicate the env is correctly mirroring OLMES.\n\n### Datasets\n- **Primary dataset**: `gsm8k` (config `main`, split `test`)\n- **Source links**: [HF](https://huggingface.co/datasets/gsm8k), [paper](https://arxiv.org/abs/2110.14168)\n- **Split size**: 1319 test problems\n- **Fewshot**: 8 standard examples loaded from `../olmes_fewshot.json` (sourced from OLMES `fewshot_sources.py` `STD:GSM8k`)\n\n### Task\n- **Type**: single-turn\n- **Parser**: `MaybeThinkParser`\n- **Rubric**: `exact_match` (weight 1.0) comparing the last number extracted from the model response to the gold `#### NUMBER` answer.\n\n### OLMES methodology faithfulness\nThe key to reproducing published numbers is matching OLMES exactly. This environment mirrors the following OLMES choices:\n\n- **8-shot standard fewshot** drawn from OLMES `STD:GSM8k`.\n- **`fewshot_as_multiturn=True`** — each `Question:`/`Answer:` pair becomes a separate user/assistant turn rather than being concatenated into one prompt.\n- **`\"Question: {q}\"` / `\"Answer: {a}\"`** formatting for every turn.\n- **`assistant_prefix=\"Answer:\"`** — the evaluation appends a trailing assistant message whose content is `\"Answer:\"` and sets `extra_body={\"continue_final_message\": True, \"add_generation_prompt\": False}` so that vLLM continues the trailing assistant message instead of starting a fresh one. This is the load-bearing detail: without the prefix, instruction-tuned models drift into verbose chain-of-thought mode and lose roughly 9pp on Tulu-3 specialists.\n- **Last-number extraction** exactly matching OLMES: commas-in-numbers are stripped, then `re.findall(r\"[-+]?\\d*\\.\\d+|\\d+\", text)[-1]` picks the final numeric token.\n- **Greedy decoding** (temperature 0.0).\n- **Stop sequences**: `[\"Question:\", \"</s>\", \"<|im_end|>\"]`.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run gsm8k-olmes\n```\n\nRecommended sampling settings: `temperature=0.0` (greedy) to match OLMES.\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The `continue_final_message` sampling arg is set by the environment automatically; don't override `extra_body` unless you know what you're doing.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str or `None` | `None` | Optional system prompt shown to the model. OLMES uses none. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `exact_match` | 1.0 if the last number in the model response equals the gold answer (after comma stripping and last-number extraction on both sides), else 0.0 |\n\n### Reference results\n\nGreedy decoding, 1319 test problems.\n\n| Checkpoint | GSM8K (this env) | GSM8K (paper) | Δ |\n| ---------- | ---------------: | ------------: | ----: |\n| `tulu3_ref` | 76.4 | 76.2 | +0.2 |\n\nOTA Tulu-3 specialists (prime-rl SFT, same env, greedy):\n\n| Specialist | GSM8K |\n| ---------- | ----: |\n| `math_reasoning` | 77.1 |\n| `precise_if` | 60.9 |\n| `general` | 60.5 |\n| `coding` | 60.2 |\n| `knowledge_recall` | 57.0 |\n","encoding":"utf-8","truncated":false,"total_bytes":4697},"status":null}