{"data":{"kind":"file","path":"README.md","version_id":"u0mfqw2ywbyb0odiv897s61m","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6470,"modified_at":"2026-04-13T17:37:42.140000","content_hash":"41e40f4379545a2867ed91752aac736add502ffd0a27ba4f305e7e65910fe798"},"entries":[],"content":"# bbh\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/bbh\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\n### Overview\n- **Environment ID**: `bbh`\n- **Short description**: BIG-Bench Hard (Suzgun et al. 2022) — 27 challenging reasoning subtasks spanning logical, spatial, commonsense, algorithmic, and language reasoning, evaluated with 3-shot Chain-of-Thought prompting.\n- **Tags**: `bbh`, `reasoning`, `eval`, `single-turn`, `cot`\n\n### Provenance\n- Reproduction of the OLMES `bbh:cot-v1::tulu` configuration used by the Tulu-3 dev task suite.\n- Integrated with the prime-rl framework to evaluate OTA Tulu-3 specialist SFT checkpoints under an OLMES-faithful methodology.\n- Covers 27 subtasks across ~5 capability clusters: logical reasoning, math/counting, commonsense/social, algorithmic/language, and spatial/object tracking.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **69.0%** | 69.7% | -0.7pp |\n\n- **Metric**: per-subtask answer-extraction match\n- **Sampling**: greedy, n=6511 across 27 subtasks\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. 
Numbers within ±2pp of the paper indicate the env is correctly mirroring OLMES.\n\n### Datasets\n- **Primary dataset**: [`lukaemon/bbh`](https://huggingface.co/datasets/lukaemon/bbh) — all 27 BBH subtasks.\n- **Few-shot source**: `environments/olmes_fewshot.json` (OLMES-provided 3-shot CoT exemplars keyed by `bbh_<task>`).\n- **Split sizes**: 27 subtasks × ~250 problems each = **6511 total evaluation samples**.\n\n### Task\n- **Type**: single-turn, reasoning, eval\n- **Parser**: `MaybeThinkParser` (tolerates optional `<think>` blocks)\n- **Rubric**: `exact_match` — 1.0 if the extracted answer matches the gold target for that subtask (case-insensitive, with MCQ-letter normalization for `(A)`-style answers).\n\n### OLMES methodology faithfulness\n- **3-shot CoT, `bbh:cot-v1` style**: the three OLMES exemplars are **concatenated into a single user message** (preceded by `Q: ... A: Let's think step by step.` framing) rather than delivered as separate multi-turn user/assistant exchanges. This matches `cot-v1` (not `cot-v2`, which uses multi-turn few-shot).\n- **Answer extraction**: primary regex `[Tt]he answer is\\s*[:\\s]*(.+?)(?:\\.|$)` to isolate the answer span, followed by a **per-subtask regex** (e.g. `\\([A-Z]\\)` for MCQ subtasks, `[tT]rue|[fF]alse` for `boolean_expressions`, `-?\\d+` for `multistep_arithmetic_two`, etc.) 
applied to that span.\n- **Scoring**: case-insensitive exact match, with a fallback that normalizes MCQ letter answers (`(A)` ↔ `A`).\n- **Greedy decoding**: temperature 0, matching OLMES defaults.\n\n### Quickstart\nRun an evaluation with default settings (all 27 subtasks, 3-shot CoT):\n\n```bash\nprime eval run bbh\n```\n\nRestrict to a subset of subtasks:\n\n```bash\nprime eval run bbh -a '{\"tasks\": [\"boolean_expressions\", \"date_understanding\", \"word_sorting\"]}'\n```\n\nDisable few-shot (0-shot CoT) for ablation:\n\n```bash\nprime eval run bbh -a '{\"num_shots\": 0}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `tasks` | list[str] or `None` | `None` (all 27) | Subset of BBH subtask names to evaluate; `None` uses the full 27-task suite. |\n| `num_shots` | int | `3` | Number of few-shot CoT exemplars prepended per subtask. Set to `0` for zero-shot. |\n| `system_prompt` | str or `None` | `None` | System prompt shown to the model. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `correct` | 1.0 if the extracted answer matches the gold target under the subtask's regex, else 0.0 (reported via `exact_match`). 
|\n\n### Subtask coverage (27 total)\n\nGrouped into approximate capability clusters:\n\n- **Logical reasoning (7)**: `boolean_expressions`, `formal_fallacies`, `logical_deduction_three_objects`, `logical_deduction_five_objects`, `logical_deduction_seven_objects`, `web_of_lies`, `causal_judgement`\n- **Math & counting (3)**: `multistep_arithmetic_two`, `object_counting`, `penguins_in_a_table`\n- **Commonsense & social (5)**: `sports_understanding`, `movie_recommendation`, `snarks`, `ruin_names`, `disambiguation_qa`\n- **Algorithmic & language (4)**: `dyck_languages`, `word_sorting`, `hyperbaton`, `salient_translation_error_detection`\n- **Spatial & object tracking (8)**: `navigate`, `geometric_shapes`, `date_understanding`, `temporal_sequences`, `reasoning_about_colored_objects`, `tracking_shuffled_objects_three_objects`, `tracking_shuffled_objects_five_objects`, `tracking_shuffled_objects_seven_objects`\n\n### Reference results (OTA Tulu-3 specialists)\n\n| Model | BBH macro-avg |\n| ----- | ------------- |\n| `tulu3_ref` (reference SFT) | **69.0%** (paper: 69.7%, Δ −0.7 — close match) |\n| `coding` specialist | ~65% |\n| `general` specialist | ~65% |\n| `knowledge_recall` specialist | ~64% |\n| `precise_if` specialist | ~66% |\n| `math_reasoning` specialist | **16.4%** (catastrophic forgetting) |\n\n### Note on the `math_reasoning` collapse\n\nThe `math_reasoning` SFT checkpoint regresses on **22 of 27 subtasks**, but the regression is **not uniform**:\n\n- **Symbolic subtasks survive** (minor or no regression): `boolean_expressions`, `geometric_shapes`, and other purely formal tasks — the model retained the step-by-step symbolic manipulation style from math fine-tuning.\n- **MCQ-style subtasks collapse** dramatically: `sports_understanding` drops ~70pp, `date_understanding` drops ~63pp. 
The root cause is that `math_reasoning` training shifted the output convention away from emitting `(A)`-style multiple-choice letters and the \"the answer is X\" trailing pattern — so answer extraction fails even when the underlying reasoning is roughly correct.\n\nThis pattern (capability retained, output convention lost) is an output-format regression rather than a reasoning regression, and is the dominant failure mode for domain-specialized SFT on multi-capability evaluations.\n","encoding":"utf-8","truncated":false,"total_bytes":6470},"status":null}