{"data":{"kind":"file","path":"README.md","version_id":"b49hdaf67rtvx2eoxe7f2qaf","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5039,"modified_at":"2026-04-13T17:37:42.126000","content_hash":"72756a2a6ec0d18e0a2c2eb25f9c0fc98e16782ad0593280e8ae2182e3204f75"},"entries":[],"content":"# humaneval\n\n### Overview\n- **Environment ID**: `humaneval`\n- **Short description**: HumanEval code-generation evaluation matching the OLMES `tulu_3_dev` `codex_humaneval::tulu` task configuration.\n- **Tags**: code, eval, single-turn\n\n### Provenance\nReproduces the OLMES `codex_humaneval::tulu` task used to evaluate Tulu-3 checkpoints (Chen et al. 2021, *Evaluating Large Language Models Trained on Code*). Integrated into prime-rl to evaluate OTA Tulu-3 specialist SFT checkpoints alongside the other `tulu_3_dev` suite environments.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **84.9% (pass@10)** | 86.2% | -1.3pp |\n\n- **Metric**: subprocess pass/fail\n- **Sampling**: temp=0.7, 20 rollouts × 164 problems\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. 
Numbers within ±2pp of the paper indicate the env is correctly mirroring OLMES.\n\n### Datasets\n- **Primary dataset**: `openai_humaneval`, `test` split\n- **Source**: [HF](https://huggingface.co/datasets/openai_humaneval), [GitHub](https://github.com/openai/human-eval)\n- **Split size**: 164 problems\n\n### Task\n- **Type**: single-turn, code, eval\n- **Parser**: `MaybeThinkParser` (strips `<think>…</think>` if present)\n- **Rubric**: `code_correctness` — subprocess execution of extracted code + dataset `check` function; 1.0 if `returncode == 0`, else 0.0\n\n### OLMES methodology faithfulness\n- **Subprocess execution with hard timeout.** Each candidate is run via `subprocess.run([sys.executable, \"-c\", code], timeout=5.0)`. A subprocess is used instead of a thread because a thread stuck in an infinite loop cannot be killed and would hang the eval, whereas a timed-out subprocess is terminated cleanly.\n- **Code extraction handles three response shapes**:\n  1. Markdown block: ```` ```python ... ``` ````\n  2. Plain code (response starts with `def`/`from`/`import` or leading indentation)\n  3. Explanation text followed by the function definition\n- **Prompt preamble is preserved on full-function rewrites.** When the model emits the full `def <entry_point>` (instead of just the body), the extractor prepends the prompt's preamble (imports such as `from typing import List`, helper definitions). Without this, ~7-8% of otherwise-correct responses fail with `NameError: name 'List' is not defined`.\n- **Stop sequences for the plain-code path**: `[\"\\nclass \", \"\\ndef \", \"\\nif \", \"\\nprint(\", \"\\n#\", \"\\n```\"]`. They are applied only past the prompt boundary, so the target function is never truncated.\n- **Recommended sampling**: temperature 0.7, 20 rollouts per problem, score as **pass@10** (per OLMES). 
Greedy `temperature=0` is also supported for pass@1.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run humaneval\n```\n\nRun pass@10 with 20 samples at `T=0.7`:\n\n```bash\nprime eval run humaneval -n 20 -T 0.7\n```\n\nOverride the execution timeout:\n\n```bash\nprime eval run humaneval -a '{\"timeout\": 10.0}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str or `None` | `None` | System prompt shown to the model. |\n| `timeout` | float | `5.0` | Per-candidate subprocess execution timeout, in seconds. |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `code_correctness` | 1.0 if all tests in the dataset `check` function pass (subprocess exits 0), else 0.0. |\n\n### Reference results\n\nOTA Tulu-3 specialist SFT checkpoints evaluated inside prime-rl. pass@1 is greedy (`T=0`); pass@10 is estimated from 20 samples at `T=0.7`.\n\n| Checkpoint | pass@1 (greedy) | pass@10 (T=0.7, n=20) |\n| ---------- | --------------- | --------------------- |\n| `tulu3_ref` | — | **84.9%** (paper 86.2%, Δ −1.3pp) |\n| `coding` | — | 77.0% |\n| `precise_if` | — | 77.7% |\n| `general` | — | 76.5% |\n| `knowledge_recall` | — | 7.4% |\n| `math_reasoning` | — | 10.6% |\n\nThe `knowledge_recall` specialist collapses because its terse QA voice rarely emits a function body; `math_reasoning` shows catastrophic forgetting on code.\n\n### Notable extractor behavior\nMany instruction-tuned models rewrite the full function signature rather than continue the prompt. Naively evaluating the response alone drops any imports declared in the HumanEval prompt preamble (e.g. `from typing import List`), causing a `NameError` in the test harness despite a correct implementation. 
The extractor detects this case by matching the prompt's `def <entry_point>` line in the response and prepends only the preamble (prompt text up to but excluding the target `def`). Plain-body continuations fall back to prepending the entire prompt, as in the reference HumanEval harness.\n","encoding":"utf-8","truncated":false,"total_bytes":5039},"status":null}