{"data":{"kind":"file","path":"README.md","version_id":"i01sl66wndkg3nd80p806p8p","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4559,"modified_at":"2026-04-13T17:37:42.133000","content_hash":"9a51a2b615987804e7d419657d0f7592be4f799216ceb97a5cf8be635869acd7"},"entries":[],"content":"# humanevalplus\n\n### Overview\n- **Environment ID**: `humanevalplus`\n- **Short description**: HumanEval+ code generation eval (EvalPlus extension with hardened test cases).\n- **Tags**: code, eval, single-turn\n\n### Provenance\nHumanEval+ (Liu et al. 2023, *\"Is Your Code Generated by ChatGPT Really Correct?\"*) extends the\noriginal OpenAI HumanEval benchmark with significantly more comprehensive test cases that catch\nedge-case bugs missed by the base suite. It belongs to the same OLMES `codex_humaneval` family\nand is used inside the prime-rl framework to evaluate the OTA Tulu-3 specialist SFT checkpoints\nalongside `humaneval`.\n\n### Reference model & reproduction\n\n**Reference SFT model**: [`allenai/Llama-3.1-Tulu-3-8B-SFT`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) — the Tulu-3 paper's 8B SFT checkpoint built on Llama-3.1-8B.\n\n| Score | Tulu-3 paper | Δ |\n|------:|-------------:|---:|\n| **68.6% (pass@10)** | 81.4% | -12.8pp |\n\n- **Metric**: subprocess pass/fail (extended tests)\n- **Sampling**: temp=0.7, 20 rollouts × 164 problems\n\nThis environment was built to reproduce the OLMES `tulu_3_dev` evaluation methodology faithfully, so the Tulu-3 SFT reference number above is the canonical \"is the env correct?\" sanity check. Numbers within ±2pp of the paper indicate the env is correctly mirroring OLMES.\n\n### Datasets\n- **Primary dataset**: `evalplus/humanevalplus` (HF), `test` split — 164 problems\n- Each record exposes `prompt`, `entry_point`, `task_id`, and `plus_tests` (the extended test\n  harness). 
When `plus_tests` is missing on a record, the environment falls back to that record's `test` field.\n\n### Task\n- **Type**: single-turn, code, eval\n- **Parser**: `MaybeThinkParser`\n- **Executor**: Each candidate completion is concatenated with the test harness and executed\n  in a Python subprocess (`subprocess.run([sys.executable, \"-c\", ...])`) with a hard timeout.\n- **Default timeout**: `20.0` seconds (HumanEval uses `5.0`; the extended harness is slower).\n\n### Compared to `humaneval`\nThe generation task is identical (same prompts, same function signatures, same extraction\nlogic), but `plus_tests` adds many edge cases (empty inputs, Unicode, large numbers, boundary\nconditions, etc.), so a completion that passes HumanEval can still fail here. The HumanEval →\nHumanEval+ drop measures a model's robustness to edge cases rather than raw code-generation ability.\n\n### Code extraction (shared with `humaneval`)\nThe `_extract_code` helper is copied verbatim from `humaneval.py` and handles:\n1. Markdown ```` ```python ``` ```` blocks (with or without the full `def`).\n2. Plain code responses starting with `def`, `from`, `import`, or leading indentation.\n3. Explanation text followed by a function definition.\n\nWhen the model rewrites the full `def`, the prompt's preamble (imports, helper defs above the\ntarget signature) is preserved so imports remain in scope. 
Stop sequences\n(`\\nclass `, `\\ndef `, `\\nif `, `\\nprint(`, `\\n#`, ```` \\n``` ````) truncate trailing junk.\n\n### Quickstart\nRun an evaluation with defaults:\n\n```bash\nprime eval run humanevalplus\n```\n\nRecommended `pass@10` sampling pattern used for the OTA Tulu-3 runs (temperature 0.7, 20 rollouts):\n\n```bash\nprime eval run humanevalplus -a '{\"timeout\": 30.0}' \\\n  --sampling-args '{\"temperature\": 0.7, \"n\": 20}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- `pass@k` is computed over the sampled rollouts by the outer eval harness, not the env itself.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `system_prompt` | str or `None` | `None` | System prompt shown to the model |\n| `timeout` | float | `20.0` | Hard timeout for each completion's test subprocess (seconds) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `code_correctness` | `1.0` if the extracted code passes every assertion in `plus_tests` (or the fallback `test`) within `timeout`, else `0.0` |\n\n### Reference results (pass@10, temp=0.7, n=20)\n\n| Model | pass@10 |\n| ----- | ------- |\n| `tulu3_ref` | 68.6% |\n| OTA specialist — `coding` | 62.6% |\n| OTA specialist — `precise_if` | 65.1% |\n| OTA specialist — `general` | 62.0% |\n| OTA specialist — `knowledge_recall` | 8.4% |\n| OTA specialist — `math_reasoning` | 8.6% |\n\nThe `tulu3_ref` result sits 12.8 points below the paper-reported 81.4% HumanEval+ score;\nthe gap reflects the harsher `plus_tests` suite relative to our sampling budget. The\nHumanEval → HumanEval+ drop (84.9 → 68.6 for `tulu3_ref`) is the quantitative signal for\nedge-case fragility that this environment is designed to surface.\n","encoding":"utf-8","truncated":false,"total_bytes":4559},"status":null}