{"data":{"kind":"file","path":"README.md","version_id":"bkcc1habi993xos0msl3uopo","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":13066,"modified_at":"2026-05-30T04:26:34.344000","content_hash":"3b472498b0ccd91bbedf4c481d97b665097ca3de7e9f92231ec0b4f2765e7635"},"entries":[],"content":"# PatchRecoveryGym For Laguna\n\nPatchRecoveryGym measures whether agentic coding models recover from a wrong\nfirst patch. Each task is a tuple of `(problem, wrong_first_attempt,\nhidden_tests)`. The model sees a realistic failed agent transcript and must\nproduce a corrected second attempt as a unified diff.\n\nThe hackathon v0 artifact targets Poolside Laguna XS.2 with a six-task\nUpgradeGym-derived dependency-migration split, hidden executable tests, a\npublic Prime/verifiers environment, and reusable pass@k/reranking artifacts.\n\nThe package-internal Python module and CLI remain named `recoverybench` for\ncompatibility with the existing source tree. The public artifact, Prime\nenvironment, and judge-facing docs use the less ambiguous name\nPatchRecoveryGym.\n\n## Methodology\n\nThe current public split comes from local UpgradeGym-derived dependency\nmigration tasks. Dataset construction records the failed first patch, failed\ntest output, hidden-test patch hash, reference patch hash, and task provenance.\n\nThe older generator can still be used for a planned SWE-bench Verified v1 split.\nThat path samples instances with a deterministic seed and stratifies by repo,\nbut it is not the hackathon headline artifact.\n\nWrong first attempts are grouped into three buckets:\n\n| Bucket | Source | Target count |\n| --- | --- | ---: |\n| A | Laguna XS.2 sampled attempts that apply but fail hidden tests | 50 |\n| B | Deterministic programmatic mutants of gold patches | 50 |\n| C | Smaller cross-model attempts via OpenRouter | 50 |\n\nEvery accepted wrong attempt must apply cleanly and fail hidden tests. Syntactic rejects\nare excluded. Error labels use the fixed taxonomy in `recoverybench.data.schema`; bucket B\nlabels are assigned by the mutation taxonomy, while generated buckets use the classifier in\n`recoverybench.generation.error_types`.\n\n## Install\n\n```bash\nuv sync --python 3.11 --extra dev\n```\n\nFor full SWE-bench Docker evaluation and Prime/verifiers integration:\n\n```bash\nuv sync --python 3.11 --extra dev --extra swebench --extra prime\n```\n\nTo render prompts with Laguna's native Hugging Face chat template, include the optional\ntokenizer extra:\n\n```bash\nuv sync --python 3.11 --extra tokenizer\n```\n\n## Required Environment Variables\n\nAPI keys are read only from environment variables:\n\n```bash\nexport POOLSIDE_API_KEY=...\nexport OPENROUTER_API_KEY=...\nexport ANTHROPIC_API_KEY=...\n```\n\nOptional overrides:\n\n```bash\nexport POOLSIDE_BASE_URL=https://api.poolside.ai/v1\nexport RECOVERYBENCH_LAGUNA_BASE_URL=http://127.0.0.1:11434/v1\nexport RECOVERYBENCH_LAGUNA_API_KEY=ollama\nexport RECOVERYBENCH_LAGUNA_MODEL=laguna-xs.2\nexport LAGUNA_OPENROUTER_MODEL=poolside/laguna-xs-2\nexport RECOVERYBENCH_CEILING_MODEL=anthropic/claude-sonnet-4\nexport RECOVERYBENCH_CEILING_PROVIDER=openrouter\nexport RECOVERYBENCH_PEER_MODEL=qwen/qwen-2.5-coder-32b-instruct\nexport RECOVERYBENCH_CROSS_MODELS=\\\nqwen/qwen-2.5-coder-7b-instruct,\\\ndeepseek/deepseek-coder-6.7b-instruct,\\\nmeta-llama/llama-3.1-8b-instruct\nexport RECOVERYBENCH_JUDGE_MODEL=openai/gpt-4o-mini\nexport RECOVERYBENCH_CACHE=~/.cache/recoverybench\n```\n\nBucket A generation prefers `RECOVERYBENCH_LAGUNA_BASE_URL` when set, which lets\nyou point at a RunPod/Ollama or other OpenAI-compatible Laguna endpoint. Without\nthat override it uses `POOLSIDE_API_KEY`; if that is absent and\n`OPENROUTER_API_KEY` is present, it uses the OpenRouter Laguna model id above.\nThe baseline script uses the same Laguna routing and reads the optional\nreference-model overrides above. To use Anthropic directly for the ceiling model, set\n`RECOVERYBENCH_CEILING_PROVIDER=anthropic` and use a native Claude model id such as\n`RECOVERYBENCH_CEILING_MODEL=claude-sonnet-4-20250514`; this path reads\n`ANTHROPIC_API_KEY`. Bucket C uses `RECOVERYBENCH_CROSS_MODELS` when set; provide a\ncomma-separated list of OpenRouter model ids. Generated Bucket A/C attempts are\nclassified by `RECOVERYBENCH_JUDGE_MODEL` through OpenRouter; the default is\n`openai/gpt-4o-mini`.\n\n## Reproduce\n\nThe intended three-command path is:\n\n```bash\nuv sync --python 3.11 --extra swebench --extra prime\nuv run recoverybench generate --bucket all --n 50 --seed 42 --yes\nuv run recoverybench eval --model poolside/Laguna-XS.2 --split data/recoverybench-150.jsonl --workers 4 --timeout-seconds 600 --seed 42\n```\n\nThe `eval` command resolves `poolside/Laguna-XS.2` to the Poolside endpoint when\n`POOLSIDE_API_KEY` is set, and falls back to the OpenRouter Laguna model when only\n`OPENROUTER_API_KEY` is available. The generator prints an estimated API cost before\nrunning; omit `--yes` if you want the confirmation prompt for runs estimated over $20.\n\nAfter generation, validate the split before spending evaluation time:\n\n```bash\nuv run recoverybench validate --split data/recoverybench-150.jsonl --bucket all --n 50\n```\n\nValidation requires the upstream SWE-bench row metadata needed for the wrapper to make\n`hidden_test_patch` authoritative. For local synthetic task files only, add\n`--allow-missing-swebench-metadata`.\n\nThen build the report:\n\n```bash\nuv run recoverybench report \\\n  --baseline-results results/baseline \\\n  --expected-models poolside/Laguna-XS.2,anthropic/claude-sonnet-4,qwen/qwen-2.5-coder-32b-instruct \\\n  --expected-tasks-per-model 150\n```\n\nFor the full three-model baseline run, use:\n\n```bash\nuv run python scripts/run_baseline.py \\\n  --data-path data/recoverybench-150.jsonl \\\n  --output-dir results/baseline \\\n  --swebench-success-path results/original_swebench_success.json\n```\n\nThe optional SWE-bench success file may be JSON, JSONL, or CSV keyed by `task_id` or\n`instance_id`, with a boolean-like `swebench_success`, `resolved`, `passed`, or\n`success` field. Native SWE-bench run reports with `resolved_ids` and `unresolved_ids`\nare also accepted. For model-specific confusion matrices, pass either a JSON object\nkeyed by model id whose values are success maps/reports, or a directory containing\nper-model `.json`, `.jsonl`, or `.csv` files. Directory filenames use `__` for `/` in\nmodel ids, for example `qwen__qwen-2.5-coder-32b-instruct.json`.\n\nThe generator and evaluator are resumable. Accepted dataset records are saved after each\ntask, and evaluation results are cached by `(task_id, patch_hash)` under\n`~/.cache/recoverybench` unless `RECOVERYBENCH_CACHE` points elsewhere.\n\nUse `--cache-dir <path>` to override the cache location for `generate` or `eval`, and\n`--no-cache` to disable cache reads/writes for a run. Use `--timeout-seconds` on\n`eval` or `scripts/run_baseline.py` to adjust the per-task hidden-test timeout.\n\n## Hackathon v0 Local Split\n\nFor an end-to-end path that does not require provider keys or full SWE-bench Docker\ngeneration, build the local UpgradeGym-derived split:\n\n```bash\nuv run python scripts/generate_upgradegym_split.py\nuv run python scripts/validate_upgradegym_split.py\nuv run python scripts/run_reference_ceiling.py\n```\n\nThis writes:\n\n- `data/recoverybench-upgradegym-6.jsonl`\n- `data/recoverybench-upgradegym-6.manifest.json`\n- `data/recoverybench-upgradegym-6.validation.json`\n- `data/recoverybench-upgradegym-6.validation.md`\n- `results/baseline/reference-solution.jsonl`\n\nEach task records a local `hidden_test_patch`, a hash for that patch, and the reference\nsolution patch hash. This split is a complete hackathon-v0 PatchRecoveryGym artifact,\nnot the planned 150-task SWE-bench Verified v1 split.\n\n## Prime / Verifiers\n\nThe environment loader is `recoverybench.verifiers_env.environment:load_environment` and\nis also exported as `recoverybench.load_environment`.\n\nInstall the public environment through Prime:\n\n```bash\nprime env install kannappan/patchrecoverygym-laguna\n```\n\nFrom this lab workspace, the local source environment can also be installed\nunder the historical local name:\n\n```bash\nprime env install recoverybench\n```\n\nRun the checked-in one-example smoke config:\n\n```bash\nprime eval run configs/eval/recoverybench_smoke.toml\n```\n\nOr run the public environment directly without the TOML:\n\n```bash\nprime eval run kannappan/patchrecoverygym-laguna \\\n  -m prime-qwen3-5-4b \\\n  -n 1 \\\n  -r 1 \\\n  --timeout 900 \\\n  -a '{\"max_turns\": 1, \"workers\": 1, \"timeout_seconds\": 120, \"use_cache\": false}'\n```\n\nThe default split is the packaged six-task `recoverybench-upgradegym-6` local split.\nIt does not require SWE-bench Docker or provider keys to load. Model inference still\nrequires the endpoint credentials in `../configs/endpoints.toml`, for example\n`PRIME_API_KEY` for the default smoke config.\n\nThe older direct `verifiers` CLI path also works from this package directory:\n\n```bash\nuv run vf-eval recoverybench --model poolside/Laguna-XS.2 --num-examples 150\n```\n\nThe environment uses a `ToolEnv` with `apply_patch` and `run_tests` tool names matching\nthe transcript. The default `reward_mode=\"hidden\"` reward is binary and delegates to the\nRecoveryBench harness. For hosted RL or curriculum runs, `reward_mode=\"shaped\"` adds\nintermediate signal while keeping hidden tests withheld: `0.0` for no extractable patch,\n`0.2` for a unified diff that fails to apply, `0.5` for an applied patch that still fails\nhidden tests, and `1.0` for a hidden-test pass. For training-only curricula,\n`reward_mode=\"reference_shaped\"` adds deterministic patch-line similarity to the\npackaged reference solution. Failed-apply patches can receive only a small\nsimilarity bonus, applied-but-failing patches can receive a larger capped bonus,\nand hidden-test success remains the only `1.0` outcome.\nRuntime controls can be passed through `--env-args`, including `workers`,\n`timeout_seconds`, `cache_dir`, `use_cache`, `prompt_variant`, `reward_mode`,\n`repair_failed_patch`, and `repair_min_similarity`.\n\nSet `repair_failed_patch=true` to score through the deterministic apply-aware\nline-repair path. This is opt-in: default scoring remains the raw model patch.\nWhen enabled, the environment keeps patches that pass `git apply --check`\nunchanged; for failed-apply patches on local task fixtures, it infers exact\nline replacements from the candidate diff and scores the regenerated clean diff.\n\nFor example, run the strict prompt with shaped reward:\n\n```bash\nprime eval run kannappan/patchrecoverygym-laguna \\\n  -m poolside/laguna-xs.2 \\\n  -n 6 \\\n  -r 1 \\\n  -a '{\"reward_mode\": \"shaped\", \"prompt_variant\": \"strict_diff\", \"reasoning_enabled\": false, \"workers\": 1, \"timeout_seconds\": 120, \"use_cache\": false}' \\\n  --hosted\n```\n\nTo evaluate with the reasoning-disabled transcript variant:\n\n```bash\nuv run vf-eval recoverybench \\\n  --model poolside/Laguna-XS.2 \\\n  --num-examples 150 \\\n  --env-args '{\"reasoning_enabled\": false}'\n```\n\nFor a small uncached smoke with a shorter hidden-test timeout:\n\n```bash\nuv run vf-eval recoverybench \\\n  --model poolside/Laguna-XS.2 \\\n  --num-examples 5 \\\n  --env-args '{\"workers\": 1, \"timeout_seconds\": 300, \"use_cache\": false}'\n```\n\nSummarize hosted Prime eval samples into pass@k tables:\n\n```bash\nprime --plain eval get <eval-id> --output json > /tmp/eval_get.json\nprime --plain eval samples <eval-id> --output json > /tmp/eval_samples.json\n\nuv run python scripts/summarize_prime_eval_passk.py \\\n  /tmp/eval_samples.json \\\n  --eval-json /tmp/eval_get.json \\\n  --environment-label owner/env@version \\\n  --json-output /tmp/eval_passk_summary.json \\\n  --markdown-output /tmp/eval_passk_summary.md\n```\n\nThe pass@k summarizer uses the standard unbiased estimator per example and is\nintended for hosted best-of-k, retry-policy, and verifier-reranking analysis.\n\nAnalyze whether simple trace-only selectors can convert best-of-k samples into\nsingle chosen patches:\n\n```bash\nuv run python scripts/analyze_prime_eval_reranking.py \\\n  /tmp/eval_samples.json \\\n  --eval-json /tmp/eval_get.json \\\n  --json-output /tmp/eval_reranking_diagnostics.json \\\n  --markdown-output /tmp/eval_reranking_diagnostics.md\n```\n\n## Report Artifacts\n\n`scripts/build_report.py` writes:\n\n- `report/recoverybench_baseline.md`\n- `report/figures/recovery_by_bucket.png`\n- `report/figures/recovery_by_errortype.png`\n- `report/figures/recovery_vs_swebench.png`\n- `report/tables/headline_numbers.md`\n\nThe plotting stack is matplotlib only.\n\n## Dataset Card\n\nExpected v1 split:\n\n| Field | Value |\n| --- | --- |\n| Size | 150 tasks |\n| Buckets | 50 A, 50 B, 50 C |\n| Source | SWE-bench Verified |\n| License | Dataset metadata follows upstream task licenses; PatchRecoveryGym code is project-local |\n| Prompt | Realistic failed agent transcript with wrong patch and failing test output |\n| Score | Binary hidden-test pass on second attempt |\n\nError-type distribution is generated after dataset construction:\n\n```bash\nuv run python - <<'PY'\nfrom collections import Counter\nfrom recoverybench.data.loader import load_recovery_tasks\ntasks = load_recovery_tasks(\"data/recoverybench-150.jsonl\")\nprint(Counter(task.error_type.value for task in tasks))\nPY\n```\n\n## Citations\n\n- SWE-bench: Jimenez et al., \"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?\"\n- SWE-bench Verified dataset: `princeton-nlp/SWE-bench_Verified`\n- Laguna XS.2 model card: `poolside/Laguna-XS.2`\n- Verifiers library: `PrimeIntellect-ai/verifiers`\n","encoding":"utf-8","truncated":false,"total_bytes":13066},"status":null}