{"data":{"kind":"file","path":"README.md","version_id":"kg4ofv9z084thgtbf7p7kk6g","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4172,"modified_at":"2026-05-26T06:05:41.524000","content_hash":"c5a4d70da256b8b337a763a98c1718c308aeb09d0549c7ce334fbbf32f9157b9"},"entries":[],"content":"# PhysGym Arena DR-Hard\n\n## Overview\n\n- **Environment name**: physgym-arena-drhard\n- **Public Prime link**: https://app.primeintellect.ai/dashboard/environments/kishanpb/physgym-arena-drhard-public\n- **Public install**: `prime env install kishanpb/physgym-arena-drhard-public`\n- **Slice**: domain-randomized Gym simulator repair\n- **Default dataset**: dr_hard\n- **Recommended eval**: 256K total completion tokens, 3 turns, prompt_context=auto\n\nPhysGym Arena DR-Hard tests whether a model can repair simulator dynamics that must remain correct under unseen physical parameters. Public tests are intentionally weak; hidden randomized contracts and rollout signatures carry most of the signal.\n\n## Why Share This Slice\n\nDR-Hard is the best first stress environment to share with Proximal, Eka, or similar eval/agent teams. It has a clean failure mechanism: hidden checks vary simulator parameters, so restoring nominal constants is not enough. The current 256K-token, three-turn leaderboard window has five valid frontier/Qwen rows and 0/4 full solves for every row, while still giving partial public/API/hygiene credit that makes the failure interpretable.\n\nThe companion process artifact is `reports/autoresearch_progress.svg`. It is a benchmark-development trace, not a model-performance chart: 200 deterministic design experiments, 33 kept checkpoints, and explicit checks for validation coverage, score accounting, public/private boundary integrity, and release hygiene.\n\n## What Reward Means\n\nA reward of **1.0** is a full executable pass. The pass threshold remains **0.6**.\n\nThe v0.3 scorer is granular: public and hidden checks can contribute fractional credit, and reference rollout credit is based on leaf-level signature agreement. This avoids uninformative flat 0.25 rows when a model repairs one environment family but misses the full domain-randomized contract.\n\n## Prime Protocol\n\n    prime eval run physgym_arena_drhard \\\n      -p prime -m openai/gpt-5.5 \\\n      -n 50 -r 1 -c 1 -t 256000 -T 0.2 \\\n      -a '{\"dataset\":\"dr_hard\",\"prompt_context\":\"auto\",\"turn_budget\":3,\"total_completion_token_budget\":256000}'\n\nProvider errors, runner errors, and API no-run rows should be excluded from pass-rate denominators.\n\n## Current Leaderboard Rows\n\nThese rows use 256K total completion tokens, three revision turns, temperature 0.2, and the v0.3 granular scorer. Provider/no-run rows are excluded.\n\n| Model | Eval ID | Scored rows | Pass@0.6 | Mean reward | Reward vector | Trace note |\n| --- | --- | ---: | ---: | ---: | --- | --- |\n| GPT-5.5 | s2yevuf22fkul27ipjccv400 | 4 | 0/4 | 0.4532 | 0.4532, 0.4532, 0.4532, 0.4532 | All rows below pass threshold; scalar rewards collapse hidden-contract misses, while traces differ: 9675-13178 chars, 16-30 hunks, hashes 63b5c1eaee, cc3ac2b760, f711a341c5... |\n| Claude Opus 4.7 | hbo5q61u1gngpu65baom0fcx | 4 | 0/4 | 0.4532 | 0.4532, 0.4532, 0.4532, 0.4532 | All rows below pass threshold; scalar rewards collapse hidden-contract misses, while traces differ: 11466-14129 chars, 11-48 hunks, hashes c5953fd8b4, 775127f01f, 1f6325b5ae... |\n| Qwen3-Max | tisgbo7d1fgb88nqi0ho4r9x | 4 | 0/4 | 0.4521 | 0.4906, 0.45, 0.4147, 0.4532 | All rows below pass threshold; scalar rewards collapse hidden-contract misses, while traces differ: 8779-13259 chars, 30-44 hunks, hashes 631424c465, 5d3bb5e811, 426b6b2b29... |\n| Qwen3.6-35B-A3B | mmssqp3zkdova75t98wf9hsq | 4 | 0/4 | 0.4521 | 0.4906, 0.45, 0.4147, 0.4532 | All rows below pass threshold; scalar rewards collapse hidden-contract misses, while traces differ: 9877-15374 chars, 18-28 hunks, hashes 1f45857912, db102902a3, 8ca398b77a... |\n| Qwen3.5-397B-A17B | lxjtfhyt8ddw1kkwkr1papgo | 4 | 0/4 | 0.4521 | 0.4906, 0.45, 0.4147, 0.4532 | All rows below pass threshold; scalar rewards collapse hidden-contract misses, while traces differ: 2936-7989 chars, 12-24 hunks, hashes 2f2c883c2d, 496b3bdb65, 40c0b3f419... |\n\n## Local Checks\n\n    python -m pytest -q\n    python scripts/check_release_boundary.py --root .\n\n## Included Evidence\n\n- reports/benchmark_card.json\n- reports/prime_upload_leaderboard_v03.md\n- reports/model_trace_summary_256k.md\n- reports/autoresearch_progress.svg\n","encoding":"utf-8","truncated":false,"total_bytes":4172},"status":null}