{"data":{"kind":"file","path":"README.md","version_id":"bzjjvhj2nt054j3l0eaniprr","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4017,"modified_at":"2026-05-26T06:05:34.603000","content_hash":"be1359876a61409de84e4216700fadd97fce6b7ab7cd347965bb9a9c919461b9"},"entries":[],"content":"# PhysGym Arena Medley\n\n## Overview\n\n- **Environment name**: physgym-arena-medley\n- **Public Prime link**: https://app.primeintellect.ai/dashboard/environments/kishanpb/physgym-arena-medley-public\n- **Public install**: `prime env install kishanpb/physgym-arena-medley-public`\n- **Slice**: medium-hard Gym simulator repair\n- **Default dataset**: medium_hard\n- **Recommended eval**: 256K total completion tokens, 3 turns, prompt_context=auto\n\nPhysGym Arena Medley is the achievable curriculum slice. It checks whether a model can repair localized Gym 0.26 simulator/API regressions when the broken source is visible. It is intentionally easier than DR-hard, agentic-hard, and capability, and is useful as a positive-control leaderboard row.\n\n## Why Share This Slice\n\nMedley is the best first environment to share with Proximal, Eka, or similar eval/agent teams because it is not a dead benchmark. The current 256K-token, three-turn rows include both full solves and partial failures: GPT-5.5, Claude Opus 4.7, Qwen3.5-397B-A17B, and Kimi2.6 solve the five-row window, while Qwen3-Max solves 3/5 and Qwen3.6-35B-A3B solves 4/5.\n\nThe companion process artifact is `reports/autoresearch_progress.svg`. It is a benchmark-development trace, not a model-performance chart: 200 deterministic design experiments, 33 kept checkpoints, and explicit checks for validation coverage, score accounting, public/private boundary integrity, and release hygiene.\n\n## What Reward Means\n\nA reward of **1.0** is a full executable pass: public tests, hidden metamorphic checks, reference rollout signatures, API checks, and patch hygiene all passed. The pass threshold remains **0.6**.\n\nThe v0.3 scorer is more granular than earlier releases. Public and hidden test groups are fractional when there are multiple checks, and reference rollout credit is based on leaf-level signature agreement rather than a single all-or-nothing comparison.\n\n## Prime Protocol\n\n    prime eval run physgym_arena_medley \\\n      -p prime -m openai/gpt-5.5 \\\n      -n 50 -r 1 -c 1 -t 256000 -T 0.2 \\\n      -a '{\"dataset\":\"medium_hard\",\"prompt_context\":\"auto\",\"turn_budget\":3,\"total_completion_token_budget\":256000}'\n\nProvider errors, runner errors, and API no-run rows should be excluded from pass-rate denominators.\n\n## Current Leaderboard Rows\n\nThese rows use 256K total completion tokens, three revision turns, temperature 0.2, and the v0.3 granular scorer. Provider/no-run rows are excluded.\n\n| Model | Eval ID | Scored rows | Pass@0.6 | Mean reward | Reward vector | Trace note |\n| --- | --- | ---: | ---: | ---: | --- | --- |\n| GPT-5.5 | pezi2ugm6lf7hch3cledknbx | 5 | 5/5 | 1.0000 | 1, 1, 1, 1, 1 | Full-pass positive-control row; diff sizes 1697-2282 chars, 6-6 hunks, hashes 0e80e6c10c, 6607e3633a, feabe6c33d... |\n| Claude Opus 4.7 | okt0mxfrwgnlo4qht2qlk0qd | 5 | 5/5 | 1.0000 | 1, 1, 1, 1, 1 | Full-pass positive-control row; diff sizes 1664-2252 chars, 6-6 hunks, hashes 72a34468bf, aa56f8f45a, 02602ffe20... |\n| Qwen3-Max | du4r8rmdbkroc71yg9b6wnrb | 5 | 3/5 | 0.8043 | 1, 1, 0.5263, 1, 0.495 | Partial medley row; reward vector 1, 1, 0.5263, 1, 0.495 and hashes 3bed5766c0, 4b07354641, ab66f2c8ff. |\n| Qwen3.6-35B-A3B | eiig3vj2mx3k6234bzjjshlc | 5 | 4/5 | 0.9188 | 1, 1, 0.5939, 1, 1 | Partial medley row; reward vector 1, 1, 0.5939, 1, 1 and hashes 3f2bf8e3c6, 0cd08b5405, ee16ea9572. |\n| Qwen3.5-397B-A17B | buy064rf5fdnniwdp38jcpfw | 5 | 5/5 | 1.0000 | 1, 1, 1, 1, 1 | Full-pass positive-control row; diff sizes 1661-2279 chars, 6-6 hunks, hashes 12e0290c02, ae2b4cb219, 701729e281... |\n| Kimi2.6 | ln643p2rsfz80f14a68d5tyo | 5 | 5/5 | 1.0000 | 1, 1, 1, 1, 1 | Full-pass positive-control row; diff sizes 1472-2173 chars, 6-6 hunks, hashes 561ab361ae, 086c8059d2, 071cc70df8... |\n\n## Local Checks\n\n    python -m pytest -q\n    python scripts/check_release_boundary.py --root .\n\n## Included Evidence\n\n- reports/benchmark_card.json\n- reports/prime_upload_leaderboard_v03.md\n- reports/model_trace_summary_256k.md\n- reports/autoresearch_progress.svg\n","encoding":"utf-8","truncated":false,"total_bytes":4017},"status":null}