{"data":{"kind":"file","path":"README.md","version_id":"drbprkl0bxkjt6xvekz4sd9q","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4590,"modified_at":"2026-03-18T18:42:33.209000","content_hash":"b2e5989fe439e084b0f4bb34dd9116fd334019cd2747956a6c6276460def68c8"},"entries":[],"content":"# medphysbench-dvh\n\n### Overview\n- **Environment ID**: `medphysbench-dvh`\n- **Short description**: DVH Treatment Plan Evaluation — clinical treatment planning assessment for radiation therapy\n- **Tags**: medical-physics, treatment-planning, dvh, single-turn, think, clinical\n\n### Datasets\n- **Primary dataset(s)**: 166 synthetic radiation therapy treatment plans with DVH statistics\n- **Source**: Programmatically generated using published RTOG/QUANTEC dose constraints\n- **Split sizes**: train (110), dev (28), test (28)\n- **Clinical sites**: Head & Neck (41), Prostate (40), Lung SBRT (42), Brain GBM (43)\n\n### Task\n\nModels review treatment plan DVH (Dose-Volume Histogram) statistics and perform clinical assessment across four task types:\n\n| Task Type | Count | Description | Answer Format |\n|-----------|-------|-------------|---------------|\n| `constraint_check` | 82 | Evaluate whether plan meets all dose constraints | `\\boxed{PASS}` or `\\boxed{FAIL}` |\n| `plan_approval` | 40 | Decide whether plan should be approved for treatment | `\\boxed{APPROVE}` or `\\boxed{REJECT}` |\n| `error_detection` | 25 | Identify a deliberate error in the plan | `\\boxed{ERROR: <description>}` |\n| `plan_comparison` | 19 | Choose the superior plan from two options | `\\boxed{PLAN A}` or `\\boxed{PLAN B}` |\n\n- **Type**: single-turn\n- **Parser**: DVHParser (extracts boxed verdicts with task-specific parsing)\n- **Rubric overview**: Task-dependent reward:\n  - Constraint check: 50% verdict correctness + 50% individual constraint identification\n  - Plan approval: Binary (1.0 correct, 0.0 incorrect)\n  - Error detection: 1.0 correct identification, 0.3 generic detection, 0.0 missed\n  - Plan comparison: Binary (1.0 correct, 0.0 incorrect)\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nprime env eval medphysbench-dvh\n```\n\nOr with vf-eval:\n\n```bash\nuv run vf-eval medphysbench-dvh\n```\n\nConfigure model and sampling:\n\n```bash\nprime env eval medphysbench-dvh -m gpt-4.1 -n 28 -r 4\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"train\"` | Dataset split: train, dev, or test |\n| `eval_split` | str | `\"test\"` | Evaluation split (set to `\"\"` to disable) |\n| `task_type` | str | `\"all\"` | Filter: all, constraint_check, plan_approval, error_detection, plan_comparison |\n| `site` | str | `\"all\"` | Filter: all, head_and_neck, prostate, lung_sbrt, brain_gbm |\n| `format_reward_weight` | float | `0.0` | Weight for format shaping reward (think tags + boxed answer) |\n| `fewshot_k` | int | `0` | Number of few-shot examples (0 = zero-shot, max 5) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Task-dependent correctness score (see rubric above) |\n| `val_reward` | Same metric computed on the eval split |\n\n### Clinical Sites and Constraints\n\n**Head & Neck** (70 Gy / 35 fx, IMRT): PTV70, PTV56, spinal cord, brainstem, parotids, mandible, oral cavity, larynx. Constraints from RTOG 0615/0920.\n\n**Prostate** (79.2 Gy / 44 fx, IMRT): PTV, rectum, bladder, femoral heads, penile bulb. Constraints from RTOG 0126/0815.\n\n**Lung SBRT** (50 Gy / 5 fx, SBRT): PTV, lungs-GTV, heart, esophagus, spinal cord, chest wall, great vessels, trachea/bronchus. Constraints from RTOG 0617/0813.\n\n**Brain GBM** (60 Gy / 30 fx, IMRT): PTV, brainstem, optic nerves, optic chiasm, cochlea, lens, retina, hippocampus. Constraints from RTOG 0825.\n\n### Error Types (error_detection tasks)\n\n| Error Type | Description |\n|------------|-------------|\n| `laterality_swap` | Left/right structure labels swapped |\n| `dose_scaling` | Values reported in wrong units (e.g., cGy vs Gy) |\n| `impossible_dvh` | Physically impossible DVH statistic (e.g., Dmean > Dmax) |\n| `wrong_structure` | Structure listed that doesn't belong to the treatment site |\n\n### Baseline Results (Test Split, n=28)\n\n| Model | Zero-shot | 3-shot |\n|-------|-----------|--------|\n| Qwen3-4B | **77.7%** | 71.5% |\n| SmolLM3-3B | 68.7% | **75.7%** |\n| Nemotron-7B | 56.8% | 58.4% |\n| OLMo-3-7B | 57.1% | 54.6% |\n| Llama-3.2-3B | 56.5% | 51.6% |\n\n### Prompt Format\n\nSystem prompt instructs models to act as a board-certified medical physicist reviewing a treatment plan:\n1. Analyze DVH statistics for each structure\n2. Compare against specified dose constraints\n3. Reason inside `<think>...</think>` tags\n4. Provide final answer in `\\boxed{...}` format\n\n### Citation\n\n```bibtex\n@misc{medphysbench_dvh2026,\n  title={MedPhysBench-DVH: A Benchmark for Treatment Plan Evaluation in Medical Physics},\n  author={MedPhysBench Team},\n  year={2026}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4590},"status":null}