{"data":{"kind":"file","path":"README.md","version_id":"amh88astqycrh3rczmud0bn6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2310,"modified_at":"2026-01-24T21:33:56.874000","content_hash":"e35128318fe2dc157ce538560fdcb71fe9f4550ca32e52a143d8201642a1efd2"},"entries":[],"content":"# medphysbench-kq\n\n### Overview\n- **Environment ID**: `medphysbench-kq`\n- **Short description**: Medical Physics Knowledge Benchmark - MCQ evaluation for radiation oncology, dosimetry, imaging physics, and radiation protection\n- **Tags**: medical-physics, radiation-oncology, mcq, single-turn, think, boxed-answer\n\n### Datasets\n- **Primary dataset(s)**: MedPhysBench v0.2.0 - 412 medical physics MCQs\n- **Source**: Original curation from medical physics board exam materials\n- **Split sizes**: train (257), dev (40), test (115)\n\n### Task\n- **Type**: single-turn\n- **Parser**: ThinkParser (extracts `\\boxed{A-E}` answers)\n- **Rubric overview**: Binary correctness reward (1.0 correct, 0.0 incorrect)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime env eval medphysbench-kq\n```\n\nOr with vf-eval:\n\n```bash\nuv run vf-eval medphysbench-kq\n```\n\nConfigure model and sampling:\n\n```bash\nprime env eval medphysbench-kq -m gpt-4.1 -n 90 -r 1\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"test\"` | Dataset split: train, dev, or test |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | 1.0 if answer matches ground truth, 0.0 otherwise |\n| `accuracy` | Same as reward (exact match on A-E answer) |\n\n### Topics Covered\n\n- Fundamental Physics (11.6%)\n- Treatment Planning (10.0%)\n- Imaging Fundamentals (9.4%)\n- Megavoltage Beams (8.8%)\n- Dosimetry (8.8%)\n- Beam Characteristics (8.8%)\n- Radiation Protection (8.1%)\n- X-ray Production (7.5%)\n- Brachytherapy (6.9%)\n- And 12 more specialized topics\n\n### Baseline Results (Test Split, n=90)\n\n| Model | Accuracy | 95% CI |\n|-------|----------|--------|\n| Claude Opus 4.5 | 98.9% | 94.0-99.8 |\n| GPT-4o | 91.1% | 83.4-95.4 |\n| Qwen 2.5 72B | 87.8% | 79.4-93.0 |\n| Llama 3.3 70B (CoT) | 87.8% | — |\n| Mixtral 8x22B | 85.6% | 76.8-91.4 |\n| Llama 3.3 70B | 82.2% | 73.1-88.8 |\n\n### Prompt Format\n\nSystem prompt instructs models to:\n1. Reason step-by-step inside `<think>...</think>` tags\n2. Provide final answer in `\\boxed{X}` format where X is A, B, C, D, or E\n\n### Citation\n\n```bibtex\n@misc{medphysbench2025,\n  title={MedPhysBench: A Benchmark for Medical Physics Knowledge in Large Language Models},\n  author={MedPhysBench Team},\n  year={2025}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":2310},"status":null}