{"data":{"kind":"file","path":"README.md","version_id":"bcp4q28iikg20dyiv4mno0vg","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6950,"modified_at":"2026-02-20T23:11:44.205000","content_hash":"ef882f1c660416f49cf8faa6285080c1f189f0649b17f6980f43c2adf5f868d9"},"entries":[],"content":"# MedRBench\n\n## Overview\n- **Environment ID**: `medrbench`\n- **Short description**: Medical reasoning benchmark for diagnosis and treatment planning on rare disease cases.\n- **Tags**: medical, diagnosis, treatment, rare-disease, llm-judge, single-turn, multi-turn, eval\n\n## Datasets\n- **Primary dataset**: MedRBench\n- **Source links**: [Paper](https://arxiv.org/abs/2402.09764), [GitHub](https://github.com/MAGIC-AI4Med/MedRBench)\n- **Split sizes**:\n\n| Split | Cases | Rare Disease Cases |\n|-------|-------|-------------------|\n| diagnosis | 957 | 491 |\n| treatment | 496 | 165 |\n\n## Task\n- **Type**: diagnosis supports `oracle`, `1turn`, and `free_turn`; treatment is `oracle` only\n- **System Prompt**: `\"You are a professional doctor\"` (matching original)\n- **Rubric overview**: MultiJudgeRubric (LLM-as-a-Judge evaluation using original MedRBench prompts)\n- **Evaluation metric**: Binary accuracy (Correct/Wrong)\n\n### Diagnosis Split (`outcome_accuracy`)\nThe model is given a clinical case summary and must provide the final diagnosis. The LLM judge uses the original MedRBench `acc_diagnose.txt` prompt to evaluate if the predicted diagnosis matches the ground truth, accounting for:\n- Disease aliases (e.g., \"Heart disease\" = \"Cardiac disease\")\n- Language variations (e.g., \"heart attack\" = \"myocardial infarction\")\n- Partial matches where additional complications are mentioned\n\n### Treatment Split (`treatment_final_accuracy`)\nThe model is given a clinical case summary and must provide a treatment recommendation. 
The LLM judge uses the original MedRBench `acc_treatment_plan.txt` prompt to evaluate if the predicted treatment is clinically appropriate, considering:\n- Semantic equivalence between predicted and ground truth treatments\n- Valid alternative treatment approaches\n- Additional care measures that don't contradict the main treatment\n\n## Quickstart\nRun an evaluation with default settings (all splits combined):\n\n```bash\nprime eval run medrbench -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model, split, and other options:\n\nSingle-judge example:\n\n```bash\nmedarc-eval medrbench -m \"openai/gpt-5-mini\" -n 50 --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` are all list-valued and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\nConfigured multi-judge example, with one change from defaults (`-n 50`):\n\n```bash\nmedarc-eval medrbench -m \"openai/gpt-5-mini\" -n 50 \\\n  --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash-preview\" \\\n  --patient-agent-model \"openai/gpt-5-mini\"\n```\n\nTreatment split only:\n\n```bash\nmedarc-eval medrbench -m \"openai/gpt-5-mini\" -n 50 --split treatment --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash-preview\"\n```\n\nRare disease cases only:\n\n```bash\nmedarc-eval medrbench -m \"openai/gpt-5-mini\" -n -1 --split diagnosis --rare-disease-only --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash-preview\"\n```\n\nFree-turn diagnosis mode (max 5 turns):\n\n```bash\nmedarc-eval medrbench -m \"openai/gpt-5-mini\" -n 50 --split diagnosis --task free_turn --max-turns 5 --judge-model \"openai/gpt-5-mini\" --judge-model 
\"google/gemini-3-flash-preview\"\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `split` | str | `all` | Dataset split: `diagnosis`, `treatment`, or `all` (default) |\n| `rare_disease_only` | bool | `False` | If True, only include cases with rare diseases |\n| `task` | str | `oracle` | Diagnosis mode: `oracle`, `1turn`, or `free_turn` (diagnosis only) |\n| `max_turns` | int | `5` | Max turns for `free_turn` diagnosis |\n| `judge_model` | str \\| list[str] | `gpt-5-mini` | Model identifier(s) for the LLM judge (original uses `gpt-4o-2024-11-20`) |\n| `judge_base_url` | str \\| list[str] | `None` | Custom API base URL(s) for judge model |\n| `judge_api_key` | str \\| list[str] | `None` | API key(s) for judge model. Falls back to `JUDGE_API_KEY` or `OPENAI_API_KEY` environment variables |\n| `patient_agent_model` | str | `gpt-5-mini` | Model identifier for the patient agent in multi-turn modes |\n| `patient_agent_base_url` | str | `None` | Custom API base URL for the patient agent (required) |\n| `patient_agent_api_key` | str | `None` | API key for the patient agent (required) |\n| `system_prompt` | str | `\"You are a professional doctor\"` | System prompt (matches original MedRBench) |\n\n## Dataset Size Notes\n\nThis environment evaluates against the full dataset for the selected `split`. 
Use `-n` in `medarc-eval` to subsample.\n\n## Notes\n\n- The `question` field contains the formatted clinical case with task instructions\n- The `answer` field contains the ground truth diagnosis or treatment plan (also available as `reference_response` in `info`)\n- Judge prompts are taken directly from MedRBench's original evaluation prompts (`acc_diagnose.txt` and `acc_treatment_plan.txt`)\n- Treatment web-search evidence from the original implementation is not used; the judge prompt's `[Additional Information]` section is left empty\n- Reward is binary: 1.0 for correct, 0.0 for incorrect (following original logic: `'correct' in evaluation_result.lower()`)\n- Treatment remains oracle-only; `task` applies to the diagnosis split only\n- Case metadata (`body_category`, `disorder_category`, `checked_rare_disease`) is available in `info` for analysis\n- Data is loaded directly from the MedRBench GitHub repository\n\n## Dataset Examples\n\n**Diagnosis Example:**\n```\nCase Summary:\n- Patient Information: 13-year-old male\n- Chief Complaint: Severe left eye pain\n- History of Present Illness: Eyelid edema, erythema, localized warmth...\n- Physical Examination: Febrile (39.5°C), signs of orbital inflammation\n- Laboratory Findings: Elevated CRP, leukocytosis, neutrophilia\n- Imaging: NCCT and NCMRI showing maxillary sinusitis and epidural empyema\n\nGround Truth Diagnosis:\nOrbital cellulitis secondary to acute sinusitis with epidural empyema\n```\n\n**Treatment Example:**\n```\nCase Summary:\n- Patient Information: 58-year-old man\n- Chief Complaint: Cough worsening over 1 week\n- History: Primary myelofibrosis with interstitial pneumonia\n- Laboratory: Anemia (Hb 81.0 g/L), elevated LDH\n\nGround Truth Treatment:\nJAK2 inhibitor therapy (ruxolitinib)\n```\n\n## References\n\n```bibtex\n@article{medrbench2024,\n  title={MedRBench: A Medical Reasoning Benchmark for Large Language Models},\n  
author={MAGIC-AI4Med},\n  journal={arXiv preprint arXiv:2402.09764},\n  year={2024}\n}\n```\n\n## Authors\nThis environment has been put together by:\n- [Hunar Batra](https://github.com/hunarbatra)\n- [Benjamin Warner](https://github.com/warner-benjamin)\n","encoding":"utf-8","truncated":false,"total_bytes":6950},"status":null}