{"data":{"kind":"file","path":"README.md","version_id":"e9udyz3qq0be8ju7rq565bj4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4345,"modified_at":"2026-02-20T23:11:44.213000","content_hash":"054dcf79a634015d303fa1d06c3146f5516729e71645205950fdd80bf5d238d5"},"entries":[],"content":"# MTSamples Replicate Benchmark\n\n## Overview\n- **Environment ID**: `mtsamples_replicate`\n- **Short description**: Given patient notes with the PLAN section removed, generate a reasonable treatment plan. Evaluation adapted from HELM's MTSamples Replicate scenario.\n- **Tags**: medical, clinical, single-turn, summarization, llm-judge, eval\n\n## Datasets\n- **Primary dataset**: MTSamples medical transcription (processed)\n- **Source links**: [GitHub](https://github.com/raulista1997/benchmarkdata/tree/main/mtsamples_processed), [HELM scenario](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/scenarios/mtsamples_replicate_scenario.py)\n- **Split sizes**: Evaluation only (no predefined train/test splits)\n\n## Task\n- **Type**: Single-Turn\n- **Rubric overview**: MultiJudgeRubric (LLM-as-a-Judge evaluation adapted from HELM's MTSamples Replicate annotator)\n- **Task description**: Given patient notes with the **PLAN section removed** (while preserving SUMMARY and FINDINGS), generate a reasonable treatment plan.\n- **Prompt**: \"Here are information about a patient, return a reasonable treatment plan for the patient.\"\n- **Evaluation dimensions**:\n  - **Accuracy** (1–5): Does the response provide clinically appropriate and correct treatment guidance?\n  - **Completeness** (1–5): Does the response cover the key aspects of care implied by the note?\n  - **Clarity** (1–5): Is the response clearly written and well structured for clinical readability?\n\n## Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nprime eval run mtsamples_replicate -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nRun a single-judge 
evaluation:\n\n```bash\nmedarc-eval mtsamples_replicate -m \"openai/gpt-5-mini\" -n 5 --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge evaluation via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` all accept lists and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\nRun a multi-judge evaluation by passing `--judge-model` once per judge:\n\n```bash\nmedarc-eval mtsamples_replicate -m \"openai/gpt-5-mini\" --judge-model \"openai/gpt-5-mini\" --judge-model \"x-ai/grok-4.1-fast\"\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `cache_dir` | str \\| Path \\| None | `~/.cache/medarc/mtsamples_replicate` | Local directory to cache downloaded datasets. |\n| `use_think` | bool | `False` | Whether to use chain-of-thought prompting with `<think>...</think>` tags. |\n| `judge_model` | str \\| list[str] | `\"gpt-4o-mini\"` | Model(s) used by the LLM judge. |\n| `judge_base_url` | str \\| list[str] \\| None | `None` | Custom API base URL(s) for the judge model(s). |\n| `judge_api_key` | str \\| list[str] \\| None | `None` | API key(s) for the judge model(s). |\n\n## Data Processing (HELM-aligned)\n\nFollowing **HELM's MTSamples Replicate approach**:\n\n1. **Section Extraction**: Extracts the first line following `PLAN:`, `SUMMARY:`, or `FINDINGS:` (priority order: `PLAN > SUMMARY > FINDINGS`).\n2. **Input Cleaning**: Removes **only the `PLAN:` section** from the input text; all other sections (e.g., SUMMARY, FINDINGS, IMPRESSION) are preserved as clinical context.\n3. 
**Reference**: The extracted section content is used as the gold reference answer.\n\n## Dataset Example\n\n**1-year-old Exam – H&P**\n\n**Input (PLAN removed):**\n```\nMedical Specialty: Pediatrics - Neonatal\nSample Name: 1-year-old Exam - H&P\nDescription: Health maintenance exam for 1-year-old female.\n...\nIMPRESSION:\nRoutine well child care. Acute conjunctivitis.\n```\n\n**Reference Answer (PLAN):**\n```\nDiagnostic & Lab Orders: Ordered blood lead.\n```\n\n## Notes\n\n- The `question` field contains the full patient note with the **PLAN section removed**\n- The `answer` field contains the **first line** after the selected section header\n- SUMMARY and FINDINGS may remain in the input, consistent with HELM's Replicate benchmark\n- Scores are normalized to `[0, 1]` by averaging normalized dimension scores\n\n## References\n\n```bibtex\n@misc{helm2023,\n  title={Holistic Evaluation of Language Models},\n  author={Liang, Percy and Bommasani, Rishi and Lee, Tony and others},\n  year={2023},\n  url={https://github.com/stanford-crfm/helm}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4345},"status":null}