{"data":{"kind":"file","path":"README.md","version_id":"wdtm8x14c3chkggujbjz5s3k","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6669,"modified_at":"2026-02-20T23:11:44.213000","content_hash":"fc10ec30ea5b4a1c2f9474eb544e7a0f111f9a0aa5fcab86d898461ae76d9b3c"},"entries":[],"content":"# MTSamples Procedures\n\n## Overview\n- **Environment ID**: `mtsamples_procedures`\n- **Short description**: MTSamples Procedures is a benchmark composed of transcribed operative notes, focused on documenting surgical procedures. Each example presents a brief patient case involving a surgical intervention, and the model is tasked with generating a coherent and clinically accurate procedural summary or treatment plan.\n## Dataset\n- **Split sizes**:\n  - Evaluation: ~90 examples (all used for evaluation)\n  - Note: This is an evaluation-only benchmark with no predefined train/test splits\n- **Source**:\n  - [MTSamples medical transcription repository](https://github.com/raulista1997/benchmarkdata/tree/main/mtsample_procedure)\n  - Implementation based on [HELM's MTSamples Procedures scenario](https://github.com/stanford-crfm/helm/blob/51c3389f3820b940cca2fcb759dfe8f0b0160f46/src/helm/benchmark/scenarios/mtsamples_procedures_scenario.py)\n\n## Task\n- **Type**: Single-Turn\n- **Rubric overview**: MultiJudgeRubric (LLM-as-a-Judge evaluation adapted from HELM's MTSamples Procedures Annotator)\n- **Task description**: Given patient notes (procedure note with PLAN/SUMMARY/FINDINGS sections removed), generate a reasonable treatment plan\n- **Prompt**: \"Here are information about a patient, return a reasonable treatment plan for the patient.\"\n- **Evaluation dimensions**:\n  - **Accuracy** (1-5): Does the response provide correct clinical advice that follows established clinical guidelines?\n  - **Completeness** (1-5): Does the response include all important aspects of patient care mentioned in the reference?\n  - **Clarity** (1-5): Is the response written clearly 
and organized in a way that is easy to read for clinicians?\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run mtsamples_procedures -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nRun a single-judge evaluation:\n\n```bash\nmedarc-eval mtsamples_procedures -m \"openai/gpt-5-mini\" -n 5 --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge evaluation via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` are all list-valued and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\nRun a multi-judge evaluation by repeating `--judge-model`:\n\n```bash\nmedarc-eval mtsamples_procedures -m \"openai/gpt-5-mini\" --judge-model \"openai/gpt-5-mini\" --judge-model \"x-ai/grok-4.1-fast\"\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `cache_dir` | str \\| Path \\| None | `~/.cache/medarc/mtsamples_procedures` | Local directory to cache downloaded datasets. Can also be set via the `MTSAMPLES_PROCEDURES_CACHE_DIR` environment variable. |\n| `use_think` | bool | `False` | Whether to use chain-of-thought prompting with `<think>...</think>` tags |\n| `judge_model` | str \\| list[str] | `\"gpt-4o-mini\"` | Model identifier(s) for the LLM judge evaluating procedural plans |\n| `judge_base_url` | str \\| list[str] \\| None | `None` | Custom API base URL(s) for the judge model(s) (defaults to OpenAI API) |\n| `judge_api_key` | str \\| list[str] \\| None | `None` | API key(s) for the judge model(s). 
Falls back to `JUDGE_API_KEY` environment variable if not provided |\n\n## Results Dataset Structure\n### Core Evaluation Fields\n\n- **`prompt`** - The patient notes presented to the model (list of message objects with `role` and `content`)\n- **`completion`** - The model's generated treatment plan (list of message objects)\n- **`reward`** - Overall score from 0.0 to 1.0, calculated as the average of normalized dimension scores: `(accuracy/5 + completeness/5 + clarity/5) / 3`\n\n### Example Metadata (`info`)\nContains all the MTSamples-specific information about each procedure:\n\n- **`filename`** - Original filename from the GitHub repository\n- **`extracted_section`** - Which section was used as reference (\"PLAN\", \"SUMMARY\", or \"FINDINGS\")\n- **`procedure_note`** - The patient notes with sections removed (same as `question` field)\n- **`reference_plan`** - Gold standard treatment plan/summary (same as `answer` field)\n- **`judge_feedback`** - List of judge evaluations with scores and explanations for each dimension\n\n### Notes\n\n- The `question` field contains everything BEFORE the first PLAN/SUMMARY/FINDINGS section (HELM's exact approach)\n- The `answer` field contains the first line after the prioritized section header (PLAN > SUMMARY > FINDINGS)\n- Scores are normalized to 0-1 by dividing each dimension score (1-5) by 5 and averaging across dimensions\n- If judge response parsing fails, dimension scores default to `None` and do not contribute to the final reward\n\n## Data Processing\n\nFollowing HELM's exact approach:\n1. **Section Extraction**: Extracts the first line after `PLAN:`, `SUMMARY:`, or `FINDINGS:` headers (priority order: PLAN > SUMMARY > FINDINGS)\n2. **Input Cleaning**: Takes everything BEFORE the first section header found as the input\n3. 
**Reference**: The extracted section content becomes the gold standard answer\n\n## Dataset Examples\n\n**Example: AC Separation Revision & Hardware Removal**\n```\nPatient Notes (input):\nMedical Specialty:Orthopedic\nSample Name: AC Separation Revision & Hardware Removal\nDescription: Removal of the hardware and revision of right AC separation...\nPREOPERATIVE DIAGNOSIS: Right AC separation.\nPOSTOPERATIVE DIAGNOSIS: Right AC separation.\nPROCEDURES: Removal of the hardware and revision of right AC separation.\nANESTHESIA: General.\nBLOOD LOSS: 100 cc.\nCOMPLICATIONS: None.\n\nReference Answer (extracted from SUMMARY section):\nAfter informed consent was obtained and verified, the patient was brought to\nthe operating room and placed supine on the operating table. After uneventful\ngeneral anesthesia was obtained, he was positioned in the beach chair...\n\nGenerated Treatment Plan:\n1. Postoperative Care: Monitor vital signs and surgical site for signs of infection...\n2. Immobilization: Use of a sling or shoulder immobilizer for 2-4 weeks...\n3. Physical Therapy: Begin passive range of motion exercises around 2-3 weeks post-op...\n4. Follow-up: Schedule follow-up visits at 1-2 weeks post-op for wound check...\n```\n\n## References\n\n\n**HELM MTSamples Implementation**\n```bibtex\n@misc{helm2023,\n  title={Holistic Evaluation of Language Models},\n  author={Liang, Percy and Bommasani, Rishi and Lee, Tony and others},\n  year={2023},\n  url={https://github.com/stanford-crfm/helm}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":6669},"status":null}