{"data":{"kind":"file","path":"README.md","version_id":"o1kg55ocbcwoq4wcyal5c1rm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5816,"modified_at":"2026-02-20T23:11:44.192000","content_hash":"16a1c921fa79e3a07775f7579c848d50b314b77e89ad124d30c5f8452b2554d2"},"entries":[],"content":"# MedDialog (English)\n\n## Overview\n- **Environment ID**: `med_dialog`\n- **Short description**: MedDialog is a benchmark of real-world doctor-patient conversations focused on health-related concerns and advice. Each dialogue is paired with a one-sentence summary that reflects the core patient question or exchange. The benchmark evaluates a model's ability to condense medical dialogue into concise, informative summaries.\n\n## Dataset\n- **Split sizes**:\n  - Train: 205,973\n  - Valid: 25,746\n  - Test: 25,750\n- **Source**:\n  - [MedDialog: a large-scale medical dialogue dataset](https://arxiv.org/abs/2004.03329) (Chen et al., 2020)\n  - Preprocessing by MedHELM following [BioBART](https://arxiv.org/abs/2204.03905) (Yuan et al., 2022)\n  - Original dataset: [Medical-Dialogue-System](https://github.com/UCSD-AI4H/Medical-Dialogue-System) (Chen et al., 2020)\n\n## Task\n- **Type**: Single-Turn\n- **Rubric overview**: LLM-as-a-judge evaluation using prompts adapted from MedHELM (single or multi-judge)\n- **Evaluation dimensions**:\n  - **Accuracy** (1-5): Does the summary correctly capture the main medical issue and clinical details?\n  - **Completeness** (1-5): Does the summary include all important medical information?\n  - **Clarity** (1-5): Is the summary easy to understand for clinical use?\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run med_dialog -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nJudge examples:\n\n```bash\nmedarc-eval med_dialog -m \"openai/gpt-5-mini\" -n 20 -s --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge via the `medarc-verifiers` `MultiJudge` rubric. 
When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` are all list-valued and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n```bash\nmedarc-eval med_dialog -m \"openai/gpt-5-mini\" -n 20 -s --judge-model \"openai/gpt-5-mini\" --judge-model \"x-ai/grok-4.1-fast\"\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n\n## Environment Arguments\nSupported environment arguments:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `cache_dir` | str \\| Path \\| None | `~/.cache/meddialog` | Local directory to cache downloaded datasets. Can also be set via the `MEDDIALOG_CACHE_DIR` environment variable. |\n| `judge_model` | str \\| list[str] | `\"gpt-4o-mini\"` | Model identifier(s) for the LLM judge evaluating summaries |\n| `judge_base_url` | str \\| list[str] \\| None | `None` | Custom API base URL(s) for the judge model(s) (defaults to the OpenAI API) |\n| `judge_api_key` | str \\| list[str] \\| None | `None` | API key(s) for judge model. 
Falls back to `JUDGE_API_KEY` environment variable if not provided |\n\n## Results Dataset Structure\n### Core Evaluation Fields\n\n- **`prompt`** - The input conversation presented to the model (list of message objects with `role` and `content`)\n- **`completion`** - The model's generated summary (list of message objects)\n- **`reward`** - Overall score from 0.0 to 1.0, calculated as the average of normalized dimension scores: `(accuracy/5 + completeness/5 + clarity/5) / 3`\n\n### Example Metadata (`info`)\nContains all the MedDialog-specific information about each dialogue:\n\n- **`id`** - Unique identifier for the dialogue\n- **`conversation`** - The full patient-doctor conversation text\n- **`reference_response`** - Gold standard one-sentence summary\n- **`subset`** - Either `\"healthcaremagic\"` or `\"icliniq\"`\n- **`index`** - Original index in the source dataset\n\n### Notes\n\n- The `question` field in the dataset maps to the full conversation text\n- The `answer` field contains the gold standard summary (also available as `reference_response` in `info`)\n- Scores are normalized to 0-1 by dividing each dimension score (1-5) by 5 and averaging across dimensions\n- If judge response parsing fails, dimension scores default to `None` and do not contribute to the final reward\n\n## Dataset Examples\n\n```\nPatient: I get cramps on top of my left forearm and hand and it causes my hand and\nfingers to draw up and it hurts. It mainly does this when I bend my arm. I ve been\ntold that I have a slight pinch in a nerve in my neck. Could this be a cause? I don t\nthink so.\n\nDoctor: Hi there. 
It may sound difficult to believe it, but the nerves which supply\nyour forearms and hand, start at the level of spinal cord and on their way towards the\nforearm and hand regions which they supply, the course of these nerves pass through\ndifference fascial and muscular planes that can make them susceptible to entrapment\nneuropathies...\n\nSummary: Could painful forearms be related to pinched nerve in neck?\n```\n\n```\nPatient: Hello doctor, We are looking for a second opinion on my friend's MRI scan of\nboth the knee joints as he is experiencing excruciating pain just above the patella.\nHe has a sudden onset of severe pain on both the knee joints about two weeks ago...\n\nDoctor: Hi. I viewed the right and left knee MRI images. Left knee: The MRI, left knee\njoint shows a complex tear in the posterior horn of the medial meniscus area and mild\nleft knee joint effusion...\n\nSummary: My friend has excruciating knee pain. Please interpret his MRI report\n```\n\n## References\n\n**MedDialog Dataset**\n```bibtex\n@misc{chen2020meddiag,\n  title={MedDialog: a large-scale medical dialogue dataset},\n  author={Chen, Shu and Ju, Zeqian and Dong, Xiangyu and Fang, Hongchao and Wang, Sicheng and Yang, Yue and Zeng, Jiaqi and Zhang, Ruisi and Zhang, Ruoyu and Zhou, Meng and Zhu, Penghui and Xie, Pengtao},\n  publisher = {arXiv},\n  year={2020},\n  url = {https://arxiv.org/abs/2004.03329},\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":5816},"status":null}