{"data":{"kind":"file","path":"README.md","version_id":"bb1ltihf9rqq7s3t95xcu05g","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6259,"modified_at":"2026-02-20T23:11:44.184000","content_hash":"2b6b028b339bb10d558dcaf1e69271226ffee567716ae5e5b061e8d795d9a404"},"entries":[],"content":"# ACI-Bench\n\n## Overview\n- **Environment ID**: `aci-bench`\n- **Short description**: Convert doctor-patient dialogue into structured clinical notes.\n- **Tags**: medical, clinical, dialogue, summarization, llm-judge, single-turn, train, eval, test\n\n## Datasets\n- **Primary dataset**: `ACI-Bench`\n- **Source links**: [Paper](https://www.nature.com/articles/s41597-023-02487-3), [GitHub](https://github.com/wyim/aci-bench), [HF Dataset](https://huggingface.co/datasets/mkieffer/ACI-Bench-MedARC)\n- **Split sizes**:\n\n| subset     | transcript_version | train | valid | test1 | test2 | test3 | Total |\n| ---------- | ------------------ | ----- | ----- | ----- | ----- | ----- | ----- |\n| aci        | asr                | 35    | 11    | 22    | 22    | 22    | 112   |\n| aci        | asrcorr            | 35    | 11    | 22    | 22    | 22    | 112   |\n| aci        | humantrans         | 0     | 0     | 0     | 0     | 0     | 0     |\n| virtassist | asr                | 0     | 0     | 0     | 0     | 0     | 0     |\n| virtassist | asrcorr            | 0     | 0     | 0     | 0     | 0     | 0     |\n| virtassist | humantrans         | 20    | 5     | 10    | 10    | 10    | 55    |\n| virtscribe | asr                | 12    | 4     | 8     | 8     | 8     | 40    |\n| virtscribe | asrcorr            | 0     | 0     | 0     | 0     | 0     | 0     |\n| virtscribe | humantrans         | 12    | 4     | 8     | 8     | 8     | 40    |\n| ALL        | ALL                | 114   | 35    | 70    | 70    | 70    | 359   |\n\n\nThe dataset consists of three subsets, each capturing a different clinical workflow:\n1) ambient clinical intelligence (`aci`): doctor-patient 
dialogue\n2) virtual assistant (`virtassist`): doctor-patient dialogue with cues to trigger Dragon Copilot, e.g., \"hey, dragon. show me the chest x-ray\"\n3) virtual scribe (`virtscribe`): doctor-patient dialogue with a short dictation from the doctor about the patient at the very beginning\n\nThere are three different transcription versions:\n1) `asr`: machine-transcribed\n2) `asrcorr`: human corrections to `asr`, for example: \"nonsmile\" in D2N081 --> \"non-small\" in ACI006\n3) `humantrans`: transcribed by a human\n\nThe subsets have the following transcription versions:\n1) `aci`: `asr` and `asrcorr`\n2) `virtassist`: `humantrans` only\n3) `virtscribe`: `asr` and `humantrans`\n\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: LLM-as-a-judge evaluation using prompts adapted from MedHELM (single or multi-judge)\n- **Evaluation dimensions**:\n  - **Accuracy** (1-5): Does the clinical note correctly capture the main medical issue and clinical details?\n  - **Completeness** (1-5): Does the clinical note include all important medical information?\n  - **Clarity** (1-5): Is the clinical note easy to understand for clinical use?\n\n## Quickstart\n\nRun a quick evaluation with `prime eval`:\n\n```bash\nprime eval run aci-bench -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nTo pass environment-specific options, use `--env-args` (JSON).\n\n```bash\nprime eval run aci-bench -m \"openai/gpt-5-mini\" -n 5 --env-args '{\"subset\": \"aci\", \"judge_model\": \"openai/gpt-5-mini\"}'\n```\n\nOr use `medarc-eval` for named flags:\n\n```bash\nmedarc-eval aci-bench -m \"openai/gpt-5-mini\" -n 5 --subset aci --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge evaluation via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. 
`judge_model`, `judge_base_url`, and `judge_api_key` each accept a list, and their entries correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n```bash\nmedarc-eval aci-bench -m \"openai/gpt-5-mini\" --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash\"\n```\n\n## Environment Arguments\n\n| Arg                  | Type             | Default             | Description |\n| -------------------- | ---------------- | ------------------- | ----------- |\n| `subset`             | str              | `all`               | The subset of the dataset to use (`all`, `aci`, `virtassist`, `virtscribe`) |\n| `transcript_version` | str              | `all`               | The transcript version to use (`all`, `asr`, `asrcorr`, `humantrans`) |\n| `answer_format`      | str              | `xml`               | The format of the answer (`xml`, `boxed`) |\n| `system_prompt`      | str \\| None      | `None`              | Optional system prompt override |\n| `judge_model`        | str \\| list[str] | `openai/gpt-5-mini` | Model identifier(s) for the LLM judge |\n| `judge_base_url`     | str \\| list[str] | `None`              | Custom API base URL(s) for judge model (defaults to OpenAI API) |\n| `judge_api_key`      | str \\| list[str] | `None`              | API key(s) for judge model. 
Falls back to the `JUDGE_API_KEY` environment variable if not provided |\n\n\n### Notes\n\n- The `question` field in the dataset maps to the full conversation text\n- The `answer` field contains the gold-standard summary (also available as `reference_response` in `info`)\n- Scores are normalized to 0-1 by dividing each dimension score (1-5) by 5 and averaging across dimensions\n- If judge response parsing fails, dimension scores default to `None` and do not contribute to the final reward\n\n## Dataset Examples\n\n```\nDialogue:\n[doctor] good morning julie how are you doing this morning\n[patient] i've been better my primary care doctor wanted me to see you because of this this knee pain that i've been having for about six months now\n...\n\nNote:\nCHIEF COMPLAINT\nBilateral knee pain.\n\nSOCIAL HISTORY\nThe patient is an avid runner. She also works from home.\n...\n```\n\n## References\n\n```bibtex\n@article{aci-bench,\n  author = {Wen{-}wai Yim and\n                Yujuan Fu and\n                Asma {Ben Abacha} and\n                Neal Snider and Thomas Lin and Meliha Yetisgen},\n  title = {ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation},\n  journal = {Nature Scientific Data},\n  year = {2023}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":6259},"status":null}