{"data":{"kind":"file","path":"README.md","version_id":"g2ixycljvxv8ita2m054qo1w","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5595,"modified_at":"2026-02-20T23:11:44.205000","content_hash":"dbcef51fd03b52a48e196c3417f3844faee2023bb746c7b4cf10fe283703727d"},"entries":[],"content":"# MedicationQA\n\n## Overview\n- **Environment ID**: `medicationqa`\n- **Short description**: MedicationQA (MedInfo 2019) is a benchmark of single-turn, consumer-style medication questions paired with expert answers. It evaluates how well models provide safe, accurate, and complete responses to medication-related questions.\n\n---\n\n## Dataset\n- **Source**: [Medication_QA_MedInfo2019](https://github.com/abachaa/Medication_QA_MedInfo2019)\n- **Publication**: Ben Abacha et al., *MEDINFO 2019* – “Bridging the Gap between Consumers’ Medication Questions and Trusted Answers”\n- **Split sizes**:\n  - **Test:** 690  _(full dataset; no official train/validation partitions)_\n- **Notes:**  \n  The original dataset does not define official splits.  \n  For consistency with other single-turn environments (e.g., `med_dialog`), this environment exposes the *entire* dataset as the `test` split.\n\n---\n\n## Task\n- **Type:** Single-Turn QA  \n- **Rubric:** LLM-as-a-Judge (adapted from MedHELM / MedDialog)  \n- **Evaluation dimensions:**\n  - **Accuracy (1–5)** – factual correctness of the answer  \n  - **Completeness (1–5)** – inclusion of all relevant medication details  \n  - **Clarity (1–5)** – readability and understandability for lay users  \n\n---\n\n## Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nprime eval run medicationqa -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nRun with an explicit judge model:\n\n```bash\nmedarc-eval medicationqa -m \"openai/gpt-5-mini\" -n 10 -s --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge evaluation via the `medarc-verifiers` `MultiJudge` rubric. 
When multiple judge models are specified, each judge scores the response independently and the final reward is the mean of their scores. `judge_model`, `judge_base_url`, and `judge_api_key` each accept a list, and the lists correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n```bash\nmedarc-eval medicationqa -m \"openai/gpt-5-mini\" -n 10 -s --judge-model \"openai/gpt-5-mini\" --judge-model \"x-ai/grok-4.1-fast\"\n```\n\n**Notes**\n- Use direct environment flags with `medarc-eval` (for example, `--split test` or `--judge-model gpt-5-mini`).\n\n---\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `cache_dir` | `str \\| Path \\| None` | `~/.cache/medicationqa` | Local directory to cache the MedicationQA dataset. |\n| `judge_model` | `str \\| list[str]` | `\"gpt-4o-mini\"` | Model identifier(s) for the LLM judge(s). |\n| `judge_base_url` | `str \\| list[str] \\| None` | `None` | Custom API base URL(s) (e.g., for Ollama or local models). |\n| `judge_api_key` | `str \\| list[str] \\| None` | `None` | API key(s) for the judge model(s) (falls back to `JUDGE_API_KEY`). |\n\n---\n\n## Results Dataset Structure\n\n### Core Evaluation Fields\n- **`prompt`** – The medication question presented to the model (either as plain text or a list of chat-style message objects with `role` and `content`).  \n- **`completion`** – The model-generated answer, represented as a list of message objects (e.g., `[{\"role\": \"assistant\", \"content\": \"...\"}]`).  \n- **`reward`** – Normalized score in `[0, 1]`, computed as the mean of the three dimension scores after dividing each by 5: `(accuracy/5 + completeness/5 + clarity/5) / 3`.\n\n### Example Metadata (`info`)\n- **`id`** – Unique identifier for each question (e.g., `medicationqa_42`).  \n- **`question`** – Consumer-style medication question.  \n- **`reference_answer`** – Gold reference answer from the MedInfo dataset.  
\n- **`question_type`**, **`question_focus`**, etc. – Original metadata fields included as context.\n\n---\n\n## Example\n\n```\nQuestion:\nHow does rivastigmine interact with over-the-counter sleep medication?\n\nReference Answer:\nTell your doctor and pharmacist what prescription and nonprescription medications,\nvitamins, nutritional supplements, and herbal products you are taking or plan to take.\nBe sure to mention any of the following: antihistamines; aspirin and other NSAIDs such\nas ibuprofen and naproxen; bethanechol; ipratropium; and medications for Alzheimer’s disease,\nglaucoma, irritable bowel disease, motion sickness, ulcers, or urinary problems.\nYour doctor may need to change the doses of your medications or monitor you carefully for side effects.\n```\n\n---\n\n## References\n\n**MedicationQA Dataset**\n```bibtex\n@inproceedings{BenAbacha:MEDINFO19,\n  author    = {Asma {Ben Abacha} and Yassine Mrabet and Mark Sharp and\n               Travis Goodwin and Sonya E. Shooshan and Dina Demner{-}Fushman},\n  title     = {Bridging the Gap between Consumers’ Medication Questions and Trusted Answers},\n  booktitle = {MEDINFO 2019},\n  year      = {2019},\n  abstract  = {This paper addresses the task of answering consumer health questions about medications. To better understand the challenge and needs in terms of methods and resources, we first introduce a gold standard corpus for Medication Question Answering created using real consumer questions. The gold standard consists of six hundred and seventy-four question-answer pairs with annotations of the question focus and type and the answer source. We first present the manual annotation and answering process. In the second part of this paper, we test the performance of recurrent and convolutional neural networks in question type identification and focus recognition. Finally, we discuss the research insights from both the dataset creation process and our experiments. 
This study provides new resources and experiments on answering consumers’ medication questions and discusses the limitations and directions for future research efforts.}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":5595},"status":null}