{"data":{"kind":"file","path":"README.md","version_id":"h6ftu3lau3ukr77yku5s3kek","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8395,"modified_at":"2026-02-20T23:11:44.201000","content_hash":"0b9debd400b584d987a6d7af4a5dcb824d7a7c9620c9c0906e369342cfb6e92a"},"entries":[],"content":"# medexqa-env- by mnishant2\n\n## Overview\n- **Environment ID**: `medexqa`\n- **Short description**: Medical QA with multiple-choice questions and explanations across five underrepresented medical specialties\n- **Tags**: medical, clinical, single-turn, multiple-choice, explanations, evaluation\n\n## Datasets\n- **Primary dataset(s)**: MedExQA\n- **Source links**: [Paper](https://arxiv.org/abs/2406.06331), [HuggingFace Dataset](https://huggingface.co/datasets/bluesky333/MedExQA), [GitHub](https://github.com/knowlab/MedExQA)\n- **Split sizes**:\n\n    | Specialty                   | Dev | Test | Total |\n    | --------------------------- | --- | ---- | ----- |\n    | Biomedical Engineering      | 4   | 144  | 148   |\n    | Clinical Laboratory Science | 9   | 368  | 377   |\n    | Clinical Psychology         | 3   | 108  | 111   |\n    | Occupational Therapy        | 5   | 189  | 194   |\n    | Speech Language Pathology   | 4   | 131  | 135   |\n    | **Total**                   | **25** | **940** | **965** |\n\n## Task\n- **Type**: single-turn\n- **Prompting**: Uses the authors' instruction embedded in the user message; options A/B/C/D are included.\n  ```\n  The following is a multiple-choice question. Please choose the most suitable one among A, B, C and D as the answer to this question. 
Your answer should be paired with an explanation why you chose that answer.\n  ```\n- **Answer extraction** ([authors' logic](https://github.com/knowlab/MedExQA/blob/9a5b34af103b0c8ba0c00906e278f6572249fafa/evaluate_pipe_MedExQA.py)):\n  - Canonical letter extraction using a sequence of regex patterns (e.g., explicit \"Answer is A:\", leading letter, etc.)\n  - If no explicit letter is found, fuzzy matching (thefuzz) maps the generated text to the closest option and returns the corresponding letter\n- Evaluation can be run on a single specialty or on several at once\n- Explanations are scored with lexical metrics (`rougeL`, `bleu`, `bertscore`, `meteor`) or with an LLM-as-a-judge\n- **Rubric overview**:\n  - MCQ accuracy: 0 or 100 per example\n  - Explanation score: 0–100 per example (lexical metrics average); 0 if the answer is wrong\n  - Combined reward: explanation grading is only applied when the MCQ answer is correct\n- **Model Download**:\n  On the first run, the environment downloads the NLTK resources (including `wordnet`) and the SciBERT model needed for the lexical metrics.\n\n## Quickstart\n\n- Run MCQ-only (no explanation scoring):\n```bash\nprime eval run medexqa -m gpt-5-mini -n 5 -s\n```\n\n- Run with explanation scoring (lexical metrics):\n```bash\nmedarc-eval medexqa -m gpt-5-mini --use-explanations\n```\n\n- Use LLM-as-judge for explanations (instead of lexical metrics):\n```bash\nexport JUDGE_API_KEY=sk-...\nmedarc-eval medexqa -m \"openai/gpt-5-mini\" -n 10 -s --use-explanations --use-judge --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. 
`judge_model`, `judge_base_url`, and `judge_api_key` each accept a list, and entries correspond by position; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n- Multi-judge example with two judge models:\n```bash\nmedarc-eval medexqa -m \"openai/gpt-5-mini\" -n 10 -s --use-explanations --use-judge --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash-preview\"\n```\n\n- Run every example and average all four lexical metrics:\n```bash\nmedarc-eval medexqa -m gpt-5-mini -n -1 --use-explanations --explanation-metrics all\n```\n\n## Environment Arguments\n\n| Arg                    | Type                   | Default        | Description |\n| ---------------------- | ---------------------- | -------------- | ----------- |\n| `specialty`            | list[str] \\/ str \\| None | `None`         | Select one or more specialties. Codes: `BE`, `CLS`, `CP`, `OT`, `SLP`. `None`\\/`ALL` loads all. |\n| `use_explanations`     | bool                   | `True`         | Whether to compute explanation scores. |\n| `shuffle_answers`      | bool                   | `False`        | Whether to shuffle answer choices in each question. |\n| `shuffle_seed`         | int \\| None            | `1618`         | Seed for deterministic answer shuffling. |\n| `cache_dir`            | str \\| Path \\| None    | `None`         | Local cache path for downloaded specialty files. |\n| `explanation_metrics`  | list[str] \\/ str \\| None | `None`         | Lexical metrics to use: any of `rougeL`, `bleu`, `meteor`, `bertscore`. `None`\\/`\"all\"` averages all four. |\n| `use_judge`            | bool                   | `True`         | Use LLM-as-judge for explanations instead of lexical metrics. |\n| `judge_model`          | str \\| list[str]       | `gpt-4o-mini`  | Judge model name(s). |\n| `judge_base_url`       | str \\| list[str] \\| None | `None`       | Judge API base URL(s). 
|\n| `judge_api_key`        | str \\| list[str] \\| None | `None`       | Judge API key(s) (falls back to `JUDGE_API_KEY` or `OPENAI_API_KEY`). |\n\n## Metrics\n\n- **Answer accuracy (per example)**: 0 or 100. Uses the authors' regex and fuzzy-matching logic to extract a letter.\n- **Explanation score (per example)**: 0–100. If the answer is wrong, the explanation score is 0.\n  - Lexical metrics supported: `rougeL`, `bleu`, `meteor`, `bertscore` (with SciBERT `allenai/scibert_scivocab_uncased`).\n  - Selection via `explanation_metrics` (list or `'all'`/`None` to average all four).\n- **Combined score**: `mcq_weight * accuracy + explanation_weight * explanation`.\n\nOptional LLM-as-judge for explanations:\n- Set `use_explanations=true` and `use_judge=true` to replace lexical metrics with judge scoring (0–100 after scaling).\n- Criteria include medical accuracy, relevance, clarity, completeness, and use of medical concepts. The judge score is 0 if the extracted answer is wrong.\n\n## Specialty Selection and Macro Average\n\n- Single specialty by code:\n```bash\nmedarc-eval medexqa -m gpt-5-mini --specialty CLS\n```\n\n- Multiple specialties:\n```bash\nmedarc-eval medexqa -m gpt-5-mini --specialty CLS --specialty CP\n```\n\n- All specialties:\n```bash\nmedarc-eval medexqa -m gpt-5-mini --specialty ALL\n```\n\n## IMPORTANT: Macro-average accuracy (as reported in the paper)\n- Run each specialty separately and average the per-run mean answer accuracies; or\n- Run multiple specialties with `-s` to save results. Each saved example includes its `specialty` in `info`, along with its per-example `answer_accuracy_reward`. Use the saved JSONL to compute per-specialty accuracies and then take the unweighted mean across specialties.\n\n## Testing Instructions\n\n### 1. Environment Setup\n```bash\n# Navigate to repository root\ncd /data/storage_hpc_nishant/med-lm-envs\n\n# Sync uv environment\nuv sync\n```\n\n### 2. 
Quick Validation Test (MCQ-only)\n```bash\nmedarc-eval medexqa -m gpt-5-mini -n 5 --no-use-explanations\n```\n\n### 3. Full Evaluation with Save\n```bash\nexport OPENAI_API_KEY=sk-...\nmedarc-eval medexqa -m gpt-5-mini -n -1 -s --specialty ALL --use-explanations\n```\n\n### 4. LLM-as-Judge for Explanations\n```bash\nexport JUDGE_API_KEY=sk-...\nmedarc-eval medexqa -m gpt-5-mini -n -1 -s --use-explanations --use-judge --judge-model openai/gpt-5-mini --judge-model google/gemini-3-flash-preview\n```\n\n### 5. With Shuffled Choices\n```bash\nmedarc-eval medexqa -m gpt-5-mini -n -1 --shuffle-answers --shuffle-seed 42\n```\n\n### 6. Example Run with OpenRouter\n```bash\nexport OPENROUTER_API_KEY=....\nmedarc-eval medexqa -m gpt-5-mini -b https://openrouter.ai/api/v1 -k OPENROUTER_API_KEY -n 10 -c 1 --use-explanations --explanation-metrics all --specialty BE --specialty OT -s\n```\nOutput:\n```bash\nRewards:\nreward: avg - 59.416, std - 19.928\nr1: [67.79, 65.809, 64.158, 66.619, 69.124, 0.0, 66.957, 66.327, 66.87, 60.503]\nanswer_accuracy_reward: avg - 90.000, std - 30.000\nr1: [100.0, 100.0, 100.0, 100.0, 100.0, 0.0, 100.0, 100.0, 100.0, 100.0]\nexplanation_reward: avg - 28.832, std - 10.577\nr1: [35.58, 31.618, 28.316, 33.239, 38.249, 0.0, 33.915, 32.653, 33.741, 21.006]\n```\n## Authors\nThis environment has been put together by:\n\nNishant Mishra - ([mnishant2](https://github.com/mnishant2))\n\n## Citation\n\n```bibtex\n@article{kim2024medexqa,\n  title={MedExQA: Medical Question Answering Benchmark with Multiple Explanations},\n  author={Kim, Yunsoo and Wu, Jinge and Abdulle, Yusuf and Wu, Honghan},\n  journal={arXiv preprint arXiv:2406.06331},\n  year={2024}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":8395},"status":null}