{"data":{"kind":"file","path":"README.md","version_id":"fknoppgw6kz4vdw1z6zdzqpv","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2231,"modified_at":"2026-02-20T23:11:44.192000","content_hash":"51b95c39a474204ee5c46d6cf36d4c816530f8fbbff2870f0a2635740bf5db15"},"entries":[],"content":"# HEAD-QA v2\n\nEvaluation environment for the HEAD-QA v2 dataset.\n\n## Overview\n- **Environment ID**: `head-qa-v2`\n- **Short description**: Single-turn medical multiple-choice QA \n- **Tags**: medical, single-turn, multiple-choice, eval\n\n## Datasets\n- **Primary dataset(s)**: HEAD-QA v2 (HF datasets)\n- **Source links**: [alesi12/head_qa_v2](https://huggingface.co/datasets/alesi12/head_qa_v2) \n- **Split sizes**: Uses the provided train split for evaluation\n\n## Task\n- **Type**: Single-turn\n- **Rubric overview**: Binary scoring (1.0 / 0.0), based on correct answer\n- **Reward function:** `accuracy` — returns 1.0 if the predicted answer matches, else 0.0.\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run head-qa-v2 -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\n## Usage\nTo run an evaluation using medarc-eval:\n\n```bash\nmedarc-eval head-qa-v2 -m \"openai/gpt-5-mini\" -n 5 -s --shuffle-answers --shuffle-seed 42\n```\n\n## Authors\nThis environment has been put together by:\n\nRatna Sagari Grandhi - ([@sagarigrandhi](https://github.com/sagarigrandhi))\n\n## Credits \nDataset:\n```bibtex\n@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{\\'o}mez-Rodr{\\'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":2231},"status":null}