{"data":{"kind":"file","path":"README.md","version_id":"oxnt0jwgnd29vmupznkwx0fu","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2233,"modified_at":"2026-02-20T23:11:44.192000","content_hash":"30b11b7057182850af17367259534663642b1c7841dfd688ba7584c1466dc532"},"entries":[],"content":"# HEAD-QA\n\nEvaluation environment for the HEAD-QA dataset.\n\n## Overview\n- **Environment ID**: `head-qa`\n- **Short description**: Single-turn medical multiple-choice QA\n- **Tags**: medical, single-turn, multiple-choice, train, eval\n\n## Datasets\n- **Primary dataset(s)**: HEAD-QA (HF datasets)\n- **Source links**: [EleutherAI/headqa](https://huggingface.co/datasets/EleutherAI/headqa)\n- **Split sizes**: Uses the provided train and validation splits\n\n## Task\n- **Type**: Single-turn\n- **Rubric overview**: Binary scoring (1.0 / 0.0) based on whether the predicted answer is correct.\n- **Reward function**: `accuracy` returns 1.0 if the predicted answer matches the correct answer, else 0.0.\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run head_qa -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\n## Usage\nTo run an evaluation using `medarc-eval`:\n\n```bash\nmedarc-eval head_qa -m \"openai/gpt-5-mini\" -n 5 -s\n```\nMake sure `OPENAI_API_KEY` is set to your actual API key before running.\n\n## Authors\nThis environment was put together by:\n\nRatna Sagari Grandhi - ([@sagarigrandhi](https://github.com/sagarigrandhi))\n\n## Credits\nDataset:\n```bibtex\n@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{\\'o}mez-Rodr{\\'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = 
\"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":2233},"status":null}