{"data":{"kind":"file","path":"README.md","version_id":"b51sd5yptpzkv7cit96fdd8c","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2391,"modified_at":"2026-01-24T16:12:23.963000","content_hash":"70b4803f59445dc052e0ad4ecffec8a592c4fd65aa334fd152fa8056c4758ab1"},"entries":[],"content":"# HEAD-QA\n\nEvaluation environment for the HEAD-QA dataset.\n\n### Overview\n- **Environment ID**: `head-qa`\n- **Short description**: Single-turn medical multiple-choice QA\n- **Tags**: medical, single-turn, multiple-choice, train, eval\n\n### Datasets\n- **Primary dataset(s)**: HEAD-QA (HF datasets)\n- **Source links**: [EleutherAI/headqa](https://huggingface.co/datasets/EleutherAI/headqa)\n- **Split sizes**: Uses provided train and validation splits\n\n### Task\n- **Type**: Single-turn\n- **Parser**: `XMLParser` or `Parser` (BOXED format), depending on `answer_format`; `ThinkParser` is used when `use_think=True`\n- **Rubric overview**: Binary scoring (1.0 / 0.0) based on whether the answer is correct.\n- **Reward function**: `accuracy`, which returns 1.0 if the predicted answer matches the correct answer, else 0.0.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval head_qa\n```\n\n### Usage\nTo run an evaluation using `vf-eval` with the OpenAI API:\n\n```bash\nexport OPENAI_API_KEY=sk-...\nuv run vf-eval \\\n  -m gpt-4.1-mini \\\n  -n 5 \\\n  -s \\\n  head_qa\n```\nReplace `sk-...` with your actual API key.\n\n### Authors\nThis environment was put together by:\n\nRatna Sagari Grandhi - ([@sagarigrandhi](https://github.com/sagarigrandhi))\n\n### Credits\nDataset:\n```bibtex\n@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{\\'o}mez-Rodr{\\'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = 
\"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n```\n\n","encoding":"utf-8","truncated":false,"total_bytes":2391},"status":null}