{"data":{"kind":"file","path":"README.md","version_id":"ffluydzfqwmur618ng6167x6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1971,"modified_at":"2026-01-26T19:21:45.584000","content_hash":"a1c8a63a422634114c9157997ffb8c576783a3e1f21d781a7759c03f41fb49b3"},"entries":[],"content":"# HEAD-QA Standalone\n\nEnvironment for the HEAD-QA dataset.\n\n### Overview\n- **Environment ID**: `head-qa`\n- **Short description**: Single-turn medical multiple-choice QA \n- **Tags**: medical, single-turn, multiple-choice, train, eval\n\n### Datasets\n- **Primary dataset(s)**: HEAD-QA (HF datasets)\n- **Source links**: [EleutherAI/headqa](https://huggingface.co/datasets/EleutherAI/headqa) \n- **Split sizes**: Uses provided train and validation splits\n\n### Usage\nTo run an evaluation using `vf-eval` with the OpenAI API:\n\n```bash\nexport OPENAI_API_KEY=sk-...\nuv run vf-eval \\\n  -m gpt-4.1-mini \\\n  -n 5 \\\n  -s \\\n  head_qa\n```\nReplace `sk-...` with your actual OpenAI API key.\n\n### Authors\nThis environment has been put together by:\n\nRatna Sagari Grandhi - ([@sagarigrandhi](https://github.com/sagarigrandhi))\n\n### Credits \nDataset:\n```bibtex\n@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{\\'o}mez-Rodr{\\'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. 
The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n```\n\n","encoding":"utf-8","truncated":false,"total_bytes":1971},"status":null}