{"data":{"kind":"file","path":"README.md","version_id":"y34fo5fmxv26btdqn1m4ewe0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3555,"modified_at":"2026-02-20T23:11:44.218000","content_hash":"b00da1acab3f89ce4b3ae0f10cb2e1a953dd39f6365c6d577a7683fe05d44e49"},"entries":[],"content":"# PubHealthBench\n\nEvaluation environment for the [Joshua-Harris/PubHealthBench](https://huggingface.co/datasets/Joshua-Harris/PubHealthBench) dataset.\n\n## Overview\n- **Environment ID**: `pubhealthbench`\n- **Short description**: Public health MCQ and free-form evaluation derived from UK Health Security Agency (UKHSA) guidance documents\n- **Tags**: medical, public-health, single-turn, multiple-choice, llm-judge, eval\n\n## Datasets\n\nPubHealthBench contains public health questions derived from UK Health Security Agency (UKHSA) guidance documents. Questions cover topics including:\n- Gastro/food safety\n- Chemicals/toxicology\n- Vaccine-preventable diseases and immunisation\n- And more\n\n## Splits\n\n| Split | Type | Questions | Description |\n|-------|------|-----------|-------------|\n| `full` | MCQ | 7,929 | Full test set |\n| `validation` | MCQ | 161 | Validation set |\n| `reviewed` | MCQ | 760 | Human-reviewed questions (default) |\n| `freeform` | LLM-as-judge | 760 | Reviewed set with open-ended evaluation |\n| `freeform_valid` | LLM-as-judge | 161 | Validation set with open-ended evaluation |\n\n## Quickstart\n\nInstall:\n\n```bash\nvf-install pubhealthbench\n```\n\nRun MCQ evaluation (default: reviewed split):\n\n```bash\nprime eval run pubhealthbench -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nUse full test split:\n\n```bash\nmedarc-eval pubhealthbench --split full -m \"openai/gpt-5-mini\" -n 10\n```\n\nWith answer shuffling:\n\n```bash\nmedarc-eval pubhealthbench --shuffle-answers -m \"openai/gpt-5-mini\" -n 10\n```\n\nFreeform (LLM-as-judge) single-judge evaluation:\n\n```bash\nmedarc-eval pubhealthbench --split freeform -m \"openai/gpt-5-mini\" 
-n 10 --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge evaluation via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each judge scores the response independently and the final reward is the mean of their scores. `judge_model`, `judge_base_url`, and `judge_api_key` each accept a list, and entries are matched by position; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n```bash\nmedarc-eval pubhealthbench --split freeform -m \"openai/gpt-5-mini\" -n 10 --judge-model \"openai/gpt-5-mini\" --judge-model \"google/gemini-3-flash-preview\"\n```\n\n## Environment Arguments\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `split` | str | \"reviewed\" | Dataset split (see table above) |\n| `shuffle_answers` | bool | False | Randomize answer option order (MCQ only) |\n| `shuffle_seed` | int | 1618 | Seed for deterministic shuffling (MCQ only) |\n| `answer_format` | str | \"xml\" | Answer format: \"xml\" or \"boxed\" (MCQ only) |\n| `judge_model` | str \\| list[str] | \"gpt-4o-mini\" | Judge model(s) for freeform evaluation |\n| `judge_base_url` | str \\| list[str] | None | Base URL(s) for judge API |\n| `judge_api_key` | str \\| list[str] | None | API key(s) for judge |\n\n## Authors\nThis environment has been put together by:\n\nBenjamin Warner - ([@warner-benjamin](https://github.com/warner-benjamin))\n\n## Citation\nDataset:\n```bibtex\n@misc{harris2025healthyllmsbenchmarkingllm,\n      title={Healthy LLMs? 
Benchmarking LLM Knowledge of UK Government Public Health Information},\n      author={Joshua Harris and Fan Grayson and Felix Feldman and Timothy Laurence and Toby Nonnenmacher and Oliver Higgins and Leo Loman and Selina Patel and Thomas Finnie and Samuel Collins and Michael Borowitz},\n      year={2025},\n      eprint={2505.06046},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2505.06046},\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3555},"status":null}