{"data":{"kind":"file","path":"README.md","version_id":"t29kmeli2kk33egrnz3v9t6g","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3642,"modified_at":"2026-02-15T18:57:24.416000","content_hash":"dbb3e9cb530127bf283a5c9e2be546d79f4f98dc0508b9d2a58cbdd266a44824"},"entries":[],"content":"# GPQA Diamond\r\n\r\n**Environment ID:** `gpqa_diamond`  \r\n**Short description:** A graduate-level, Google-proof benchmark consisting of very difficult multiple-choice questions in biology, physics, and chemistry, written and validated by PhD-level experts.\r\n\r\n---\r\n\r\n## Overview\r\n\r\nGPQA Diamond is the highest-quality subset of the GPQA benchmark. It contains graduate-level science questions designed to be difficult even for skilled non-experts with internet access.\r\n\r\nEach question:\r\n* Was written by a domain expert (PhD-level).\r\n* Was validated by two additional experts.\r\n* Was failed by the majority of highly-skilled non-expert validators (who had 30+ minutes and Google access).\r\n\r\nThe environment evaluates models on their ability to reason through complex scientific problems and select the correct answer among four options.\r\n\r\n**Tags:** `gpqa`, `science`, `expert-level`, `benchmark`, `evaluation`, `physics`, `chemistry`, `biology`\r\n\r\n---\r\n\r\n## Datasets\r\n\r\n**Primary dataset:** `Idavidrein/gpqa` (HuggingFace)  \r\n**Subset used:** `gpqa_diamond`  \r\n\r\n**Source links:**\r\n* Dataset: [https://huggingface.co/datasets/Idavidrein/gpqa](https://huggingface.co/datasets/Idavidrein/gpqa)\r\n* Paper: [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](https://arxiv.org/abs/2311.12022)\r\n\r\n**Split sizes:**\r\n* Diamond: 198 questions (stored in the `train` split on HuggingFace).\r\n\r\n---\r\n\r\n## Dependencies\r\n\r\nThis environment depends on:\r\n- `verifiers`\r\n- `datasets`\r\n\r\nInstalled automatically via `uv run vf-install gpqa_diamond`.\r\n\r\n---\r\n\r\n## Task\r\n\r\n**Type:** single-turn  \r\n**Parser:** 
`verifiers.utils.data_utils.extract_boxed_answer`  \r\n**Rubric overview:** \r\n* **Primary Reward:** 1.0 if the model's chosen letter (A, B, C, or D) matches the correct answer's letter after the deterministic shuffle; 0.0 otherwise.\r\n* **Informational Metrics:** Domain and Difficulty are tracked but do not affect the score.\r\n\r\n---\r\n\r\n## Quickstart\r\n\r\nRun evaluation on all GPQA-Diamond questions:\r\n\r\n```bash\r\nuv run vf-eval gpqa_diamond -m openai/gpt-4o -s\r\n```\r\n\r\nWith an explicit API base URL and API key variable:\r\n\r\n```bash\r\nuv run vf-eval gpqa_diamond -m openai/gpt-4o -b https://api.pinference.ai/api/v1 -k PRIME_API_KEY -s\r\n```\r\n\r\nRun on a small subset (5 questions):\r\n\r\n```bash\r\nuv run vf-eval gpqa_diamond -m openai/gpt-4o -b https://api.pinference.ai/api/v1 -k PRIME_API_KEY -n 5 -s\r\n```\r\n\r\nUsing another model:\r\n\r\n```bash\r\nuv run vf-eval gpqa_diamond -m openai/o1-mini -b https://api.pinference.ai/api/v1 -k PRIME_API_KEY -n 5 -s\r\n```\r\n\r\nRun using a specific dataset split:\r\n\r\n```bash\r\nuv run vf-eval gpqa_diamond -m openai/gpt-4o -b https://api.pinference.ai/api/v1 -k PRIME_API_KEY -a '{\"split\": \"train\"}' -s\r\n```\r\n\r\n---\r\n\r\n## Notes\r\n\r\n* Choices are shuffled per question to avoid positional bias.\r\n* The correct answer letter is derived after shuffling.\r\n* Questions come exclusively from the expert-validated GPQA Diamond subset.\r\n* Evaluation is based on letter matching (not answer text matching).\r\n\r\n---\r\n\r\n## Environment Arguments\r\n\r\n| Arg   | Type | Default   | Description                                                |\r\n| ----- | ---- | --------- | ---------------------------------------------------------- |\r\n| split | str  | `\"train\"` | Which dataset split to load (HF stores Diamond in `train`) |\r\n\r\n---\r\n\r\n## Metrics\r\n\r\n| Metric     | Meaning                                                      |\r\n| ---------- | ------------------------------------------------------------ |\r\n| reward     | Binary: 1.0 if the model selects the expert-validated 
answer |\r\n| domain     | Metadata: High-level field (Physics, Chemistry, Biology)     |\r\n| difficulty | Metadata: Writer’s difficulty estimate                       |\r\n\r\n---","encoding":"utf-8","truncated":false,"total_bytes":3642},"status":null}