# SuperGPQA Medicine

## Overview
- **Environment ID**: `supergpqa_medicine`
- **Short description**: Filtered medicine split from SuperGPQA
- **Tags**: medicine, single-turn, multiple-choice, test, evaluation, supergpqa

## Datasets
- **Primary dataset**: `m-a-p/SuperGPQA` (train split, Medicine discipline only)
- **Source links**: [Paper](https://www.arxiv.org/abs/2502.14739), [GitHub](https://github.com/SuperGPQA/SuperGPQA), [HF Dataset](https://huggingface.co/datasets/m-a-p/SuperGPQA)
- **Split sizes**:

    | Split (by difficulty) | Choices | Count    |
    | --------------------- | ------- | -------- |
    | `all`                 | A-J     | **2755** |
    | `easy`                | A-J     | **909**  |
    | `middle`              | A-J     | **1629** |
    | `hard`                | A-J     | **217**  |

## Task
- **Type**: single-turn
- **Rubric overview**: Binary scoring based on the correctly boxed letter choice and optional think-tag formatting

## Quickstart
Run an evaluation with default settings:

```bash
prime eval run supergpqa_medicine -m "openai/gpt-5-mini" -n 5 -s
```

Enable few-shot prompting and filter by field and difficulty using `medarc-eval`:

```bash
medarc-eval supergpqa_medicine -m "openai/gpt-5-mini" -n -1 --few-shot --field clinical_medicine --difficulty hard
```

Notes:
- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).
- The dataset has a `validation` split with 3 rows, but these are used as few-shot examples, following the official MMLU-Pro [eval code](https://github.com/TIGER-AI-Lab/MMLU-Pro/blob/main/evaluate_from_api.py#L173).
- Set `few_shot=true` to include the fixed five-shot examples from the official setup.

## Environment Arguments

| Arg               | Type / Choices | Default | Description |
| ----------------- | -------------- | ------- | ----------- |
| `field`           | `"all"` or one of `basic_medicine`, `clinical_medicine`, `pharmacy`, `public_health_and_preventive_medicine`, `stomatology`, `traditional_chinese_medicine` | `all` | Filter by medical field. |
| `difficulty`      | `"all"`, `"easy"`, `"middle"`, `"hard"` | `all` | Filter by question difficulty. |
| `few_shot`        | bool | `False` | Include fixed five-shot examples in prompts when `True`. |
| `shuffle_answers` | bool | `False` | Shuffle answer choices per row. |
| `shuffle_seed`    | int or `null` | `1618` | Seed for deterministic shuffling when enabled. |
| `jitter_age`      | bool | `False` | Add small decimal jitter (~±2 weeks) to age mentions. |

## Metrics

| Metric     | Meaning                                                     |
| ---------- | ----------------------------------------------------------- |
| `accuracy` | 1.0 if the parsed letter is correct, else 0.0 (weight 1.0). |