{"data":{"kind":"file","path":"README.md","version_id":"m5f4q4y4uu59a5w1wzy8er2c","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2845,"modified_at":"2026-02-20T23:11:44.213000","content_hash":"cf5fcaa94db407d0edce21fb00c45fef4f66eecba2f170f5e597000cd87700b8"},"entries":[],"content":"# MMLU-Pro Health\n\n## Overview\n- **Environment ID**: `mmlu-pro-health`\n- **Short description**: Filtered health split from MMLU-Pro\n- **Tags**: medical, clinical, single-turn, multiple-choice, test, evaluation, mmlu\n\n## Datasets\n- **Primary dataset(s)**: `MMLU-Pro`\n- **Source links**: [Paper](https://arxiv.org/pdf/2406.01574), [Github](https://github.com/TIGER-AI-Lab/MMLU-Pro), [HF Dataset](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)\n- **Split sizes**: \n\n    | Split       | Choices         | Count   |\n    | ----------- | --------------- | ------- |\n    | `test`  | A-J    | **553**  |\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: Binary scoring based on correctly boxed letter choice and optional think tag formatting\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run mmlu-pro-health -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and sampling (overriding some environment arguments):\n\n```bash\nmedarc-eval mmlu-pro-health -m \"openai/gpt-5-mini\" -n -1 --jitter-age\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n- The dataset does have a `validation` split with 3 rows, but these are used as few-shot examples, following the official MMLU-Pro [eval code](https://github.com/TIGER-AI-Lab/MMLU-Pro/blob/main/evaluate_from_api.py#L173).\n- Setting `use_think` to `True` works best with `num_few_shot` of at least `1`, so that the LLM can learn exactly how it should format its answer.\n\n\n## Environment Arguments\n\n| Arg               | Type           | Default | Description                                 
                                                                     |\n| ----------------- | -------------- | ------- | ---------------------------------------------------------------------------------------------------------------- |\n| `num_few_shot`    | int            | `5`     | The number of few-shot examples to use (`-1` for all).                                                           |\n| `use_think`       | bool           | `False` | Use `<think>...</think>` formatting with `ThinkParser`.         |\n| `shuffle_answers` | bool           | `False` | Shuffle answer choices.                                                                                |\n| `shuffle_seed`    | int or `null`  | `1618`  | Deterministic seed for choice shuffling when `shuffle_answers` is `True` (`null` for non-deterministic).         |\n| `jitter_age`      | bool           | `False` | Add a small decimal jitter (~±2 weeks) to ages in the question text (M-ARC style). |\n\n\n## Metrics\n\n| Metric    | Meaning                                                  |\n| --------- | -------------------------------------------------------- |\n| `accuracy` | (weight 1.0): 1.0 if parsed letter is correct, else 0.0 |\n\n\n","encoding":"utf-8","truncated":false,"total_bytes":2845},"status":null}