{"data":{"kind":"file","path":"README.md","version_id":"i40sdwbhu8ntfx2o3wmtad9n","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2915,"modified_at":"2026-02-20T23:11:44.201000","content_hash":"406814f21ea60f4463d48bbba8494a63a82bed941556210aba76d76f25e33f31"},"entries":[],"content":"# medbullets\n\n## Overview\n- **Environment ID**: `medbullets`\n- **Short description**: USMLE-style multiple-choice questions from Medbullets.\n- **Tags**: medical, clinical, single-turn, multiple-choice, USMLE, train, evaluation\n\n## Datasets\n- **Primary dataset(s)**: `Medbullets-4` and `Medbullets-5`\n- **Source links**: [Paper](https://arxiv.org/pdf/2402.18060), [Github](https://github.com/HanjieChen/ChallengeClinicalQA), [HF Dataset](https://huggingface.co/datasets/mkieffer/Medbullets)\n- **Split sizes**:\n\n    | Split       | Choices         | Count   |\n    | ----------- | --------------- | ------- |\n    | `op4_test` | {A, B, C, D}    | **308** |\n    | `op5_test` | {A, B, C, D, E} | **308** |\n\n    `op5_test` contains the same content as `op4_test`, but with one additional answer choice to increase difficulty. Note that while the content is the same, the letter choice corresponding to the correct answer is sometimes different between these splits.\n\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: Binary scoring based on correctly boxed letter choice and optional think tag formatting\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run medbullets -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and sampling:\n\n```bash\nmedarc-eval medbullets -m \"openai/gpt-5-mini\" -n -1 --num-options 5 --shuffle-answers --shuffle-seed 1618 --answer-format boxed\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n\n## Environment Arguments\nDocument any supported environment arguments and their meaning. 
Example:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_options` | int | `4` | Number of answer options: `4` → {A, B, C, D}; `5` → {A, B, C, D, E} |\n| `use_think` | bool | `False` | Whether to check for `<think>...</think>` formatting with `ThinkParser` |\n| `shuffle_answers` | bool | `False` | Whether to shuffle the answer choices |\n| `shuffle_seed` | int \\| None | `1618` | Seed for deterministic answer shuffling |\n| `answer_format` | str | `\"boxed\"` | Output parser format: `\"boxed\"` or `\"xml\"` |\n\n\n## Metrics\nThe rubric emits the following metrics:\n\n| Metric | Weight | Meaning |\n| ------ | ------ | ------- |\n| `correct_answer_reward_func` | 1.0 | 1.0 if the parsed letter is correct, else 0.0 |\n| `parser.get_format_reward_func()` | 0.0 | Optional format adherence (not counted toward the score) |\n","encoding":"utf-8","truncated":false,"total_bytes":2915},"status":null}