{"data":{"kind":"file","path":"README.md","version_id":"c4bl7vy10gccvbbx689wme00","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3181,"modified_at":"2026-02-20T23:11:44.192000","content_hash":"5b346f7432cb3477e7377767e94bba9b89fd07c792bb5a8338fb9b3c8a4d5f04"},"entries":[],"content":"# M-ARC\n\n## Overview\n- **Environment ID**: `m-arc`\n- **Short description**: Long-tail medical questions requiring flexible clinical reasoning.\n- **Tags**: medical, clinical, single-turn, multiple-choice, test, evaluation\n\n## Datasets\n- **Primary dataset**: `M-ARC`\n- **Source links**: [Paper](https://arxiv.org/pdf/2502.04381), [GitHub](https://github.com/dbernardo05/medARC-QA), [HF Dataset](https://huggingface.co/datasets/mkieffer/M-ARC)\n- **Split sizes**: \n\n    | Split       | Choices         | Count   |\n    | ----------- | --------------- | ------- |\n    | `test`  | A-G    | **100**  |\n\n- **Few-shot dataset**: `MMLU-Pro-Health`\n- **Source links**: [Paper](https://arxiv.org/pdf/2406.01574), [GitHub](https://github.com/TIGER-AI-Lab/MMLU-Pro), [HF Dataset](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)\n- **Split sizes**: \n\n    | Split       | Choices         | Count   |\n    | ----------- | --------------- | ------- |\n    | `validation`  | A-J    | **3**  |\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: Binary scoring based on correctly boxed letter choice and optional think tag formatting\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run m-arc -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and sampling:\n\n```bash\nmedarc-eval m-arc -m \"openai/gpt-5-mini\" -n -1 -s --num-few-shot 1 --shuffle-answers --shuffle-seed 1618\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n- The official M-ARC [eval code](https://github.com/dbernardo05/medARC-QA/blob/main/evaluate_from_api.py#L253) loads the 
entire MMLU-Pro `validation` split to use as few-shot examples. Here, however, we only use rows from the health category, in line with how the official MMLU-Pro [eval code](https://github.com/TIGER-AI-Lab/MMLU-Pro/blob/main/evaluate_from_api.py#L225) filters by category.\n- Setting `use_think` to `True` works best with `num_few_shot` of at least `1`, so that the LLM can learn exactly how it should format its answer.\n\n\n## Environment Arguments\n\n| Arg                  | Type | Default | Description                                                                                                                                                                          |\n| -------------------- | ---- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |\n| `num_few_shot`  | int  | `5`    | The number of few-shot examples to use (`-1` for all)                                                                                                                                |\n| `use_think`          | bool | `False` | Whether to check for `<think>...</think>` formatting with `ThinkParser`|\n| `shuffle_answers`    | bool | `False` | Whether to shuffle answer choices |\n| `shuffle_seed`       | int \\| None | `1618` | Seed for deterministic answer shuffling |\n\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `correct_answer_reward_func` | (weight 1.0): 1.0 if parsed letter is correct, else 0.0|\n","encoding":"utf-8","truncated":false,"total_bytes":3181},"status":null}