{"data":{"kind":"file","path":"README.md","version_id":"vvljb7dae5nctbfdyxxk73k6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1753,"modified_at":"2026-02-20T23:11:44.209000","content_hash":"f6a5f47d4af22e7c14ac07d6120f98d48c9b864dd187872c5720f1b31d20396d"},"entries":[],"content":"# medxpertqa\n\n## Overview\n- **Environment ID**: `medxpertqa`\n- **Short description**: MedXpertQA is a highly challenging and comprehensive benchmark designed to evaluate expert-level medical knowledge and advanced reasoning capabilities. We only use the text subset for now.\n- **Tags**: mcq\n\n## Datasets\n- **Primary dataset(s)**: TsinghuaC3I/MedXpertQA\n- **Source links**: [HuggingFace](https://huggingface.co/datasets/TsinghuaC3I/MedXpertQA)\n- **Split sizes**: test subset - 2.45k rows\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: Binary scoring (1.0 / 0.0) based on correct letter or answer text match\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run medxpertqa -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and sampling:\n\n```bash\nmedarc-eval medxpertqa -m \"openai/gpt-5-mini\" -n 20 --answer-format boxed\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `question_type` | str | `\"all\"` | Question subset to evaluate (e.g., all, text-only subset variants supported by the environment). |\n| `use_think` | bool | `False` | Whether to expect reasoning in `<think>...</think>` with boxed answers. |\n| `shuffle_answers` | bool | `False` | Whether to shuffle answer options per question. |\n| `shuffle_seed` | int \\| None | `1618` | Seed for deterministic answer shuffling. |\n| `answer_format` | str | `\"xml\"` | Output format parser to use (`xml` or `boxed`). |\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of criteria) |\n| `accuracy` | Exact match on target answer |\n","encoding":"utf-8","truncated":false,"total_bytes":1753},"status":null}