{"data":{"kind":"file","path":"README.md","version_id":"d1nngq743s1dvg4uaiwykzf0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3942,"modified_at":"2026-02-20T23:11:44.201000","content_hash":"f86cc2e148fcf46a68b0a5a0e84d7ff079118f6ed8ab72ad9bc32e515a2212e1"},"entries":[],"content":"# MedCalc-Bench\n\n## Overview\n- **Environment ID**: `medcalc-bench`\n- **Short description**: Evaluate clinical calculator reasoning and numeric/date outputs. Optionally equips the model with a Python execution tool or a calculator tool.\n- **Tags**: medical, clinical, single-turn, numeric, date, evaluation\n\n## Dataset\n\nTwo dataset variants are available:\n\n- `verified` (default): `nsk7153/MedCalc-Bench-Verified`\n- `v1.2`: `ncbi/MedCalc-Bench-v1.2`\n\n| Split | `v1.2` | `verified` |\n| ----- | ------ | ---------- |\n| train | 10,543 | 10,538 |\n| test  | 1,100  | 1,100  |\n\nEach example includes a `Patient Note`, `Question`, `Calculator ID`, `Ground Truth`, `Lower Bound`, and `Upper Bound`.\n\n## Task\n- **Type**: single-turn, multi-turn with tool use\n- **Prompt**: `_build_prompt(patient_note, question)` instructs `<think>...</think>` and `<answer>...</answer>`.\n- **Rubric**: `check_correctness` validates by calculator type:\n  - IDs 13, 68: date equality (MM/DD/YYYY)\n  - ID 69: tuple `(weeks, days)` equality\n  - Integer IDs: integer equality (with rounding as needed)\n  - Decimal IDs: numeric value within `[lower_bound, upper_bound]`\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run medcalc-bench -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and sampling:\n\n```bash\nmedarc-eval medcalc-bench -m \"openai/gpt-5-mini\" -n 5 -s --one-shot --add-python-tool\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n- Setting `use_think` to `True` works best with `one_shot` set to `True`, so that the LLM can learn exactly how it 
should format its answer.\n- The packaged `medarc_verifiers` XMLParser suppresses the upstream warning about `<think>` and still parses `<answer>` even if `<think>` is malformed.\n- **Tool safety**: The Python tool uses [`RestrictedPython`](https://restrictedpython.readthedocs.io/) for sandboxed execution with limited builtins (only `math`, `numpy`, `scipy` imports allowed). The calculator tool uses [`simpleeval`](https://github.com/danthedeckie/simpleeval) with only safe math operations.\n\n## Environment Arguments\n\n| Arg         | Type | Default | Description |\n| ----------- | ---- | ------- | ----------- |\n| `one_shot` | bool | `False` | Whether to use the one-shot prompt |\n| `add_python_tool` | bool | `False` | Add the Python code execution tool (uses `RestrictedPython` with limited builtins) |\n| `add_calculator_tool` | bool | `False` | Add the calculator tool (uses `simpleeval` with safe math operations) |\n| `max_turns` | int | `20` | Maximum number of turns in the tool-use environment |\n| `version` | str | `\"verified\"` | Dataset variant: `\"verified\"` (default) or `\"1.2\"` |\n| `answer_format` | str | `\"xml\"` | Answer format: `\"xml\"` (default) or `\"boxed\"` |\n| `use_think` | bool | `False` | Whether to instruct `<think>...</think>` formatting |\n| `system_prompt` | str | `None` | Custom system prompt (defaults to standard XML/BOXED prompt based on `answer_format`) |\n\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `check_correctness` | Weight 1.0. Validates numeric, date, or tuple answers per calculator ID |\n\n## Adjustments\n\nThe prompt is adjusted to output the step-by-step thinking and the final answer in `<think>` and `<answer>` tags instead of a JSON response.\n\n\n## References\n\n```bibtex\n@misc{khandekar2024medcalcbench,\n      title={MedCalc-Bench: Evaluating Large Language Models for Medical Calculations},\n      author={Nikhil Khandekar and Qiao Jin and Guangzhi Xiong and Soren Dunn and Serina S Applebaum and Zain Anwar and 
Maame Sarfo-Gyamfi and Conrad W Safranek and Abid A Anwar and Andrew Zhang and Aidan Gilson and Maxwell B Singer and Amisha Dave and Andrew Taylor and Aidong Zhang and Qingyu Chen and Zhiyong Lu},\n      year={2024},\n      eprint={2406.12036},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2406.12036},\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3942},"status":null}