{"data":{"kind":"file","path":"README.md","version_id":"e3thdz23ofhynanzkuu6oo5t","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3587,"modified_at":"2026-02-20T23:11:44.196000","content_hash":"6c5f27b48e8816987a97432b7a7d5923f2e2c22d9bc4026f06decf5aee2f16f9"},"entries":[],"content":"# Med-HALT\n\n## Overview\n- **Environment ID**: `med_halt`\n- **Short description**: Evaluate clinical reasoning hallucinations using Med-HALT’s FCT (False Confidence Test) and NOTA (None of the Above) tasks.\n- **Tags**: medical, hallucination, single-turn, multiple-choice, clinical reasoning, evaluation\n\n## Datasets\n- **Primary dataset(s)**: `Med-HALT` (reasoning subset)\n- **Source links**: [Paper](https://arxiv.org/abs/2307.15343), [HF Dataset](https://huggingface.co/datasets/openlifescienceai/Med-HALT)\n\nThe upstream dataset ships only a **single split**, but internally includes a `split_type` field (`train` / `val` / `test`).  \nThis environment **filters to the `val` subset**, consistent with MedARC evaluation standards.\n\n- **Split sizes**:\n\n    | Test Type       | Examples  | Description |\n    | --------------- | --------- | ----------- |\n    | `reasoning_FCT` | **5,154** | False Confidence Test - evaluates if models can correctly assess proposed answers |\n    | `reasoning_nota` | **5,154** | None of the Above Test - tests if models can identify when no option is correct |\n    | `reasoning_fake` | _Not used_ | Fake Questions Test - assesses if models can recognize nonsensical questions (excluded by default) |\n\nNotes:\n- FCT’s validation subset is **purposely imbalanced**: almost all examples contain *incorrect* student answers.  
\n- This is expected and documented behavior of the source dataset.\n\n## Task\n- **Type**: single-turn\n- **Rubric overview**: Binary scoring based on correct letter choice (A, B, C, D, etc.)\n\n## Quickstart\nRun an evaluation with default settings (FCT question type):\n\n```bash\nprime eval run med_halt -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and sampling:\n\n```bash\nmedarc-eval med_halt -m \"openai/gpt-5-mini\" -n 100 -s --question-type reasoning_nota --num-few-shot 3 --no-shuffle-answers\n```\n\nNotes:\n- Pass environment arguments to `medarc-eval` directly as flags (for example, `--question-type reasoning_nota` or `--num-few-shot 3`).\n\n## Environment Arguments\n\n| Arg               | Type        | Default | Description |\n| ----------------- | ----------- | ------- | ----------- |\n| `system_prompt`   | str \\| None | `None` | Custom system prompt (defaults to standard XML/BOXED prompt based on `answer_format`) |\n| `shuffle_answers` | bool | `False` | Whether to shuffle answer choices |\n| `shuffle_seed`    | int \\| None | `1618` | Random seed for reproducible answer shuffling |\n| `question_type`   | str  | `\"reasoning_fct\"` | Question family to evaluate: `\"reasoning_fct\"` or `\"reasoning_nota\"` |\n| `num_few_shot`    | int  | `2` | Number of few-shot examples to include in prompts |\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `accuracy` | 1.0 if parsed letter matches the gold letter, else 0.0 |\n\n## Credits\n\nDataset:\n\n```bibtex\n@article{pal2023medhalt,\n  title={Med-HALT: Medical Domain Hallucination Test for Large Language Models},\n  author={Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},\n  journal={arXiv preprint arXiv:2307.15343},\n  year={2023}\n}\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3587},"status":null}