{"data":{"kind":"file","path":"README.md","version_id":"wxw6keq2rq57vgzh8tgx5hv4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3297,"modified_at":"2026-02-20T23:11:44.192000","content_hash":"edf53408c1d1c259a413a58ba468a6c15302543d27155c12d032d53bdd423e1e"},"entries":[],"content":"# LongHealth\n\n## Overview\n- **Environment ID**: `longhealth`\n- **Short description**: Evaluates LLM ability to process and extract information from long clinical documents (5K-6.7K words per case)\n- **Tags**: medical, clinical, single-turn, multiple-choice, long-context, eval\n\n\n## Datasets\n- **Paper**: https://arxiv.org/abs/2401.14490\n- **Dataset / Project**: https://github.com/kbressem/LongHealth\n- **Split sizes**: 400 questions total (20 patients × 20 questions each). No official train/val splits\n- **Context length**: 5,090 to 6,754 words per patient case\n\n## Benchmark Tasks\n\n### Task 1: Information Extraction\n- Tests ability to extract correct information from long clinical documents\n- Answer is ALWAYS present in provided documents\n- 5-option MCQ (A/B/C/D/E)\n\n### Task 2: Negation Detection & Hallucination Prevention\n- Tests ability to identify when information is NOT available\n- Creates pairs of examples:\n  - **Negation**: Only distractor documents (should answer F: \"Cannot be answered\")\n  - **Identification**: Answer docs + distractors (should answer correctly)\n- 6-option MCQ (A/B/C/D/E/F)\n\n### Task 3: Temporal Reasoning\n- Embedded within Task 2 framework\n- Questions focus on chronological ordering and temporal relationships\n- Tests understanding of event sequences in medical timelines\n\n## Quickstart\nRun an evaluation with default settings (Task 1, first 5 examples):\n\n```bash\nprime eval run longhealth -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure model and task:\n\n```bash\nmedarc-eval longhealth -m \"openai/gpt-5-mini\" -n 20 -s --task task2\n\nmedarc-eval longhealth -m \"openai/gpt-5-mini\" -n 10 -s --task 
all --doc-shuffle-seed 2718 --max-context-tokens 30000\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `task` | str | `\"task1\"` | Which task(s): `\"task1\"` (extraction), `\"task2\"` (negation), `\"all\"` (both) |\n| `max_context_tokens` | int | `16000` | Maximum tokens for document context |\n| `shuffle_docs` | bool | `True` | Shuffle document order to test positional bias |\n| `doc_shuffle_seed` | int \\| None | `-1` | Seed for document shuffling (`-1` for nondeterministic order each run) |\n| `shuffle_answers` | bool | `False` | Shuffle answer options |\n| `shuffle_seed` | int \\| None | `1618` | Seed for answer shuffling |\n| `max_examples` | int | `-1` | Limit number of examples (`-1` for all) |\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Exact match accuracy (1.0 if correct letter, 0.0 otherwise) |\n| `info.task` | Which sub-task: `task1`, `task2_negation`, or `task2_identification` |\n| `info.has_answer_docs` | Whether answer-containing documents were included |\n| `info.num_docs` | Number of documents in the context |\n\n## Example Usage\n\n```python\nimport asyncio\n\nimport verifiers as vf\nfrom openai import AsyncOpenAI\n\n# Load Task 1\nenv = vf.load_environment(\"longhealth\", task=\"task1\")\n\n# Load Task 2 with custom settings (replaces the Task 1 environment above)\nenv = vf.load_environment(\n    \"longhealth\",\n    task=\"task2\",\n    max_context_tokens=14000,\n    shuffle_docs=True,\n)\n\n# Run evaluation programmatically\nasync def main():\n    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment\n    return await env.evaluate(client, \"gpt-4.1-mini\", num_examples=10)\n\nresults = asyncio.run(main())\n```\n\n## Authors\nThis environment has been put together by:\n\nShamus Sim Zi Yang - ([@ss8319](https://github.com/ss8319))\n","encoding":"utf-8","truncated":false,"total_bytes":3297},"status":null}