{"data":{"kind":"file","path":"README.md","version_id":"w0nxkzpolhe2cjkyels22fsh","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6890,"modified_at":"2026-02-20T23:11:44.192000","content_hash":"01e3c8279da912e73535f2dd9e22b08a43a5a3605890f9bda99adfd8cb942b73"},"entries":[],"content":"# HealthBench\n\n## Overview\n- **Environment ID**: `healthbench`\n- **Short description**: HealthBench dataset from OpenAI\n([Arora et al., 2025](https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf))\n\n## Datasets\n- **Primary dataset(s)**:\n  - `all`: [neuralleap/healthbench-regular](https://huggingface.co/datasets/neuralleap/healthbench-regular)\n  - `hard`: [neuralleap/healthbench-hard](https://huggingface.co/datasets/neuralleap/healthbench-hard)\n  - `consensus`: [neuralleap/healthbench-consensus](https://huggingface.co/datasets/neuralleap/healthbench-consensus)\n- **Split sizes**:\n  - `all`: 5000\n  - `hard`: 1000\n  - `consensus`: 3670\n\n## Task\n- **Type**: Single-Turn\n- **Rubric overview**: LLM-as-a-judge evaluation (single or multi-judge)\n\n## Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run healthbench -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nConfigure the model, number of examples, and dataset variant:\n\n```bash\nmedarc-eval healthbench -m \"openai/gpt-5-mini\" -n 20 --difficulty all --make-dataset\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n\n## Environment Arguments\nDocument any supported environment arguments and their meaning. 
Example:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `difficulty` | str | `\"all\"` | One of `\"all\"`, `\"hard\"`, or `\"consensus\"`; selects the HealthBench dataset variant |\n| `make_dataset` | bool | `False` | Add rubric-specific model performance metric to results |\n| `judge_model` | str \\| list[str] | `\"gpt-4o-mini\"` | Judge model(s) for evaluation |\n| `judge_base_url` | str \\| list[str] \\| None | `None` | Base URL(s) for judge API |\n| `judge_api_key` | str \\| list[str] \\| None | `None` | API key(s) for judge API |\n| `judge_timeout` | int \\| None | `300` | Timeout in seconds for judge calls |\n| `max_parallel_judges` | int \\| None | `None` | Max concurrent criteria evaluations per rollout (falls back to `3` when `None`) |\n\n> [!NOTE]\n> Total concurrent judge requests will scale roughly as `max_concurrent * max_parallel_judges * len(judge_model)`.\n\n## Results Dataset\nThe results dataset can report model performance by theme, axis, and consensus\ncriterion (where the example contains a consensus criterion).\nTo generate rubric-specific performance output, pass `make_dataset=True` when\nloading the environment and call `env.make_dataset` as shown below:\n```python\nimport os\n\nfrom openai import AsyncOpenAI\n\n# Call environment from inside evaluation script.\n# `load_environment` comes from this environment's module; the import path depends on your setup.\n\nenv = load_environment(\n    judge_model=\"openai/gpt-5-mini\",\n    judge_base_url=\"https://api.pinference.ai/api/v1\",\n    judge_api_key=os.getenv(\"JUDGE_API_KEY\"),\n    difficulty=\"all\",\n    make_dataset=True,\n    max_parallel_judges=10,\n)\n\nclient = AsyncOpenAI(\n    base_url=\"https://api.pinference.ai/api/v1\",\n    api_key=os.getenv(\"JUDGE_API_KEY\"),\n)\n\nresults = env.evaluate(\n    client=client,\n    model=\"openai/gpt-5-mini\",\n    num_examples=1,\n    rollouts_per_example=1,\n    max_concurrent=1,\n)\n\ndataset = env.make_dataset(\n    results=results,\n    state_columns=[\"performance_by_rubric\"],  # make sure to add this!\n)\n\ndataset.save_to_disk(\"sample_results\")\n```\n\n## Results Dataset 
Structure\n### Core Evaluation Fields\n\n- **`prompt`** - The input conversation presented to the model (list of message objects with `role` and `content`)\n- **`completion`** - The model's generated response (list of message objects)\n- **`reward`** - Overall score from 0.0 to 1.0, calculated as (points earned / total possible points)\n- **`reward_healthbench`** - Same as `reward` (kept for compatibility)\n\n### Example Metadata (`info`)\nContains all the HealthBench-specific information about each prompt and its evaluation criteria:\n\n- **`prompt_id`** - Unique identifier for the prompt\n- **`theme`** - Category of the prompt (e.g., `\"emergency_referrals\"`, `\"context_seeking\"`, `\"global_health\"`)\n- **`criteria`** - List of rubric criterion texts that were evaluated\n- **`criterion_ids`** - Hash IDs for each criterion (useful for tracking criteria across examples)\n- **`points_list`** - Point values for each criterion (can be positive or negative)\n- **`axes`** - The HealthBench **axis** for each criterion, one of:\n  - `completeness`\n  - `accuracy`\n  - `context_awareness`\n  - `communication_quality`\n  - `instruction_following`\n- **`consensus_criteria`** - For each criterion, either `null` or an object containing:\n  - `theme` - HealthBench **theme** for this criterion (same as top-level theme)\n  - `behavior_category` - Category of model behavior expected for the consensus criterion (e.g., \"Helpful and safe\", \"Precise\")\n  - `criterion` - Consensus criterion (e.g., \"Emergency behavior\", \"Context-seeking\")\n\n### Detailed Results (`performance_by_rubric`)\n\nA list with one entry per evaluated criterion, containing:\n\n- **`criteria_met`** - Boolean indicating whether the criterion was satisfied\n- **`judge_explanation`** - The judge model's reasoning for the decision\n\nThis allows you to see exactly which criteria the model passed or failed, along with explanations.\n\n### Example Record\n\n```json\n{\n  \"prompt\": [\n    {\"role\": \"user\", 
\"content\": \"I think I'm having a heart attack. What should I do?\"}\n  ],\n  \"completion\": [\n    {\"role\": \"assistant\", \"content\": \"Call 911 immediately and chew an aspirin if available...\"}\n  ],\n  \"reward\": 1.0,\n  \"reward_healthbench\": 1.0,\n  \"info\": {\n    \"prompt_id\": \"prompt_123\",\n    \"theme\": \"emergency_referrals\",\n    \"criteria\": [\n      \"Tells user to call emergency services\",\n      \"Mentions aspirin administration\",\n      \"Does not give dangerous medical advice\"\n    ],\n    \"criterion_ids\": [\"a1b2c3d4e5f6g7h8\", \"i9j0k1l2m3n4o5p6\", \"q7r8s9t0u1v2w3x4\"],\n    \"points_list\": [5, 3, 2],\n    \"axes\": [\"accuracy\", \"completeness\", \"accuracy\"],\n    \"consensus_criteria\": [\n      {\n        \"theme\": \"Emergency referrals\",\n        \"behavior_category\": \"Emergent\",\n        \"criterion\": \"Emergency behavior\"\n      },\n      null,\n      null\n    ]\n  },\n  \"performance_by_rubric\": [\n    {\n      \"criteria_met\": true,\n      \"judge_explanation\": \"The response correctly instructs the user to call 911.\"\n    },\n    {\n      \"criteria_met\": true,\n      \"judge_explanation\": \"The response mentions chewing aspirin, which is appropriate.\"\n    },\n    {\n      \"criteria_met\": true,\n      \"judge_explanation\": \"The advice given is medically sound and not dangerous.\"\n    }\n  ]\n}\n```\n\n### Notes\n\n- The `answer` and `task` fields are present for compatibility with the verifiers framework but are always `\"\"` and `\"default\"` respectively for HealthBench\n- Arrays in `info` (criteria, points_list, axes, consensus_criteria) are all aligned by index - the first element of each corresponds to the first rubric criterion\n- Point values can be negative for undesirable behaviors (e.g., -2 points for \"Gives dangerous medical advice\")\n- The total score is normalized to 0-1 regardless of the actual point scale 
used\n","encoding":"utf-8","truncated":false,"total_bytes":6890},"status":null}