{"data":{"kind":"file","path":"README.md","version_id":"g8mb7l65ke7hirvz6t7341b8","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3783,"modified_at":"2026-02-20T23:11:44.188000","content_hash":"6418f407c8c46498a6b36801f878a0ff825492e960d4e638bd82e78d647fb788"},"entries":[],"content":"# careqa\n\nEvaluation environment for the [HPAI-BSC/CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA) dataset.\n\n## Overview\n- **Environment ID**: `careqa`  \n- **Short description**: CareQA is a healthcare QA dataset with **multiple-choice** and **open-ended clinical reasoning questions**. This environment supports both modes through the `split` parameter.  \n- **Tags**: healthcare, medical QA, clinical reasoning, MCQ, single-turn\n\n## Datasets\n- **Primary dataset(s)**:\n  - `CareQA_en` – multiple-choice clinical questions with 4 options and correct answer labels\n  - `CareQA_en_open` – open-ended clinical questions with reference answers\n- **Source links**:\n  - [Hugging Face CareQA dataset](https://huggingface.co/datasets/HPAI-BSC/CareQA)\n\n## Task\n- **Type**: single-turn\n  - MCQ mode: `vf.Parser()` or `vf.ThinkParser()` for extracting boxed answers\n  - Open-ended mode: `XMLParser()` for judge responses\n- **Rubric overview**:\n  - **MCQ mode (`en`)**: `vf.Rubric()` measuring **accuracy** (letter match A–D)\n  - **Open-ended mode (`open`)**: LLM-as-judge scoring (single or multi-judge)\n\n## Quickstart\n\n**Multiple-choice evaluation:**\n```bash\nprime eval run careqa -m \"openai/gpt-5-mini\" -n 5 -s -a '{\"split\": \"en\"}'\n```\n\n**Open-ended evaluation:**\n```bash\nmedarc-eval careqa --split open -m \"openai/gpt-5-mini\" -n 10 -s --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` are all list-valued and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n**With configured default judges for open-ended mode:**\n```bash\nmedarc-eval careqa --split open -m \"openai/gpt-5-mini\" -n 10 -s \\\n  --judge-model \"openai/gpt-5-mini\" \\\n  --judge-model \"google/gemini-3-flash-preview\"\n```\n\n**With shuffled answer options (MCQ only):**\n```bash\nmedarc-eval careqa --split en --shuffle-answers --shuffle-seed 1618 -m \"openai/gpt-5-mini\" -n 10 -s\n```\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | Required | Mode: `en` (multiple-choice) or `open` (open-ended) |\n| `system_prompt` | str \\| None | `None` | Custom system prompt (uses mode-appropriate default if not specified) |\n| `shuffle_answers` | bool | `False` | Randomly shuffle answer options (MCQ only) |\n| `shuffle_seed` | int \\| None | `1618` | Seed for answer shuffling (MCQ only) |\n| `judge_model` | str \\| list[str] | `\"gpt-4o-mini\"` | Model(s) for LLM-as-judge evaluation (open-ended only) |\n| `judge_base_url` | str \\| list[str] \\| None | `None` | Base URL(s) for judge API |\n| `judge_api_key` | str \\| list[str] \\| None | `None` | API key(s) for judge (falls back to `OPENAI_API_KEY` env var) |\n\n## Metrics\n\n### MCQ Mode\n| Metric        | Meaning |\n|---------------|---------|\n| `reward`      | Main scalar reward (weighted sum of rubric criteria) |\n| `accuracy`    | Exact match on target MCQ answer (letter A–D) |\n\n### Open-Ended Mode\n| Metric        | Meaning |\n|---------------|---------|\n| `reward`      | Main scalar reward (weighted sum of rubric criteria) |\n| `judge_score` | LLM-assigned score evaluating answer quality, correctness, and clinical reasoning |\n\n## Example Usage\n\n```python\nimport verifiers as vf\n\n# Load MCQ environment\nenv_mcq = vf.load_environment(\"careqa\", split=\"en\", shuffle_answers=True)\n\n# Load open-ended environment\nenv_open = vf.load_environment(\n    \"careqa\",\n    split=\"open\",\n    judge_model=[\"openai/gpt-5-mini\", \"google/gemini-3-flash-preview\"],\n    judge_base_url=\"https://api.pinference.ai/api/v1\",\n)\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3783},"status":null}