{"data":{"kind":"file","path":"README.md","version_id":"ud7nu2cq4lkh7mooya3i56ow","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3603,"modified_at":"2026-04-22T02:09:33.995000","content_hash":"d12e277ec91ff9f34a4f6c6b465d8d987eb96f96fb72c4115eae0dd475a6a1e2"},"entries":[],"content":"# medical-o1-verify\n\n### Overview\n- **Environment ID**: `medical-o1-verify`\n- **Short description**: Short-form clinical Q&A with ground-truth answers, scored by a lenient substring match combined with an LLM judge for semantic equivalence.\n- **Tags**: `medical`, `qa`, `single-turn`, `judge`\n\n### Datasets\n- **Primary dataset**: [`FreedomIntelligence/medical-o1-verifiable-problem`](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem) — ~40k open-ended clinical questions with short verifiable answers (diagnoses, drugs, mechanisms, etc.).\n- **Splits**: the upstream dataset ships only a `train` split; this environment shuffles with a fixed seed and carves off the first `num_eval_examples` rows (default 500) as the eval set, leaving the remainder for training.\n\n### Task\n- **Type**: single-turn\n- **Output format**: free-form reasoning followed by `Final answer: <short phrase>` on the last line (enforced only via system prompt; the rubric does not hard-require the prefix).\n- **Rubric**:\n  - `exact_match` (weight `1.0`) — case-insensitive substring check of the ground-truth answer inside the response.\n  - `judge_correctness` (weight `1.0`) — `JudgeRubric` calls `judge_model` with the default yes/no judge prompt and credits the rollout on `\"yes\"`.\n  - `response_length` — tracked as a metric only (weight `0`).\n\n### Quickstart\nSmoke-test with defaults (20 examples, 3 rollouts each):\n```bash\nprime eval run medical-o1-verify\n```\n\nPin the model and judge:\n```bash\nprime eval run medical-o1-verify \\\n  -m openai/gpt-4.1-mini \\\n  -n 50 -r 3 -t 1024 -T 0.3 \\\n  -a '{\"judge_model\": \"gpt-4.1-mini\", \"num_eval_examples\": 100}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_eval_examples` | int | `500` | Size of the eval split carved off the shuffled dataset. |\n| `num_train_examples` | int | `-1` | Cap on the train split (`-1` keeps all remaining rows). |\n| `judge_model` | str | `\"openai/gpt-4.1-mini\"` | Model used by `JudgeRubric` for semantic-equivalence scoring. Prime-prefixed model id by default; use the bare `\"gpt-4.1-mini\"` style when hitting OpenAI directly. |\n| `judge_base_url` | str \\| None | `\"https://api.pinference.ai/api/v1\"` | OpenAI-compatible base URL for the judge client. Set to `None` to call OpenAI directly. |\n| `judge_api_key_var` | str | `\"PRIME_API_KEY\"` | Name of the env var that holds the judge API key. Point this at `OPENAI_API_KEY` when using OpenAI directly. |\n| `judge_weight` | float | `1.0` | Weight of the judge reward in the final score. |\n| `exact_weight` | float | `1.0` | Weight of the substring-match reward. |\n| `seed` | int | `42` | Shuffle seed for the deterministic train/eval split. |\n\n### Required Environment Variables\n- The env var named by `judge_api_key_var` (default `PRIME_API_KEY`) must be set. Validated at load time via `vf.ensure_keys`.\n\n### Using OpenAI directly for the judge\n```bash\nprime eval run medical-o1-verify \\\n  -m openai/gpt-4.1-mini \\\n  -a '{\"judge_base_url\": null, \"judge_api_key_var\": \"OPENAI_API_KEY\", \"judge_model\": \"gpt-4.1-mini\"}'\n```\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of `exact_match` and `judge_correctness`. |\n| `exact_match` | 1.0 if the ground-truth answer appears verbatim (case-insensitive) in the response, else 0.0. |\n| `judge_correctness` | 1.0 if the LLM judge answers \"yes\" to the semantic-equivalence prompt, else 0.0. |\n| `response_length` | Character count of the final assistant message. |\n| `num_turns` | Always 1 (single-turn environment). |\n","encoding":"utf-8","truncated":false,"total_bytes":3603},"status":null}