{"data":{"kind":"file","path":"README.md","version_id":"doeime8mpwd1zur8q23j2wn6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2330,"modified_at":"2026-02-20T23:11:44.188000","content_hash":"d3c99af498f7d06a794abde98461e71e79a7214c9a27f19ac4faee31774c2bcb"},"entries":[],"content":"# AgentClinic Environment\n\nMulti-agent medical diagnosis environment for evaluating LLMs on clinical diagnosis through interactive conversations.\n\n## Quickstart\n\nRun a quick evaluation with `prime eval`:\n\n```bash\nprime eval run agentclinic -m \"openai/gpt-5-mini\" -n 5 -s\n```\n\nOr use `medarc-eval` with a single judge:\n\n```bash\nmedarc-eval agentclinic -m \"openai/gpt-5-mini\" -n 5 -s --judge-model \"openai/gpt-5-mini\"\n```\n\nThis environment supports multi-judge via the `medarc-verifiers` `MultiJudge` rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. `judge_model`, `judge_base_url`, and `judge_api_key` are all list-valued and correspond positionally; if only one `judge_base_url` or `judge_api_key` is provided, it is used for all judge models.\n\n```bash\nmedarc-eval agentclinic \\\n  -m \"openai/gpt-5-mini\" \\\n  -n 10 \\\n  -s \\\n  --patient-model \"openai/gpt-5-mini\" \\\n  --measurement-model \"openai/gpt-5-mini\" \\\n  --judge-model \"openai/gpt-5-mini\" \\\n  --judge-model \"google/gemini-3-flash-preview\" \\\n```\n\n## Usage\n\nFilter to a specific dataset:\n\n```bash\nmedarc-eval agentclinic \\\n  -m \"openai/gpt-5-mini\" \\\n  -n 10 \\\n  -s \\\n  --dataset-path \"agentclinic_medqa_extended.jsonl\" \\\n  --judge-model \"openai/gpt-5-mini\"\n```\n\n## Configuration\n\n## Datasets\n\n- **MedQA Extended** (214 cases): `agentclinic_medqa_extended.jsonl`\n- **NEJM Extended** (120 cases): `agentclinic_nejm_extended.jsonl`\n  - Text-only in this environment; `image_url` is passed as plain text.\n\n## Other Options\n\n- `dataset_type`: `medqa` or `nejm` (auto-detect if omitted)\n- `max_turns`: Maximum conversation turns (default: 20)\n- `use_think`: Enable chain-of-thought prompting (default: false)\n- `patient_temperature` / `measurement_temperature`\n- `aux_max_tokens`: Max tokens for patient/measurement agents\n- `doctor_bias` / `patient_bias`: Cognitive bias injection (validated)\n\n\n## Agent Roles\n\n- **Doctor** (evaluated model): Asks questions, requests tests (e.g., \"REQUEST TEST: MRI_Brain_Spine\"), makes diagnosis\n- **Patient** (auxiliary LLM): Simulates realistic patient responses based on case symptoms\n- **Measurement** (auxiliary LLM): Returns test results from scenario data when requested\n- **Judge** (auxiliary LLM): Evaluates diagnosis accuracy using the canonical AgentClinic moderator prompt\n","encoding":"utf-8","truncated":false,"total_bytes":2330},"status":null}