{"data":{"kind":"file","path":"README.md","version_id":"awtzvhk3naj3se9o7n6m6wtm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4129,"modified_at":"2026-02-20T23:11:44.196000","content_hash":"ba81a9cf678a4a9bf1c28fce2615afa72c8a2641064d0abb3719105115455d02"},"entries":[],"content":"# MedAgentBench V2\n\n## Overview\n- **Environment ID**: `medagentbenchv2`\n- **Short description**: Tool-calling evaluation environment for clinical EHR tasks using FHIR APIs\n- **Tags**: medical, ehr, tool-calling, multi-turn, clinical, evaluation\n\n## Dataset\n- **Primary dataset**: MedAgentBench V2 evaluation tasks\n- **Source**: [MedAgentBench GitHub](https://github.com/stanfordmlgroup/MedAgentBench), [Paper](https://arxiv.org/abs/2501.14654)\n- **Task variants**:\n  - `new_patient_tasks`: Patient-centric clinical scenarios (default; matches upstream `collect_agent_responses.py`)\n  - `new_tasks`: General clinical tasks\n  - `test_data_v2`: Extended evaluation dataset\n- **Task categories**: 10 task types covering various EHR operations (task1-task10)\n\n## Task\n- **Type**: multi-turn tool-calling\n- **Rubric overview**: Binary scoring (0 or 1) based on successful task completion using category-specific evaluation functions\n\n## Note\n\nThis environment is a lightweight verifiers wrapper around the original MedAgentBenchV2 code. The primary difference is that the original code uses OpenAI's Responses API, while this environment uses the Chat Completions API through verifiers. 
The MedAgentBenchV2 prompts were lightly modified to show generic tool-calling examples instead of the original's Responses API format examples.\n\n## Prerequisites\nBefore running evaluations, you must start the FHIR server:\n\n```bash\ndocker pull jyxsu6/medagentbench:latest\ndocker tag jyxsu6/medagentbench:latest medagentbench\ndocker run --platform linux/amd64 -e JAVA_TOOL_OPTIONS='-XX:+UseSerialGC -Xms256m -Xmx1024m' -p 8080:8080 medagentbench:latest\n```\n\n**Important**:\n- The trailing slash in the FHIR base URL (`fhir_api_base`) is required\n- Replace `localhost` with the server's IP address if the FHIR server runs on a remote machine\n- Server connectivity is automatically verified before evaluation begins\n\n## Quickstart\nRun an evaluation with default settings (requires a running FHIR server):\n\n```bash\nprime eval run medagentbenchv2 -m \"openai/gpt-5-mini\" -n 5 -s -a '{\"fhir_api_base\": \"http://localhost:8080/fhir/\"}'\n```\n\nRun a small evaluation on specific task types:\n\n```bash\nmedarc-eval medagentbenchv2 -m \"openai/gpt-5-mini\" -n 5 -s --fhir-api-base http://localhost:8080/fhir/ --task-types task1 --task-types task2\n```\n\nNotes:\n- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).\n- The FHIR server must be accessible at the specified URL\n- Models should support tool calling for this environment\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `fhir_api_base` | str | Required | Base URL for the FHIR server (must include trailing slash) |\n| `tasks_variant` | str | `\"new_patient_tasks\"` | Task variant to use: `\"new_patient_tasks\"`, `\"new_tasks\"`, or `\"test_data_v2\"` |\n| `tasks_path` | str | None | Optional custom path to tasks JSON file (overrides `tasks_variant`) |\n| `task_types` | list[str] | None | Optional list of task types to filter (e.g., `[\"task1\", \"task2\"]`) |\n| `max_turns` | int | 8 | Maximum number of interaction turns per task |\n\n## Available 
Tools\n\nThe environment provides FHIR-based tools for clinical operations:\n- `fhir_patient_search`: Search for patients in the EHR\n- `fhir_observation_search`: Query clinical observations\n- `fhir_vitals_search`: Search vital signs\n- `fhir_vitals_create`: Create new vital sign records\n- `fhir_medication_request_search`: Search medication requests\n- `fhir_medication_request_create`: Create medication requests\n- `fhir_procedure_search`: Query procedures\n- `fhir_condition_search`: Search patient conditions\n- `fhir_service_request_create`: Create service requests\n- `finish`: Submit the final answer (required to complete tasks)\n\n## Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `medagentbench_reward` | Binary score (1 if the task is solved correctly, 0 otherwise); weight 1.0 |\n\n## Task Categories\n\nThe environment evaluates 10 distinct task categories, each with specialized evaluation logic:\n- **task1-task10**: Various EHR operations including patient search, data retrieval, record creation, and clinical decision support\n","encoding":"utf-8","truncated":false,"total_bytes":4129},"status":null}