{"data":{"kind":"file","path":"README.md","version_id":"ngo8mgwlgtgr7tvajc1hm98e","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3081,"modified_at":"2026-02-03T10:12:41.954000","content_hash":"b1f88e1182cd74f94853b76476db562eaced61a8f60b6a39bedb1269a81a4aab"},"entries":[],"content":"# order-medical-test-tool-call\n\n### Overview\n- **Environment ID**: `order-medical-test-tool-call`\n- **Short description**: Environment for evaluating **medical test ordering via tool calls**. Models must analyze patient case reports and generate correct JSON tool calls to order appropriate medical tests.\n- **Tags**: `tool-calling`, `medical`, `function-calling`\n\n### Datasets\n- **Primary dataset(s)**: `mkurman/order-medical-test-tool-call`\n- **Source links**: https://huggingface.co/datasets/mkurman/order-medical-test-tool-call\n- **Split sizes**: ~1.53k train, ~170 test\n\n### Task\n- **Type**: single-turn\n- **Parser**: custom (extracts JSON from `<tool_call>` tags)\n- **Rubric overview**: Scores both JSON validity and individual key-value correctness. Models must output a tool call in the format:\n\n```xml\n<tool_call>\n{\n  \"name\": \"order_medical_test\",\n  \"arguments\": {\n    \"patient_id\": \"PMC3539401_01\",\n    \"physician_id\": \"862\",\n    \"priority\": \"ROUTINE\",\n    \"test_code\": \"TSH\"\n  }\n}\n</tool_call>\n```\n\n### Installation\n\n```bash\nprime install order-medical-test-tool-call\n```\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nprime eval order-medical-test-tool-call\n```\n\nWith a local model:\n\n```bash\nexport DUMMY_KEY=\"dummy\"\nprime eval order-medical-test-tool-call \\\n  -m zai-org/GLM-4.7-Flash \\\n  -b http://localhost:8001/v1 \\\n  -k DUMMY_KEY \\\n  -n 10 -r 1 -t 8192 -c 1\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval order-medical-test-tool-call \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Metrics\n\nThe rubric scores multiple dimensions with weighted contributions:\n\n| Metric | Weight | Meaning |\n| ------ | ------ | ------- |\n| `json_valid_score` | 0.10 | Valid JSON structure in `<tool_call>` tags |\n| `tool_call_name_score` | 0.10 | Correct function name (`order_medical_test`) |\n| `patient_id_score` | 0.15 | Correct patient identifier |\n| `physician_id_score` | 0.15 | Correct physician identifier |\n| `test_code_score` | 0.20 | Correct medical test code (CBC, TSH, LIPID, etc.) |\n| `priority_score` | 0.10 | Correct priority level (ROUTINE or STAT) |\n| `full_match_score` | 0.20 | Bonus for exact full match on all fields |\n| `reward` | 1.00 | Main scalar reward (weighted sum of above) |\n\n### Tool Definition\n\nThe model is provided with this tool specification:\n\n```json\n{\n  \"name\": \"order_medical_test\",\n  \"description\": \"Places an order for a specific medical test for a given patient.\",\n  \"parameters\": {\n    \"patient_id\": \"The unique identifier for the patient\",\n    \"test_code\": \"The specific code for the medical test (e.g., 'CBC', 'LIPID', 'TSH')\",\n    \"priority\": \"The priority of the test ('ROUTINE' or 'STAT')\",\n    \"physician_id\": \"The unique identifier for the ordering physician\"\n  }\n}\n```\n\n### Example\n\n**Input**: A medical case report with patient/physician IDs embedded in the prompt.\n\n**Expected Output**:\n```xml\n<tool_call>\n{\n  \"name\": \"order_medical_test\",\n  \"arguments\": {\n    \"patient_id\": \"PMC8100432_01\",\n    \"physician_id\": \"1488\",\n    \"priority\": \"STAT\",\n    \"test_code\": \"CBC\"\n  }\n}\n</tool_call>\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3081},"status":null}