{"data":{"kind":"file","path":"README.md","version_id":"h09vyq1s0wwrqfccfikwibdj","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3701,"modified_at":"2026-06-24T14:00:23.409000","content_hash":"7dd405662d6d3537e5d2c85992e09e80742b4068688b88aea61aeadf58469fb4"},"entries":[],"content":"# order-medical-test-tool-call\n\n### Overview\n- **Environment ID**: `order-medical-test-tool-call`\n- **Short description**: Environment for evaluating **medical test ordering via tool calls**. Models must analyze patient case reports and generate correct JSON tool calls to order appropriate medical tests.\n- **Tags**: `tool-calling`, `medical`, `function-calling`\n\n### Datasets\n- **Primary dataset(s)**: `mkurman/order-medical-test-tool-call`\n- **Source links**: https://huggingface.co/datasets/mkurman/order-medical-test-tool-call\n- **Split sizes**: ~1.53k train, ~170 test\n\n### Task\n- **Type**: single-turn\n- **Parser**: custom (extracts JSON from `<tool_call>` tags)\n- **Rubric overview**: Scores both JSON validity and individual key-value correctness. Models must output a tool call in the format:\n\n```xml\n<tool_call>\n{\n  \"name\": \"order_medical_test\",\n  \"arguments\": {\n    \"patient_id\": \"PMC3539401_01\",\n    \"physician_id\": \"862\",\n    \"priority\": \"ROUTINE\",\n    \"test_code\": \"TSH\"\n  }\n}\n</tool_call>\n```\n\n### Installation\n\n```bash\nprime install order-medical-test-tool-call\n```\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nprime eval order-medical-test-tool-call\n```\n\nWith a local model:\n\n```bash\nexport DUMMY_KEY=\"dummy\"\nprime eval order-medical-test-tool-call \\\n  -m zai-org/GLM-4.7-Flash \\\n  -b http://localhost:8001/v1 \\\n  -k DUMMY_KEY \\\n  -n 10 -r 1 -t 8192 -c 1\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval order-medical-test-tool-call \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Metrics\n\nThe rubric scores multiple dimensions with weighted contributions:\n\n| Metric | Weight | Meaning |\n| ------ | ------ | ------- |\n| `json_valid_score` | 0.10 | Valid JSON structure in `<tool_call>` tags |\n| `tool_call_name_score` | 0.10 | Correct function name (`order_medical_test`) |\n| `patient_id_score` | 0.15 | Correct patient identifier |\n| `physician_id_score` | 0.15 | Correct physician identifier |\n| `test_code_score` | 0.20 | Correct medical test code (CBC, TSH, LIPID, etc.) |\n| `priority_score` | 0.10 | Correct priority level (ROUTINE or STAT) |\n| `full_match_score` | 0.20 | Bonus for exact full match on all fields |\n| `reward` | 1.00 | Main scalar reward (weighted sum of above) |\n\n> **⚠️ Training note (GRPO / prime-rl):** The sub-scores must have **non-zero weights** for RL training. If only `full_match_score` has weight 1.0 (pure 0/1 binary reward), all rollouts in a group tend to receive the same reward, GRPO advantage collapses to zero, and prime-rl's `zero_advantage` filter drops the entire batch — producing the `RuntimeError: 10 consecutive zero-trainable batches` crash. The partial-credit weights above create within-group reward variance so that advantages are non-zero and the trainer receives gradient signal. For **eval-only** runs, binary `[0,0,0,0,0,0,1]` weights are fine.\n\n### Tool Definition\n\nThe model is provided with this tool specification:\n\n```json\n{\n  \"name\": \"order_medical_test\",\n  \"description\": \"Places an order for a specific medical test for a given patient.\",\n  \"parameters\": {\n    \"patient_id\": \"The unique identifier for the patient\",\n    \"test_code\": \"The specific code for the medical test (e.g., 'CBC', 'LIPID', 'TSH')\",\n    \"priority\": \"The priority of the test ('ROUTINE' or 'STAT')\",\n    \"physician_id\": \"The unique identifier for the ordering physician\"\n  }\n}\n```\n\n### Example\n\n**Input**: A medical case report with patient/physician IDs embedded in the prompt.\n\n**Expected Output**:\n```xml\n<tool_call>\n{\n  \"name\": \"order_medical_test\",\n  \"arguments\": {\n    \"patient_id\": \"PMC8100432_01\",\n    \"physician_id\": \"1488\",\n    \"priority\": \"STAT\",\n    \"test_code\": \"CBC\"\n  }\n}\n</tool_call>\n```\n","encoding":"utf-8","truncated":false,"total_bytes":3701},"status":null}