{"data":{"kind":"file","path":"README.md","version_id":"fgvdrzxwxbwux66o36ats8vo","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1392,"modified_at":"2025-08-22T02:01:39","content_hash":"6c915fff25feeeae0085dc0acf7259ff9542f37a893c5c6ac525d2d797286a72"},"entries":[],"content":"# tool-test\n\n### Overview\n- **Environment ID**: `tool-test`\n- **Short description**: Sanity-check tool-calling environment that asks models to invoke a random subset of dummy tools.\n- **Tags**: tools, single-turn, function-calling, sanity\n\n### Datasets\n- **Primary dataset(s)**: Synthetic HF dataset generated in-memory with prompts specifying required tools\n- **Source links**: N/A (programmatically generated)\n- **Split sizes**: Controlled by `num_train_examples` and `num_eval_examples`\n\n### Task\n- **Type**: tool use (single-turn ToolEnv)\n- **Parser**: default `Parser`\n- **Rubric overview**: ToolRubric checks tool execution and adds exact match on the required tool set\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval tool-test\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval tool-test \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"num_train_examples\": 1000, \"num_eval_examples\": 100}'\n```\n\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `1000` | Number of training examples |\n| `num_eval_examples` | int | `100` | Number of evaluation examples |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | 1.0 if called tool set equals required set, else 0.0 |\n| ToolRubric metrics | Tool execution success and format adherence |\n","encoding":"utf-8","truncated":false,"total_bytes":1392},"status":null}