{"data":{"kind":"file","path":"README.md","version_id":"v95k4ux0t10j41jlasw5lyb7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6746,"modified_at":"2026-06-01T19:55:28.517000","content_hash":"ff0edfe2e9929d0211660ae7d8a1bc69ecdf49a3533ee5eeb129eda5c1cf2bb8"},"entries":[],"content":"# bfcl-v3\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/bfcl_v3\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\nThe Berkeley Function Calling Leaderboard (BFCL) evaluates an LLM's ability to call functions (aka tools) accurately. This implementation builds on top of the official `bfcl-eval` package and implements the BFCL v3 benchmark.\n\n> We pin the version to avoid breaking changes, but regularly check for updates and upgrade as needed. If you think we should bump, please raise an issue on GitHub or open a discussion on the Environments Hub.\n\n> There are some [known discrepancies](#known-discrepancies) compared to the official implementation. 
Make sure to review them before reporting official scores.\n\n### Overview\n- **Environment ID**: `bfcl-v3`\n- **Short description**: Berkeley Function Calling Leaderboard (BFCL) evaluation environment\n- **Tags**: `tool-use`, `eval`\n\n### Datasets\n- **Primary dataset(s)**: BFCL benchmark dataset via `bfcl-eval` package\n- **Source links**: [BFCL Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html), [BFCL GitHub](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard), [BFCL Blog](https://gorilla.cs.berkeley.edu/blog.html)\n- **Split sizes**: 4,441 examples (full eval suite)\n\n### Quickstart\n\nRun a full eval suite (`all` categories)\n\n```bash\nprime eval run bfcl-v3\n```\n\nRun only `single_turn` or `multi_turn` categories\n\n```bash\n# single_turn\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"single_turn\"]}'\n\n# multi_turn\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"multi_turn\"]}'\n```\n\nRun only `live` (user-contributed) or `non_live` (official) categories\n\n```bash\n# live\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"live\"]}'\n\n# non_live\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"non_live\"]}'\n```\n\nRun only `python` or `non_python` categories\n\n```bash\n# python\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"python\"]}'\n\n# non_python\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"non_python\"]}'\n```\n\nRun any combination of categories (or category groups)\n\n```bash\nprime eval run bfcl-v3 -a '{\"test_categories\": [\"simple_python\", \"simple_java\"]}'\n```\n\nSpecify the model and sampling settings\n\n```bash\nprime eval run bfcl-v3  \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"test_categories\": [\"simple_python\", \"simple_java\"]}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `test_categories` | list[str] | `[\"all\"]` | Categories to evaluate |\n| 
`examples_per_category` | int | `-1` | Limit examples per category (-1 for all) |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Binary reward (1.0 for correct, 0.0 for incorrect) using the task-specific official checker |\n\n### Supported Categories\n\nWe support a subset of the [official categories](https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/TEST_CATEGORIES.md):\n\n| Category | Description | Supported |\n| -------- | ----------- | --------- |\n| `simple_python` | Single Python function calls | ✅ |\n| `simple_java` | Single Java function calls | ✅ |\n| `simple_javascript` | Single JavaScript function calls | ✅ |\n| `multiple` | Multiple function calls in sequence | ✅ |\n| `parallel` | Multiple function calls in parallel | ✅ |\n| `parallel_multiple` | Multiple function calls in parallel and in sequence | ✅ |\n| `irrelevance` | Function calls with irrelevant function documentation | ✅ |\n| `live_simple` | User-contributed simple function calls | ✅ |\n| `live_multiple` | User-contributed multiple function calls in sequence 
| ✅ |\n| `live_parallel` | User-contributed multiple function calls in parallel | ✅ |\n| `live_parallel_multiple` | User-contributed multiple function calls in parallel and in sequence | ✅ |\n| `live_irrelevance` | User-contributed function calls with irrelevant function documentation | ✅ |\n| `live_relevance` | User-contributed function calls with relevant function documentation | ✅ |\n| `multi_turn_base` | Multi-turn conversations | ✅ |\n| `multi_turn_miss_func` | Multi-turn with missing functions | ✅ |\n| `multi_turn_miss_param` | Multi-turn with missing parameters | ✅ |\n| `multi_turn_long_context` | Multi-turn with long context | ✅ |\n| `format_sensitivity` | Various system prompt formats to test the format sensitivity of the model | ❌ |\n\nWe do not support the `format_sensitivity` category because it does not contribute to the overall benchmark score.\n\n### Supported Category Groups\n\nWe do not support the category groups that correspond to unsupported categories, i.e. the `memory`, `web_search`, and `agentic` groups are missing.\n\n| Category Group | Description | Supported |\n| --------------- | ----------- | --------- |\n| `all` | All test categories | ✅ |\n| `all_scoring` | All scoring test categories that will affect the overall accuracy score | ✅ |\n| `multi_turn` | All multi-turn test categories | ✅ |\n| `single_turn` | All single-turn test categories | ✅ |\n| `live` | All user-contributed live test categories | ✅ |\n| `non_live` | All non-user-contributed test categories (the opposite of live) | ✅ |\n| `python` | Test categories specific to Python code | ✅ |\n| `non_python` | Test categories for code in languages other than Python, such as Java and JavaScript | ✅ |\n| `format_sensitivity` | Test categories for various system prompt formats to test the format sensitivity of the model | ❌ |\n\n### Known Discrepancies\n\nThere are some known discrepancies compared to the official implementation. Most of these are due to current limitations of the `verifiers` framework. 
We aim to address these in the future to provide a 1-1 reference implementation.\n\n| Official BFCL | Our Implementation |\n| ------------- | --------- |\n| Computes weighted or unweighted per-group average scores, which are combined as `Overall Score = (Agentic × 40%) + (Multi-Turn × 30%) + (Live × 10%) + (Non-Live × 10%) + (Hallucination × 10%)` ([blog](https://gorilla.cs.berkeley.edu/blogs/15_bfcl_v4_web_search.html)) | Reports a global unweighted average score of all supported categories |\n| May evaluate models without native function calling support (prompting) | Only evaluates models with native (OAI) function calling support |\n| Supports all categories | Missing `format_sensitivity` |\n\n### Changelog\n\n#### v0.1.3\n\n- Bump to `verifiers>=v0.1.11.dev0` to support new types\n\n#### v0.1.2\n\n- Switch to using a [fork](https://github.com/mikasenghaas/gorilla/tree) of the `bfcl-eval` package to loosen dependency constraints\n\n#### v0.1.1\n\n- Bump to `verifiers>=v0.1.10.dev0` to remove commit pin\n\n#### v0.1.0\n\n- Initial release","encoding":"utf-8","truncated":false,"total_bytes":6746},"status":null}