{"data":{"kind":"file","path":"README.md","version_id":"vszg84o90gstmjsswu39i1ik","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5584,"modified_at":"2026-04-24T23:34:04.851000","content_hash":"ba9967d8914b161e345b3cdcda200925ec637913ad21606649ec5dc54b07c674"},"entries":[],"content":"# AutomationBench\n\nEvaluates AI agents on realistic, multi-step business workflows across CRM, email, calendar, and 47 simulated SaaS tools.\n\n- Website: [zapier.com/benchmarks](https://zapier.com/benchmarks)\n- GitHub: [github.com/zapier/AutomationBench](https://github.com/zapier/AutomationBench)\n- Paper: https://arxiv.org/abs/2604.18934\n\n### Overview\n- **Environment ID**: `zapier/AutomationBench`\n- **Author**: [Zapier](https://zapier.com)\n- **Tags**: tool-use, multi-turn, business-workflows, eval\n\n### What It Tests\n\nAutomationBench measures how well AI agents complete realistic business workflows — the kind of tasks that happen daily in sales, marketing, operations, support, finance, and HR. Each task initializes a simulated business environment (CRM, calendar, inbox, spreadsheets, etc.) and checks whether the agent leaves it in the correct state after executing a multi-step workflow.\n\nTasks require agents to:\n- Search for and call the right APIs across 47 simulated SaaS tools\n- Read data from multiple sources (spreadsheets, CRM records, email threads) to make decisions\n- Handle traps and distractors (stale data, irrelevant records, edge cases)\n- Execute multi-step workflows with correct sequencing and data transformation\n\n### Domains\n\n| Domain | Tasks | Coverage |\n|--------|-------|----------|\n| Sales | 100 | CRM updates, lead management, deal routing, cross-app workflows |\n| Marketing | 100 | Campaign ops, ad performance, content scheduling, brand monitoring |\n| Operations | 100 | Facility management, project tracking, vendor workflows, compliance |\n| Support | 100 | Ticket routing, SLA monitoring, knowledge base, multi-platform helpdesk |\n| Finance | 100 | AP/AR, expenses, reporting, bookkeeping |\n| HR | 100 | Recruitment, employee onboarding, time off, payroll |\n| Simple | 200 | Foundational single/two-step tasks (baseline, excluded from benchmark score) |\n| **Total** | **800** | |\n\n### Quickstart\n\nInstall and run with default settings:\n\n```bash\nprime env install zapier/AutomationBench\nprime eval run AutomationBench\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval run AutomationBench -m gpt-5-mini -n 20 -r 1 -t 4096 -T 0.0\n```\n\nRun a specific domain:\n\n```bash\nprime eval run AutomationBench -a '{\"domains\": \"sales\"}'\n```\n\nRun multiple domains:\n\n```bash\nprime eval run AutomationBench -a '{\"domains\": \"sales,marketing\"}'\n```\n\n### Environment Arguments\n\nThese are passed via `-a` / `--env-args` as JSON:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `domains` | str | `\"all\"` | Comma-separated domains: `sales`, `marketing`, `operations`, `support`, `finance`, `hr`, `simple`, or `all`. Note: `\"all\"` includes only the six scored domains (sales, marketing, operations, support, finance, hr). To include simple, pass it explicitly (e.g., `\"sales,simple\"`). |\n| `toolset` | str | `\"api\"` | Tool interface style: `api` (REST-style search + fetch) or `zapier` (meta-tools discovery) |\n\nUse standard `prime eval run` flags for other settings (e.g., `-n` for number of examples, `-r` for rollouts).\n\n### Toolsets\n\n**`api` (default)** — The agent gets two tools: `api_search` (find relevant API endpoints by query) and `api_fetch` (call an endpoint with method, URL, and body). This is the recommended toolset and mirrors how agents interact with real API platforms.\n\n**`zapier`** — The agent discovers tools through a meta-tool interface, searching for available actions and executing them by name. This matches Zapier's action-based automation model.\n\n### Scoring\n\nEach task defines a set of assertions that check the final state of the simulated environment. There are two scoring modes:\n\n**Pass Rate (`task_completed_correctly`, official benchmark score)** — A task is either fully correct or not. `task_completed_correctly` is 1.0 only if *every* assertion passes, 0.0 otherwise. The official AutomationBench benchmark score is the average of this metric across all scored tasks (simple domain excluded).\n\n**Partial Credit (`partial_credit`)** — The fraction of assertions satisfied per task (0.0 to 1.0). This is the rubric's weighted reward signal, which provides denser feedback for training and iterative development. A model that gets 7 out of 10 assertions right scores 0.7 `partial_credit` even though it scores 0.0 on pass rate.\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of rubric functions; equal to `partial_credit` here (partial credit has weight 1.0, task_completed_correctly has weight 0.0). |\n| `partial_credit` | Fraction of assertions satisfied per task (0.0 - 1.0). Denser signal for training / iterative development. |\n| `task_completed_correctly` | 1.0 if all assertions pass, 0.0 otherwise. Used for official benchmark pass rate. |\n| `num_turns` | Number of agent turns used |\n| `total_tool_calls` | Total tool invocations |\n| `api_search_calls` | Number of `api_search` calls (api toolset) |\n| `api_fetch_calls` | Number of `api_fetch` calls (api toolset) |\n\n### Public vs. Official Scores\n\nThis environment ships the **public** AutomationBench task set. The **official** leaderboard at [zapier.com/benchmarks](https://zapier.com/benchmarks) is scored on a separate, held-out private task set per domain. The private set follows the same task distribution and assertion framework as the public set but is never released, so scores you measure with this environment may not match the official leaderboard 1:1. Expect directional agreement — if a model improves on the public set, it is likely (but not guaranteed) to improve on the private set.\n\n","encoding":"utf-8","truncated":false,"total_bytes":5584},"status":null}