{"data":{"kind":"file","path":"README.md","version_id":"m2q02nd1ms9ewoulkn19sdja","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4274,"modified_at":"2026-05-26T06:08:51.755000","content_hash":"98a6fd25dea7b649bbcca893c9aef24b84db0f3162ea1182e22cf98b34663592"},"entries":[],"content":"# ecom-bench\n\nHard cart-building tasks for browser agents on real Shopify e-commerce sites,\nvia `BrowserEnv` in DOM mode (Stagehand). The agent reads a shopping task,\ndrives a live storefront with `navigate` / `observe` / `act` / `extract`, and\nis scored on the resulting cart by a per-site verifier.\n\nTwo models are in the loop: the **rollout model under test** emits the tool\ncalls, and a separate **Stagehand grounder** (`stagehand_model`, default\n`anthropic/claude-haiku-4-5`) does the DOM grounding behind those tools.\n\n### Overview\n- **Environment ID**: `ecom-bench`\n- **Type**: multi-turn, tool use, browser\n- **Tags**: browser-agent, stagehand, shopify, eval, train\n\n### Dataset\n\n40 tasks across 4 Shopify storefronts, 10 per site:\n\n| Site | Tasks | Storefront |\n| --- | --- | --- |\n| gymshark | 10 | gymshark.com |\n| allbirds | 10 | allbirds.com |\n| chillys  | 10 | chillys.com |\n| kylie    | 10 | kyliecosmetics.com |\n\nTasks are adversarial cart-building prompts (exact totals, matching sets,\nsize/color constraints, multi-item bundles) mined to challenge frontier\nbrowser agents. Each row carries `info = {site, task_id, start_url, location,\nverifier_path}`; the rollout pre-navigates to `start_url` before turn 1.\n\n### Tools\n\n- `navigate(url)` — load a URL.\n- `observe(instruction)` — find DOM elements matching a description.\n- `act(instruction)` — perform a natural-language action on the page.\n- `extract(instruction, schema_json)` — pull structured data matching a JSON schema.\n\n### Reward\n\n- `cart_verifier_reward` (weight 1.0) — binary. 
A priority-10 `@vf.cleanup`\n  hook opens a second Playwright CDP connection to the live Browserbase\n  session, runs the bundled per-site cart snapshot, and scores it with the\n  task's verifier (`(result, checks, debug)`). `result` is the reward.\n\nMetric (weight 0): `cart_total_qty` — total items in the captured cart, for\npartial-progress visibility even when the binary reward is 0.\n\n### Stop conditions\n\nA rollout ends when the model emits an assistant turn with no tool call\n(it considers the task done) or `max_turns` (default 30) is reached. Scoring\nruns at cleanup regardless of how the rollout ended.\n\n### Required environment variables\n\n- `BROWSERBASE_API_KEY` — Browserbase session creation + side-channel CDP attach.\n- `BROWSERBASE_PROJECT_ID` — Browserbase project for the sessions.\n- `ANTHROPIC_API_KEY` — key for the default Stagehand grounder\n  (`anthropic/claude-haiku-4-5`). If you change `stagehand_model` to another\n  provider, set that provider's key instead (`OPENAI_API_KEY` / `GOOGLE_API_KEY`).\n\nThe rollout model under test uses whatever provider/key the eval is configured\nwith (e.g. `--provider prime`), independent of the grounder key above.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `site` | str | `None` | Keep only tasks from one site (e.g. `\"gymshark\"`). |\n| `task_ids` | list[str] | `None` | Keep only these task ids. |\n| `stagehand_model` | str | `\"anthropic/claude-haiku-4-5\"` | Stagehand's internal DOM-grounding LLM. |\n| `model_api_key_var` | str | derived | Env var Stagehand reads for the grounder key; derived from `stagehand_model`'s provider by default. |\n| `proxies` | bool | `True` | Route the Browserbase session through a residential proxy (Shopify CDNs bot-block without one). |\n| `max_turns` | int | `30` | Agent step cap per rollout. |\n| `proxy_model_to_stagehand` | bool | `False` | Route Stagehand's grounder calls through the verifiers client. 
Off because Stagehand's server hardcodes `api.openai.com` for non-OpenAI providers. |\n\n### Quickstart\n\n```sh\nprime env install ecom-bench\n\n# Single-task smoke (gymshark task 183), one rollout\nprime eval run ecom-bench -m claude-haiku-4-5 -n 1 -r 1 \\\n  -a '{\"task_ids\": [\"183\"]}'\n\n# One site, default haiku grounder\nprime eval run ecom-bench -m claude-haiku-4-5 -n 10 -r 1 -a '{\"site\": \"gymshark\"}'\n\n# Full 40-task run (push first so results auto-upload)\nprime env push --path environments/ecom_bench --visibility PRIVATE\nprime eval run vibrantlabsai/ecom-bench -m claude-haiku-4-5 -n 40 -r 1\n```\n\n`-n` caps how many of the 40 tasks are sampled — pass `-n 40` for full coverage\n(the bundled `[tool.verifiers.eval]` default samples fewer).\n","encoding":"utf-8","truncated":false,"total_bytes":4274},"status":null}