{"data":{"kind":"file","path":"README.md","version_id":"ox50e2nbyfhi1iym058ajtpb","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5160,"modified_at":"2026-03-17T01:26:11.053000","content_hash":"13133a91f60b1f3a60acb2f65419977d32d1ac590a033d51225548e65cff8e85"},"entries":[],"content":"# bb-demo\n\n### Overview\n- **Environment ID**: `bb-demo`\n- **Short description**: Browser automation environment using Browserbase CUA mode for vision-based web browsing tasks\n- **Tags**: `browser`, `browserbase`, `cua`, `vision`, `eval`\n\n### Datasets\n- **Primary dataset(s)**: Single hardcoded task for browsing Prime Intellect's website\n- **Task**: \"Go to the Prime Intellect website, find their blog page, and tell me what their latest blog is about\"\n- **Split sizes**: 1 example (demo/evaluation)\n\n### Task\n- **Type**: Multi-turn tool use (vision-based browser automation)\n- **Output format expectations**: Natural language response describing the latest blog post\n- **Rubric overview**: LLM judge (gpt-4.1-mini) evaluates whether the agent successfully completed the task\n\n### Available Tools (CUA Mode)\nThe environment provides vision-based browser automation tools:\n\n| Tool | Description |\n| ---- | ----------- |\n| `click(x, y)` | Click at screen coordinates |\n| `double_click(x, y)` | Double-click at screen coordinates |\n| `type_text(text)` | Type text into focused element |\n| `keypress(keys)` | Press keyboard keys |\n| `scroll(x, y, scroll_x, scroll_y)` | Scroll the page |\n| `goto(url)` | Navigate to a URL |\n| `back()` | Browser back button |\n| `forward()` | Browser forward button |\n| `wait(time_ms)` | Wait for specified milliseconds |\n| `screenshot()` | Capture current page screenshot |\n\n### Quickstart\n\nSet required environment variables:\n```bash\nexport BROWSERBASE_API_KEY=\"your-api-key\"\nexport BROWSERBASE_PROJECT_ID=\"your-project-id\"  # optional\nexport OPENAI_API_KEY=\"your-openai-key\"  # for judge model\n```\n\nInstall 
the environment locally:\n```bash\nprime env install bb-demo\n```\n\nRun an evaluation:\n```bash\nprime eval run bb-demo --model gpt-4o\n```\n\nWith custom settings:\n```bash\nprime eval run bb-demo \\\n  --model gpt-4o \\\n  --env-args '{\"save_screenshots\": true, \"mark_clicks\": true, \"screenshot_dir\": \"./my-screenshots\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `save_screenshots` | bool | `True` | Save screenshots to disk during browsing |\n| `keep_recent_screenshots` | int \\| None | `2` | Number of recent screenshots to keep in model context (None = keep all) |\n| `screenshot_dir` | str \\| None | `./screenshots` | Directory to save screenshots |\n| `mark_clicks` | bool | `True` | Draw click markers on saved screenshots showing where clicks occurred |\n| `click_marker_radius` | int | `20` | Radius of click marker circle in pixels |\n\n### Screenshots & Click Markers\n\nScreenshots are automatically saved to the `screenshot_dir` (default: `./screenshots` in current working directory).\n\nWhen `mark_clicks=True` and Pillow is installed, click and double-click actions will have visual markers drawn on the saved screenshots showing:\n- Red/yellow concentric circles at the click location\n- Crosshairs through the click point\n- Text label with coordinates (e.g., \"CLICK (512, 384)\")\n\nThis helps with debugging and understanding agent behavior.\n\n**Important:** Click markers require Pillow to be installed. If Pillow is not available, screenshots will be saved without markers and you'll see a warning:\n```\nUserWarning: Pillow is not installed. 
Click markers will not be drawn on screenshots.\n```\n\nDebug output will show `PIL_AVAILABLE=False` if Pillow is missing.\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | 1.0 if judge determines task was completed successfully, 0.0 otherwise |\n| `judge_reward_func` | LLM judge evaluation result |\n| `num_turns` | Number of interaction turns taken |\n| `total_tool_calls` | Total number of tool calls made |\n| `click_calls` | Number of click tool calls |\n| `double_click_calls` | Number of double-click tool calls |\n| `scroll_calls` | Number of scroll tool calls |\n| `goto_calls` | Number of navigation tool calls |\n| `screenshot_calls` | Number of screenshot tool calls |\n\n### Requirements\n\n- `BROWSERBASE_API_KEY` environment variable\n- `OPENAI_API_KEY` environment variable (for judge model)\n- **Pillow** - Required for click markers on screenshots\n\n### Dependencies\n\n```toml\ndependencies = [\n    \"verifiers[browser]>=0.1.10\",\n    \"aiohttp>=3.13.2\",\n    \"browserbase>=1.4.0\",\n    \"playwright>=1.56.0\",\n    \"stagehand\",\n    \"Pillow>=10.0.0\",\n]\n```\n\n### Troubleshooting\n\n#### Click markers not appearing on screenshots\n\nIf screenshots are saved but don't show click markers:\n\n1. **Check if Pillow is installed** - Look for `PIL_AVAILABLE=False` in the debug output\n2. **Re-push the environment** if using remote execution:\n   ```bash\n   prime env push --path ./environments/bb_demo\n   ```\n3. 
**Verify locally** by checking for the Pillow warning when the environment loads\n\n#### Debug mode\n\nRun with the `--debug` flag to see detailed output, including:\n- `[ClickMarkerBrowserEnv] Initializing with mark_clicks=...`\n- `[ClickMarkerCUAMode] Initialized with mark_clicks=..., PIL_AVAILABLE=...`\n- `[ClickMarkerCUAMode.click] Called with x=..., y=...`\n- `[ClickMarkerCUAMode] Saving screenshot: mark_clicks=..., pending_click=..., PIL=...`\n\nExample:\n```bash\nprime eval run bb-demo -n 1 -r 1 -m gpt-4o --debug\n```\n","encoding":"utf-8","truncated":false,"total_bytes":5160},"status":null}