{"data":{"kind":"file","path":"README.md","version_id":"qbvn4z4lcuaqxrogcc6e239g","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3570,"modified_at":"2026-03-25T23:59:17.326000","content_hash":"84da498474e875bab89db6c9840bc02294412c36b8ab72744fd6e18afb2ddb5d"},"entries":[],"content":"# WebVoyager Browser Benchmark (No Anti-Bot)\n\nA browser benchmark environment for evaluating LLM agents on WebVoyager web navigation tasks using [Browserbase](https://browserbase.com).\n\nThis version uses a **filtered dataset** that excludes websites with anti-bot protection for more reliable evaluation.\n\nWebVoyager contains tasks across multiple real-world websites. Tasks are evaluated based on successful completion rather than explicit ground-truth answers.\nThis environment judges against the recorded browser interaction transcript, not just the agent's final free-text claim.\n\n## Dataset\n\n- **Total tasks**: 600 tasks (93.3% of original 643 tasks)\n- **Websites**: Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Hugging Face, Wolfram Alpha\n- **Excluded sites**: dictionary.cambridge.org (Cloudflare protection)\n- **Removed tasks**: 43 tasks from 1 site with anti-bot detection\n- **Task format**: Web navigation tasks\n- **Evaluation**: Task completion judging via LLM over the browser interaction transcript\n\n## Installation\n\nFirst, install the browser extras for verifiers:\n```bash\nuv pip install -e \".[browser]\"\n```\n\nThen install the webvoyager-no-anti-bot environment locally:\n```bash\nuv pip install -e ./environments/webvoyager_no_anti_bot\n```\n\nOr install from Prime hub:\n```bash\nprime env install browserbase/webvoyager-no-anti-bot\n```\n\n## Usage\n\n### Quick Start\n\n```bash\n# Run WebVoyager benchmark with OpenAI (clean dataset)\nprime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n### Configuration\n\nSet your Browserbase credentials:\n```bash\nexport BROWSERBASE_API_KEY=\"your-api-key\"\n# Optional: export BROWSERBASE_PROJECT_ID=\"your-project-id\"\n```\n\nFor DOM mode (default), you'll also need:\n```bash\nexport OPENAI_API_KEY=\"your-openai-key\"  # For agent model and judge\nexport MODEL_API_KEY=\"your-openai-key\"   # For Stagehand browser operations\n```\n\n### Website Filtering\n\nWebVoyager includes tasks across many websites. You can filter by website:\n\n```bash\n# Run all tasks (clean dataset)\nprime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n\n# Run only Amazon tasks\nprime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"web_filter\": \"Amazon\"}'\n\n# Run only Allrecipes tasks\nprime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"web_filter\": \"Allrecipes\"}'\n```\n\n### Browser Modes\n\n**DOM Mode** (default): Uses Stagehand SDK for natural language browser control.\n```bash\nprime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n**CUA Mode**: Uses vision-based primitives via a CUA server.\n```bash\nprime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"mode\": \"cua\", \"server_url\": \"http://localhost:3000\"}'\n```\n\n## Environment Arguments\n\n| Argument | Default | Description |\n|----------|---------|-------------|\n| `mode` | `\"dom\"` | Browser control mode (`\"dom\"` or `\"cua\"`) |\n| `max_turns` | `15` | Maximum conversation turns (recommended: 50 for complex tasks) |\n| `judge_model` | `\"gpt-4o-mini\"` | Model for task completion judging |\n| `num_examples` | `-1` | Number of examples (-1 for all) |\n| `web_filter` | `None` | Filter by website name |\n\n## Requirements\n\n- Python >= 3.10\n- Browserbase account with API credentials\n- OpenAI API key (for agent model, judge, and Stagehand)\n","encoding":"utf-8","truncated":false,"total_bytes":3570},"status":null}