{"data":{"kind":"file","path":"README.md","version_id":"was6zfzslh66jqwinmjx0ukd","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2922,"modified_at":"2026-01-27T17:55:00.547000","content_hash":"2c32e5b2c71d448280f783b7aa3ffcdadf5c2148b707f1b403997860accaf3d1"},"entries":[],"content":"# WebVoyager Browser Benchmark\n\nA browser benchmark environment for evaluating LLM agents on WebVoyager web navigation tasks using [Browserbase](https://browserbase.com).\n\nWebVoyager contains tasks across multiple real-world websites. Tasks are evaluated based on successful completion rather than explicit ground-truth answers.\n\n## Installation\n\nFirst, install the browser extras for verifiers:\n```bash\nuv pip install -e \".[browser]\"\n```\n\nThen install the webvoyager environment locally:\n```bash\nuv pip install -e ./environments/webvoyager\n```\n\n## Usage\n\n### Quick Start\n\n```bash\n# Run WebVoyager benchmark with OpenAI\nprime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n### Configuration\n\nSet your Browserbase credentials:\n```bash\nexport BROWSERBASE_API_KEY=\"your-api-key\"\nexport BROWSERBASE_PROJECT_ID=\"your-project-id\"\n```\n\nFor DOM mode (default), you'll also need:\n```bash\nexport OPENAI_API_KEY=\"your-openai-key\"  # For agent model and judge\nexport MODEL_API_KEY=\"your-openai-key\"   # For Stagehand browser operations\n```\n\n### Website Filtering\n\nWebVoyager includes tasks across many websites. You can filter by website:\n\n```bash\n# Run all tasks\nprime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n\n# Run only Amazon tasks\nprime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"web_filter\": \"Amazon\"}'\n\n# Run only Allrecipes tasks\nprime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"web_filter\": \"Allrecipes\"}'\n```\n\n### Browser Modes\n\n**DOM Mode** (default): Uses Stagehand SDK for natural language browser control.\n```bash\nprime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n**CUA Mode**: Uses vision-based primitives via a CUA server.\n```bash\nprime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"mode\": \"cua\", \"server_url\": \"http://localhost:3000\"}'\n```\n\n## Environment Arguments\n\n| Argument | Default | Description |\n|----------|---------|-------------|\n| `mode` | `\"dom\"` | Browser control mode (`\"dom\"` or `\"cua\"`) |\n| `max_turns` | `15` | Maximum conversation turns (recommended: 50 for complex tasks) |\n| `judge_model` | `\"gpt-4o-mini\"` | Model for task completion judging |\n| `num_examples` | `-1` | Number of examples (-1 for all) |\n| `web_filter` | `None` | Filter by website name |\n\n## Dataset\n\n- **Total tasks**: 642 tasks across multiple websites\n- **Websites**: Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking, Cambridge Dictionary, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Hugging Face, Wolfram Alpha\n- **Task format**: Web navigation tasks\n- **Evaluation**: Task completion judging via LLM\n\n## Requirements\n\n- Python >= 3.10\n- Browserbase account with API credentials\n- OpenAI API key (for agent model, judge, and Stagehand)\n","encoding":"utf-8","truncated":false,"total_bytes":2922},"status":null}