{"data":{"kind":"file","path":"README.md","version_id":"mipnwjxafufl8er5wej95ll5","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3056,"modified_at":"2026-01-27T17:54:49.508000","content_hash":"266e95e0e605d66f99318ef54341e27bcc4438b3f9817c99c24aec2d709024be"},"entries":[],"content":"# onlineMind2Web Browser Benchmark\n\nA browser benchmark environment for evaluating LLM agents on Mind2Web web navigation tasks using [Browserbase](https://browserbase.com).\n\nMind2Web contains tasks with varying difficulty levels. Tasks are evaluated based on successful completion rather than explicit ground-truth answers.\n\n## Installation\n\nFirst, install the browser extras for verifiers:\n```bash\nuv pip install -e \".[browser]\"\n```\n\nThen install the mind2web environment locally:\n```bash\nuv pip install -e ./environments/mind2web\n```\n\n## Usage\n\n### Quick Start\n\n```bash\n# Run Mind2Web benchmark with OpenAI\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n### Configuration\n\nSet your Browserbase credentials:\n```bash\nexport BROWSERBASE_API_KEY=\"your-api-key\"\nexport BROWSERBASE_PROJECT_ID=\"your-project-id\"\n```\n\nFor DOM mode (default), you'll also need:\n```bash\nexport OPENAI_API_KEY=\"your-openai-key\"  # For agent model and judge\nexport MODEL_API_KEY=\"your-openai-key\"   # For Stagehand browser operations\n```\n\n### Difficulty Levels\n\nMind2Web has three difficulty levels:\n- **easy**: 83 tasks, simpler navigation\n- **medium**: 143 tasks, moderate complexity\n- **hard**: 74 tasks, complex multi-step navigation\n\n```bash\n# Run all difficulty levels (default)\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n\n# Run easy tasks only\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"difficulty_level\": \"easy\"}'\n\n# Run medium tasks\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"difficulty_level\": \"medium\"}'\n\n# Run hard tasks\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"difficulty_level\": \"hard\"}'\n```\n\n### Browser Modes\n\n**DOM Mode** (default): Uses Stagehand SDK for natural language browser control.\n```bash\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n**CUA Mode**: Uses vision-based primitives via a CUA server.\n```bash\nprime eval run mind2web -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"mode\": \"cua\", \"server_url\": \"http://localhost:3000\"}'\n```\n\n## Environment Arguments\n\n| Argument | Default | Description |\n|----------|---------|-------------|\n| `mode` | `\"dom\"` | Browser control mode (`\"dom\"` or `\"cua\"`) |\n| `max_turns` | `50` | Maximum conversation turns (recommended: 50 for complex tasks) |\n| `judge_model` | `\"gpt-4o-mini\"` | Model for task completion judging |\n| `num_examples` | `-1` | Number of examples (-1 for all) |\n| `difficulty_level` | `None` | Task difficulty (`\"easy\"`, `\"medium\"`, `\"hard\"`, or `None` for all) |\n\n## Dataset\n\n- **Total tasks**: 300 tasks\n- **Difficulty distribution**: 83 easy, 143 medium, 74 hard\n- **Task format**: Web navigation tasks\n- **Evaluation**: Task completion judging via LLM\n\n## Requirements\n\n- Python >= 3.10\n- Browserbase account with API credentials\n- OpenAI API key (for agent model, judge, and Stagehand)\n","encoding":"utf-8","truncated":false,"total_bytes":3056},"status":null}