{"data":{"kind":"file","path":"README.md","version_id":"uuq2alb7vy3yvd2dxyd63xed","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2668,"modified_at":"2026-01-27T17:55:00.586000","content_hash":"c2dbca5f17a0928b84bbc67f9f86cc5e756ca7433b3d25b4fd3599188b945e53"},"entries":[],"content":"# GAIA Web Browser Benchmark\n\nA browser benchmark environment for evaluating LLM agents on GAIA (General AI Assistants) web tasks using [Browserbase](https://browserbase.com).\n\nGAIA tasks are multi-hop questions that require web browsing and reasoning to find answers. Each task has a ground-truth answer for evaluation.\n\n## Installation\n\nFirst, install the browser extras for verifiers:\n```bash\nuv pip install -e \".[browser]\"\n```\n\nThen install the gaia environment locally:\n```bash\nuv pip install -e ./environments/gaia\n```\n\n## Usage\n\n### Quick Start\n\n```bash\n# Run GAIA benchmark with OpenAI\nprime eval run gaia -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n### Configuration\n\nSet your Browserbase credentials:\n```bash\nexport BROWSERBASE_API_KEY=\"your-api-key\"\nexport BROWSERBASE_PROJECT_ID=\"your-project-id\"\n```\n\nFor DOM mode (default), you'll also need:\n```bash\nexport OPENAI_API_KEY=\"your-openai-key\"  # For agent model and judge\nexport MODEL_API_KEY=\"your-openai-key\"   # For Stagehand browser operations\n```\n\n### Difficulty Levels\n\nGAIA has two difficulty levels:\n- **easy** (Level 1): 26 tasks, simpler multi-hop questions\n- **hard** (Level 2): 65 tasks, more complex reasoning required\n\n```bash\n# Run easy tasks (default)\nprime eval run gaia -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n\n# Run hard tasks\nprime eval run gaia -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"difficulty_level\": \"hard\"}'\n```\n\n### Browser Modes\n\n**DOM Mode** (default): Uses Stagehand SDK for natural language browser control.\n```bash\nprime eval run gaia -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY\n```\n\n**CUA Mode**: Uses vision-based primitives via a CUA server.\n```bash\nprime eval run gaia -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{\"mode\": \"cua\", \"server_url\": \"http://localhost:3000\"}'\n```\n\n## Environment Arguments\n\n| Argument | Default | Description |\n|----------|---------|-------------|\n| `mode` | `\"dom\"` | Browser control mode (`\"dom\"` or `\"cua\"`) |\n| `max_turns` | `15` | Maximum conversation turns (recommended: 50 for complex tasks) |\n| `judge_model` | `\"gpt-4o-mini\"` | Model for task completion judging |\n| `num_examples` | `-1` | Number of examples (-1 for all) |\n| `difficulty_level` | `\"easy\"` | Task difficulty (`\"easy\"` or `\"hard\"`) |\n\n## Dataset\n\n- **Total tasks**: 91 (26 easy + 65 hard)\n- **Task format**: Multi-hop questions with ground-truth answers\n- **Evaluation**: Answer matching with judge model verification\n\n## Requirements\n\n- Python >= 3.10\n- Browserbase account with API credentials\n- OpenAI API key (for agent model, judge, and Stagehand)\n","encoding":"utf-8","truncated":false,"total_bytes":2668},"status":null}