{"data":{"kind":"file","path":"README.md","version_id":"g3whmblxffqujtri4u6lx3f7","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3715,"modified_at":"2026-06-01T19:55:33.928000","content_hash":"f76a9761f45d98b24ef870e633a91afc9742acda3a095d193abc21a8d5204bc4"},"entries":[],"content":"# browsecomp\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/browsecomp\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\nBrowseComp is a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents.\n\nThis particular implementation (statically) equips the model with two search-related tools based on the Exa API:\n- `search(query: str, num_results: int = 5)`: Query the web and return a list of results with titles, URLs, and highlights\n- `open(url: str, pattern: str = \"\")`: Open a specific website by providing its URL and retrieve a concise summary of the website's content\n\n### Overview\n\n- **Environment ID**: `browsecomp`\n- **Short description**: BrowseComp evaluation environment\n- **Tags**: `web-search`, `tool-use`, `llm-as-judge`\n- **Notes**: To use Exa, ensure that the `EXA_API_KEY` environment variable is set.\n\n### Datasets\n\n- **Primary dataset(s)**: BrowseComp, described in [this paper](https://arxiv.org/abs/2504.12516)\n- **Source links**: [Encrypted dataset](https://openaipublic.blob.core.windows.net/simple-evals/browse_comp_test_set.csv)\n- 
**Split sizes**: 1,266 examples\n\n### Quickstart\n\nRun a full eval suite\n\n```bash\nprime eval run browsecomp\n```\n\nSpecify the model and sampling settings\n\n```bash\nprime eval run browsecomp \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Environment Arguments\n\nThe environment supports the following arguments:\n\n| Arg         | Type | Default | Description             |\n| ----------- | ---- | ------- | ----------------------- |\n| `max_turns` | int  | `100` | Maximum number of turns |\n| `judge_model` | str | `\"openai/gpt-4.1-mini\"` | Judge model name |\n| `judge_base_url` | str | `\"https://api.pinference.ai/api/v1\"` | Judge provider base URL |\n| `judge_api_key_var` | str | `\"PRIME_API_KEY\"` | Env var containing judge API key |\n\n### Metrics\n\n| Metric        | Meaning                                                        |\n| ------------- | -------------------------------------------------------------- |\n| `correct`     | 1 if the model's answer is judged correct, 0 otherwise         |\n| `judge_confidence` | Confidence score of the judge's answer |\n| `model_confidence` | Confidence score of the model's answer |\n\nThe main `reward` metric is identical to `correct`, which returns 1.0 if the model's answer is judged correct.\n\n### Changelog\n\n#### v0.1.6 (2026-05-14)\n\n- Default judge requests now use Pinference (`https://api.pinference.ai/api/v1`) with `PRIME_API_KEY`, Verifiers client setup for Prime team auth, and the Pinference-qualified `openai/gpt-4.1-mini` model name.\n\n#### v0.1.5 (2026-01-16)\n\n- Use exact instruction + judge prompt template from OpenAI's reference implementation ([`simple-evals`](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py))\n- Increase default `max_turns` from 20 to 100\n- Allow the model to specify the `num_results` argument in the `search` tool\n- Fail loudly if the `EXA_API_KEY` environment variable is not set (at env init time)\n- Use 
the metrics pattern to show the `judge_confidence` and `model_confidence` as metrics\n- Clean up implementation\n- Update README\n","encoding":"utf-8","truncated":false,"total_bytes":3715},"status":null}