{"data":{"kind":"file","path":"README.md","version_id":"lcgvu7mtv8c7o1g0sfhuussp","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5926,"modified_at":"2026-06-21T02:18:51.877000","content_hash":"ebba6f185ae5723a8780caf774d05d2df81f30c567208f5ddb9532efa7562f4a"},"entries":[],"content":"# web-search-env\n\n`web-search-env` is a Verifiers environment for search-based QA tasks. It gives\nthe model a single `search(query)` tool and keeps the search provider hidden\nbehind a provider-agnostic backend.\n\nUse it to evaluate the same agent with different search providers, or to run the\nbenchmark suite from Exa's search RL blog post.\n\n## Install\n\n```bash\nprime env install web-search-env\n```\n\n## Search Providers\n\nSet `search_backend` to one of:\n\n- `fixture`: local deterministic search, no API key required\n- `exa`: Exa search via `EXA_API_KEY`\n- `parallel`: Parallel search via `PARALLEL_API_KEY`\n- `serpapi`: Google results through SerpApi via `SERPAPI_API_KEY`\n\nExample:\n\n```bash\nprime eval run web-search-env \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 -r 1 -t 1024 \\\n  -a '{\"eval_dataset\":\"hotpotqa\",\"search_backend\":\"exa\",\"judge_backend\":\"llm\"}'\n```\n\nUse the same command with only `search_backend` changed to compare providers.\n`cache_path` can be set to reuse live search results across repeated runs.\n\n## Datasets\n\nSupported datasets:\n\n- `sample`\n- `simpleqa`\n- `2wikimultihopqa` or `2wiki`\n- `hotpotqa`\n- `frames`\n- `musique`\n- `browsecomp`\n- `hle` or `hle-text`\n- `widesearch`\n\nBrowseComp is packaged with the environment, so it does not download the CSV at\nruntime.\n\nHLE uses the official `cais/hle` dataset. That dataset is gated on Hugging Face:\naccept access there and set `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` before\nrunning it. The loader uses the text-only subset because this environment does\nnot pass images to the model.\n\nWideSearch uses the official `ByteDance-Seed/WideSearch` dataset and downloads\nthe matching gold CSV tables from the same Hugging Face repo. Use an LLM judge\nfor this dataset; exact or normalized string matching is not meaningful for\nlarge table outputs.\n\nHLE and WideSearch are opt-in only. They are not part of the default sample\nrun, the Exa blog suite, or the Exa training setup.\n\nRun a single dataset:\n\n```bash\nprime eval run web-search-env \\\n  -m openai/gpt-4.1-mini \\\n  -n 50 -r 1 -t 1024 \\\n  -a '{\n    \"eval_dataset\":\"musique\",\n    \"search_backend\":\"parallel\",\n    \"parallel_mode\":\"advanced\",\n    \"judge_backend\":\"llm\"\n  }'\n```\n\nRun multiple datasets:\n\n```bash\nprime eval run web-search-env \\\n  -m openai/gpt-4.1-mini \\\n  -n 100 -r 1 -t 1024 \\\n  -a '{\"eval_dataset\":\"musique,hotpotqa\",\"search_backend\":\"exa\",\"judge_backend\":\"llm\"}'\n```\n\nRun HLE:\n\n```bash\nprime eval run web-search-env \\\n  -m openai/gpt-4.1-mini \\\n  -n 20 -r 1 -t 4096 \\\n  -a '{\"eval_dataset\":\"hle\",\"search_backend\":\"exa\",\"judge_backend\":\"llm\"}'\n```\n\nRun WideSearch:\n\n```bash\nprime eval run web-search-env \\\n  -m openai/gpt-4.1-mini \\\n  -n 5 -r 1 -t 8192 \\\n  -a '{\"eval_dataset\":\"widesearch\",\"search_backend\":\"exa\",\"judge_backend\":\"llm\",\"max_turns\":10}'\n```\n\n## Exa Blog Suite\n\nThe Exa blog suite is:\n\n```text\nsimpleqa\n2wikimultihopqa\nhotpotqa\nframes\nmusique\nbrowsecomp\n```\n\nUse `benchmark_suite=\"exa\"` to run that set:\n\n```bash\nprime eval run web-search-env \\\n  -m openai/gpt-4.1-mini \\\n  -n 30 -r 1 -t 1024 -T 0.2 \\\n  -a '{\n    \"benchmark_suite\":\"exa\",\n    \"benchmark_examples_per_dataset\":5,\n    \"search_backend\":\"exa\",\n    \"judge_backend\":\"llm\",\n    \"cache_path\":\"./search_traces/exa_suite.json\"\n  }'\n```\n\n`benchmark_examples_per_dataset=5` means 5 examples from each of the 6 datasets,\nso `-n 30` runs 30 total rollouts when `-r 1`.\n\nThis selects the same public benchmark set described in the Exa post. Exact\nreported numbers still depend on the model, sampling settings, judge, seeds, and\nlive search results.\n\nFor an n=8 run, use:\n\n```bash\nprime eval run configs/eval/web-search-exa-suite-n8.toml\n```\n\nPackaged copies of the n=8 configs are also included under\n`environments/web_search_env/configs/eval/`.\n\nThat config runs the six-benchmark suite as one eval. To print per-benchmark\nscores from the saved `results.jsonl`, run:\n\n```bash\npython environments/web_search_env/scripts/summarize_by_benchmark.py \\\n  environments/web_search_env/outputs/evals/<run-dir>/<run-id>/results.jsonl\n```\n\nUse `configs/eval/web-search-exa-suite-n8-by-benchmark.toml` instead if you want\nPrime's own final summary to have one eval section per benchmark.\n\n## Exa Training Setup\n\nUse `exa_training_setup=true` for the training and reward design described in\nthe Exa post. It selects MuSiQue and HotpotQA for training, uses the\nSimpleQA-style LLM grader as the terminal binary reward, applies a `-0.25`\ncontext-limit penalty, and sets the eval side to the six after-train benchmarks:\n\n```text\nmusique\nhotpotqa\n2wikimultihopqa\nframes\nbrowsecomp\nsimpleqa\n```\n\nHosted training config:\n\n```bash\nprime train configs/rl/web-search-exa-training.toml \\\n  --env-var EXA_API_KEY \\\n  --env-var OPENAI_API_KEY\n```\n\nThe packaged copy is also under\n`environments/web_search_env/configs/rl/web-search-exa-training.toml`.\n\n## Common Args\n\n- `eval_dataset`: dataset name or comma-separated list\n- `train_datasets`: training dataset name or comma-separated list\n- `training_suite`: use `\"exa\"` for MuSiQue + HotpotQA training data\n- `exa_training_setup`: use the Exa training/reward/eval defaults\n- `benchmark_suite`: use `\"exa\"`, `\"hle\"`, or `\"widesearch\"`\n- `benchmark_examples_per_dataset`: per-dataset cap for suite runs\n- `train_examples_per_dataset`: per-dataset cap for training suite data\n- `search_backend`: `fixture`, `exa`, `parallel`, or `serpapi`\n- `result_count`: search results per tool call, default `5`\n- `snippet_chars`: max snippet characters per result, default `2000`\n- `search_max_qps`: optional per-process request cap for live search providers\n- `search_max_retries`: retry count for retryable search failures\n- `search_initial_jitter_seconds`: random delay before each rollout's first live search\n- `judge_backend`: `simpleqa`, `llm`, `normalized`, `exact`, or `substring`\n- `judge_model`: judge model for `judge_backend=\"simpleqa\"` or `judge_backend=\"llm\"`\n- `max_turns`: max assistant/tool turns, default `10`\n","encoding":"utf-8","truncated":false,"total_bytes":5926},"status":null}