{"data":{"kind":"file","path":"README.md","version_id":"kahpqwe5r4qwnnru28t6h6gk","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4027,"modified_at":"2026-06-01T19:55:33.941000","content_hash":"84f6624524e51daf1743abc01e58ca6cb4950cbbb1ee2c8c30c5332945e88f25"},"entries":[],"content":"# ddbc\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/ddbc\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\n### Overview\n\n- **Environment ID**: `ddbc`\n- **Short description**: BrowseComp with improved DeepDive tools.\n- **Tags**: qa,multiturn,search,tool-use\n\n### Datasets\n\n- **Primary dataset(s)**: BrowseComp, described in [this paper](https://arxiv.org/abs/2504.12516)\n- **Source links**: [Encrypted dataset](https://openaipublic.blob.core.windows.net/simple-evals/browse_comp_test_set.csv)\n- **Split sizes**: 1,266 examples\n\n### Quickstart\n\nRun a full eval suite\n\n```bash\nprime eval run ddbc\n```\n\nSpecify the model and sampling settings\n\n```bash\nprime eval run ddbc \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_turns` | int | 32 | Max number of turns |\n| `serper_api_key_var` | str | \"SERPER_API_KEY\" | Env var holding the Serper API key |\n| `max_search_results` | int | 10 | Maximum number of search results from Serper |\n| `max_concurrent_search` | int | 10 | Maximum number of queries that can be issued in parallel in a single `search_web` call. Queries beyond this limit are ignored |\n| `max_response_chars` | int \\| float(\"+inf\") | 20_000 | Truncate combined search results and individual scan/open outputs to this length in characters |\n| `serper_timeout` | float | 15 | Timeout for search |\n| `log_level` | str \\| int | \"INFO\" | Logging level for DeepDive loggers (e.g., \"DEBUG\", \"INFO\") |\n| `finish_with_tool` | bool | True | If `True`, the model will finish via the `finish` tool; if `False`, it will provide the answer in its final output inside \"\\boxed{...}\". In both cases, the fallback is the full final completion |\n| `open_max_workers` | int | 64 | Number of threads for URL fetching and HTML/PDF parsing |\n| `open_max_concurrency` | int | 64 | Max concurrent URL fetches per process |\n| `open_max_connections` | int | 256 | Max pooled HTTP connections per process |\n| `open_max_connections_per_host` | int | 0 | Max pooled HTTP connections per host (0 = unlimited) |\n| `cache_dir` | str \\| None | None | Directory for disk cache. For multi-node setups, use a shared filesystem path. Falls back to `DEEPDIVE_CACHE_DIR` env var, then `/tmp/deepdive_cache` |\n| `cache_size_limit_gb` | int | 10 | Cache size limit in GB. Old entries are evicted when the limit is reached |\n| `cache_ttl_seconds` | int | 604800 | Cache entry TTL in seconds (default: 1 week). Entries are re-fetched after expiry |\n| `cache_shards` | int | 8 | Number of SQLite shards for diskcache (higher reduces contention) |\n| `in_memory_cache_max_bytes` | int | 16_777_216 | Per-process in-memory cache size limit in bytes (0 disables) |\n| `in_memory_cache_max_entry_bytes` | int | 200_000 | Max entry size (bytes) stored in the in-memory cache |\n\n### Metrics\n\n| Metric        | Meaning                                                        |\n| ------------- | -------------------------------------------------------------- |\n| `correct`     | 1 if the model's answer is judged correct, 0 otherwise         |\n| `judge_confidence` | Confidence score of the judge's answer |\n| `model_confidence` | Confidence score of the model's answer |\n\nThe main `reward` metric is identical to `correct`, which returns 1.0 if the model's answer is judged correct.\n\n## Changelog\n\n- 0.1.2: Add a startup cache smoketest (write/read/delete round-trip) so misconfigured caches (wrong dir, no write permission, full disk, corrupt SQLite) raise a clear `RuntimeError` from `configure_cache` instead of silently turning every fetch into a cache-flavored error. Also shorten the TTL for cached fetch errors from `cache_ttl_seconds` (1 week) to a new `error_cache_ttl_seconds` (60s default) so transient failures don't pin a URL as broken; errors are no longer mirrored into the no-TTL in-memory cache.\n- 0.1.1: Add `max_concurrent_search`.\n","encoding":"utf-8","truncated":false,"total_bytes":4027},"status":null}