{"data":{"kind":"file","path":"README.md","version_id":"kk7555vqyk3ul1uc8mrkd7jo","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7519,"modified_at":"2026-05-19T17:19:07.920000","content_hash":"a69fc91a469efb533b7e08b424c3f812b6781cffda7220ef182d28ebc0ba1637"},"entries":[],"content":"# DDBC RLM\n\nRLM (Recursive Language Model) environment for DDBC - BrowseComp with DeepDive tools.\n\n### Overview\n\n- **Environment ID**: `ddbc-rlm`\n- **Short description**: BrowseComp QA using RLM pattern with Google search tools for sub-LLMs.\n- **Tags**: qa, multiturn, search, tool-use, rlm\n\n### How It Works\n\nThis environment uses the Recursive Language Model (RLM) pattern:\n\n1. **Root Model**: Writes Python code in a REPL environment to orchestrate the search process\n2. **Sub-LLMs**: Called via `llm_batch(prompts)` function; have access to `search_web`, `scan_page`, and `open_lines` tools\n3. **Final Answer**: Set via `answer[\"content\"] = \"your answer\"` and `answer[\"ready\"] = True`\n\nThis pattern is useful for complex queries that benefit from decomposition and recursive reasoning.\n\n### Datasets\n\n- **Primary dataset(s)**: BrowseComp, described in [this paper](https://arxiv.org/abs/2504.12516)\n- **Source links**: [Encrypted dataset](https://openaipublic.blob.core.windows.net/simple-evals/browse_comp_test_set.csv)\n- **Split sizes**: 1,266 examples (split into train/eval)\n\n### Setup and Install\n\n```bash\nuv run vf-install ddbc-rlm\n```\n\nYou will also need an API key from [Serper](https://serper.dev/)\n\n### Eval\n\nSet all environment variables required for running the model and judge. 
For example, the judge defaults to Pinference's `openai/gpt-4.1-mini`, so you need to set the `PRIME_API_KEY`:\n\n```bash\nexport PRIME_API_KEY=<your-key>\nexport SERPER_API_KEY=<your-serper-key>\n```\n\nExample evaluation:\n\n```bash\nprime eval run ddbc-rlm -m gpt-4.1-mini -n 5\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_turns` | int | 50 | Max REPL iterations |\n| `sub_model` | str | None | Model for sub-LLM calls (defaults to same as root model) |\n| `max_sub_llm_parallelism` | int | 5 | Max concurrent sub-LLM calls; the RLM can still batch more prompts than this, but their concurrency will be limited by a Semaphore |\n| `max_output_length` | int | 8192 | Max length of code execution output |\n| `code_execution_timeout` | int | 120 | Timeout in seconds for code execution |\n| `abort_on_code_timeout` | bool | False | If True, abort rollout on code timeout; if False, return error to model |\n| `max_startup_wait_seconds` | int | 120 | Max seconds to wait for sandbox worker startup |\n| `pip_install_packages` | str | \"\" | Space-separated packages to install in sandbox |\n| `sandbox_docker_image` | str | \"python:3.11-slim\" | Docker image for sandbox |\n| `sandbox_cpu_cores` | int | 1 | CPU cores for sandbox |\n| `sandbox_memory_gb` | int | 2 | Memory in GB for sandbox |\n| `sandbox_disk_size_gb` | int | 5 | Disk size in GB for sandbox |\n| `sandbox_gpu_count` | int | 0 | Number of GPUs for sandbox |\n| `sandbox_timeout_minutes` | int | 60 | Overall sandbox lifetime in minutes |\n| `sub_llm_max_turns` | int | 5 | Max tool-calling turns for each sub-LLM call |\n| `include_env_tips` | bool | False | Include environment-specific tips in prompt |\n| `prompt_in_context_file` | bool | False | Write the prompt into `context.txt` and leave the user prompt empty |\n| `serper_api_key_var` | str | \"SERPER_API_KEY\" | Env var with Serper API key |\n| `max_search_results` | int | 10 | Maximum 
number of search results from Serper |\n| `max_concurrent_search` | int | 10 | Maximum number of queries issued in parallel per `search_web` call. Queries beyond this limit are ignored |\n| `max_response_chars` | int \\| float | 20_000 | Truncate search results and scan/open outputs to this length |\n| `judge_model` | str | \"openai/gpt-4.1-mini\" | Judge model for evaluation |\n| `judge_api_key_var` | str | \"PRIME_API_KEY\" | Env var with judge API key |\n| `judge_base_url` | str | \"https://api.pinference.ai/api/v1\" | Base URL for judge model API |\n| `serper_timeout` | float | 15 | Timeout for search requests |\n| `open_max_workers` | int | 64 | Number of threads for URL fetching and HTML/PDF parsing |\n| `open_max_concurrency` | int | 64 | Max concurrent URL fetches per process |\n| `open_max_connections` | int | 256 | Max pooled HTTP connections per process |\n| `open_max_connections_per_host` | int | 0 | Max pooled HTTP connections per host (0 = unlimited) |\n| `cache_shards` | int | 8 | Number of SQLite shards for diskcache (higher reduces contention) |\n| `in_memory_cache_max_bytes` | int | 16_777_216 | Per-process in-memory cache size limit in bytes (0 disables) |\n| `in_memory_cache_max_entry_bytes` | int | 200_000 | Max entry size (bytes) stored in the in-memory cache |\n| `redundancy_penalty_weight` | float | 0.0 | Weight for redundancy penalty on similar search queries. 
Computed across all sub-LLM calls |\n| `log_level` | str \\| int | \"INFO\" | Logging level for DDBC RLM loggers (e.g., \"DEBUG\", \"INFO\") |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Accuracy (judge-based) |\n| `judge_confidence` | Confidence score of the judge's answer |\n| `sub_llm_call_count` | Number of sub-LLM calls made |\n| `sub_llm_prompt_tokens` | Total prompt tokens from sub-LLMs |\n| `sub_llm_completion_tokens` | Total completion tokens from sub-LLMs |\n| `sub_llm_total_tool_calls` | Total tool calls made by sub-LLMs |\n| `sub_llm_total_turns` | Total turns (LLM calls) made by sub-LLMs |\n| `sub_llm_batch_count` | Number of `llm_batch()` invocations |\n| `sub_llm_max_batch_size` | Max batch size (peak parallelism) in a single `llm_batch()` call |\n| `sub_llm_mean_batch_size` | Mean batch size across all `llm_batch()` invocations |\n| `main_rlm_turns` | Number of main model REPL turns |\n| `main_rlm_prompt_tokens` | Main model prompt tokens |\n| `main_rlm_completion_tokens` | Main model completion tokens |\n| `repl_total_time_seconds` | Total time spent in the REPL tool |\n| `repl_call_count` | Number of REPL tool calls |\n| `repl_mean_time_seconds` | Mean REPL tool call time |\n| `search_web_mean_queries` | Mean number of queries per `search_web` call |\n| `search_web_error_rate` | Fraction of sub-LLM `search_web` tool calls that returned errors |\n| `scan_page_error_rate` | Fraction of sub-LLM `scan_page` tool calls that returned errors |\n| `open_lines_error_rate` | Fraction of sub-LLM `open_lines` tool calls that returned errors |\n\n## Changelog\n\n- 0.1.6: Add a startup cache smoketest (write/read/delete round-trip) so misconfigured caches (wrong dir, no write permission, full disk, corrupt SQLite) raise a clear `RuntimeError` from `configure_cache` instead of silently turning every fetch into a cache-flavored error. 
Also shorten the TTL for cached fetch errors from `cache_ttl_seconds` (1 week) to a new `error_cache_ttl_seconds` (60s default) so transient failures don't pin a URL as broken; errors are no longer mirrored into the no-TTL in-memory cache. Also fix a leftover logger name in `open_one.py` (`ddbc` → `ddbc_rlm`) so this env's URL-fetch logs are no longer attributed to `ddbc`.\n- 0.1.5: Default judge requests now use Pinference (`https://api.pinference.ai/api/v1`) with `PRIME_API_KEY` and the Pinference-qualified `openai/gpt-4.1-mini` model name.\n- 0.1.4: Add `max_concurrent_search` argument to make the parallel-query limit of `search_web` user-configurable (default unchanged at 10).\n- 0.1.3: Align arg names with simplified RLMEnv (`max_iterations` → `max_turns`, `sub_tool_max_turns` → `sub_llm_max_turns`, sandbox params → `sandbox_*` prefix).\n- 0.1.2: Sandbox labels no longer force in the default label.\n- 0.1.1: Bump to `verifiers>=v0.1.11.dev0` to support new types.\n- 0.1.0: Copy ddbc and introduce the RLM.\n","encoding":"utf-8","truncated":false,"total_bytes":7519},"status":null}