{"data":{"kind":"file","path":"README.md","version_id":"ry9t04pw47ccokge60pzfoj1","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6407,"modified_at":"2026-03-15T19:16:39.799000","content_hash":"f6b5c5a7cfd3489efc3e8272a3b6ef8e7f94c852c9d296f615440f6c3a4b15e7"},"entries":[],"content":"# Autoresearch\n\nAutonomous LLM training research as a Verifiers environment, trainable with prime-rl.\n\nThe agent gets a sandbox with the [autoresearch](https://github.com/karpathy/autoresearch) repo. It may only edit `train.py` and run 5-minute training experiments via the root tool. Goal: **minimize validation bits-per-byte (val_bpb)**. Reward: `1 / (1 + best_val_bpb)` (lower val_bpb ⇒ higher reward).\n\n## Overview\n\n- **Environment ID**: `autoresearch`\n- **Type**: RLMEnv — Bash REPL + root tools (e.g. `run_training_tool`) + optional sub-LLM\n- **Goal**: Achieve the lowest validation bits-per-byte on the fixed eval set by modifying `train.py` and running experiments.\n- **Key tools**:\n  - **call_bash_repl**: Edit files (e.g. `train.py`) and run shell commands in the sandbox.\n  - **run_training_tool()**: Root tool that runs one 5-minute experiment (`uv run train.py` in the repo), parses `val_bpb` from output, updates best/run counts in state, and returns val_bpb and log summary.\n  - **llm_batch()**: Optional sub-LLM for subtasks (e.g. summarising logs).\n- **Completion**: When done, the agent must export `RLM_READY=1` and set `RLM_CONTENT` to a short summary of the best val_bpb.\n- **Metrics**: `num_runs` (training runs executed), `best_val_bpb` (best val_bpb achieved), plus RLM monitor metrics.\n\n## Quickstart\n\n### Run (from repo root)\n\nRequires Prime Sandboxes (and optionally GPU). Set `PRIME_API_KEY` and configure endpoints in `configs/endpoints.toml` if needed:\n\n```bash\nprime eval run autoresearch -n 2 -m openai/gpt-4.1-mini\n```\n\nWith custom config (e.g. more turns, fewer examples):\n\n```bash\nprime eval run autoresearch -n 2 -m openai/gpt-4.1-mini -a '{\"max_turns\": 20, \"num_examples\": 3}'\n```\n\n### Run (from `autoresearch/`)\n\n```bash\ncd autoresearch\nuv sync\nuv run prime eval run autoresearch -n 2 -m openai/gpt-4.1-mini\n```\n\n### Programmatic\n\n```python\nfrom autoresearch.autoresearch import load_environment\n\nenv = load_environment()\n# or with options:\nenv = load_environment({\"max_turns\": 20, \"num_examples\": 3})\n```\n\n### Hosted RL\n\nTo use the environment with [Lab Hosted Training](https://docs.primeintellect.ai):\n\n1. Push the environment to the Hub (once):\n   ```bash\n   prime env push --path ./autoresearch -v PRIVATE\n   ```\n2. Use your Hub env ID (e.g. `YOUR_USERNAME/autoresearch`) in your RL config and run:\n   ```bash\n   prime rl run <your-config.toml> -e WANDB_API_KEY -e OPENAI_API_KEY\n   ```\n\n## Config\n\n`load_environment(config, *, dataset_builder=None, **kwargs)` accepts:\n\n- **config**: A `Config` instance, a dict (e.g. from `prime eval run -a '...'`), or `None`. Dict keys can be overridden by **kwargs**.\n- **dataset_builder**: Optional callable `(num_examples: int) -> Dataset`. If provided, it is called with `num_examples` to build the dataset; each row must have `\"question\"`, `\"task\"`, and optionally `\"info\"`. If `None`, a default synthetic dataset is used (same question repeated `num_examples` times).\n\nConfig fields (when passing a dict or `Config`):\n\n| Field | Type | Default | Description |\n|-------|------|---------|-------------|\n| **max_turns** | int | 10 | Max agent turns (each turn can include tool calls). |\n| **docker_image** | str | `\"ghcr.io/astral-sh/uv:python3.12-bookworm\"` | Sandbox Docker image (needs git, uv; use CUDA image if `gpu_count > 0`). |\n| **working_dir** | str | `\"/workspace/autoresearch\"` | Path inside the sandbox to the autoresearch repo. |\n| **timeout_per_command_seconds** | int | 30 | Timeout for Bash REPL commands (e.g. edits). |\n| **timeout_per_training_run_seconds** | int | 600 | Timeout for each `run_training_tool()` run (`uv run train.py`). |\n| **gpu_count** | int | 1 | GPUs for the sandbox (1 recommended for training). |\n| **memory_gb** | int | 16 | RAM for the sandbox. |\n| **disk_size_gb** | int | 20 | Disk for the sandbox. |\n| **sandbox_timeout_minutes** | int | 60 | Sandbox lifetime timeout (minutes). |\n| **sandbox_cpu_cores** | int | 1 | CPU cores for the sandbox. |\n| **start_command** | str \\| None | None | If set, used as the sandbox start command; else default: clone repo, `uv sync`, `uv run prepare.py --num-shards 2`, then `exec tail -f /dev/null`. |\n| **num_examples** | int | 5 | Number of dataset rows (for default synthetic dataset). |\n| **repo_url** | str | `\"https://github.com/karpathy/autoresearch.git\"` | Git URL for the autoresearch repo. |\n| **context_dir_name** | str | `\"contexts\"` | Directory name for per-example context (e.g. `contexts/0/`, `contexts/1/`). |\n\nValidation: `num_examples >= 1`, `timeout_per_command_seconds >= 1`, `timeout_per_training_run_seconds >= 1`.\n\n## Dataset rows and `info` (context_dir / context)\n\nRLMEnv passes each dataset row’s **info** into `state[\"info\"]` and uses it when building the REPL filesystem:\n\n- **info[\"context_dir\"]**: Path to a directory on the host. RLMEnv copies this directory into the rollout’s REPL fs before the worker starts (e.g. per-example configs or data).\n- **info[\"context\"]**: Optional JSON-serializable data; RLMEnv writes it to a file in the REPL fs (legacy builtin context).\n\nDefault rows use `info: {}`. To supply context, use **dataset_builder** and return rows with `\"info\": {\"context_dir\": \"/path/to/dir\"}` or `\"info\": {\"context\": {...}}`. Prefer placing context under **contexts/** (e.g. `contexts/0/`, `contexts/1/`) and set `info[\"context_dir\"]` to the resolved path.\n\n## Sandbox setup\n\nBy default, `load_environment()` builds a **start_command** that: clones the autoresearch repo into **working_dir**, runs `uv sync` and `uv run prepare.py --num-shards 2`, then keeps the container alive with `exec tail -f /dev/null`. For production you may want:\n\n- A **custom Docker image** with the repo and data pre-baked (set `docker_image` and `start_command`).\n- **GPU**: `gpu_count=1` and a CUDA image so `uv run train.py` can use the GPU.\n\n## Reward and metrics\n\n- **Reward**: `1 / (1 + best_val_bpb)`. Lower val_bpb ⇒ higher reward. If no successful run with a parsed val_bpb, reward is 0.\n- **num_runs**: Number of training runs executed in the rollout (from `run_training_tool`).\n- **best_val_bpb**: Best val_bpb achieved (lower is better; -1 if none).\n\n## Requirements\n\n- Prime Sandboxes (and optionally GPU quota) for running training in the sandbox.\n- No extra API keys beyond what Verifiers/prime use for inference.\n\n## Development\n\n- **Package manager**: `uv`\n- **Lint**: `ruff` (if configured)\n","encoding":"utf-8","truncated":false,"total_bytes":6407},"status":null}