{"data":{"kind":"file","path":"README.md","version_id":"on0hllk0m0hkg033fo3ydqtl","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":9559,"modified_at":"2026-06-01T19:55:33.114000","content_hash":"bbfc074c9fa314597cf59df194e52cfcab97896be1a62b04dc84e8a9d820c9cd"},"entries":[],"content":"# general-agent\n\n<a href=\"https://github.com/PrimeIntellect-ai/research-environments/tree/main/environments/general_agent\">\n<img src=\"https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white\" alt=\"Source Code\">\n</a>\n\nA **self-growing** toolbench environment for training agents on diverse tool-use tasks. Unlike a fixed-corpus benchmark, `general-agent` ships the full loop that evolves the corpus itself:\n\n> **synthesizer → solver → synthesizer → ...**\n>\n> A synthesizer agent designs a new task family and its tiers (t0 → t4); each tier's difficulty is measured live by running the local solver against it. Tiers whose pass rate lands in the target band are kept; the rest are dropped. Hard tiers feed back as the starting point for the next wave of extensions, letting the corpus drift progressively harder over time.\n\nShips a large corpus of fully synthetically-created tasks produced by this loop, plus three solver backends (Local / OpenCode / RLM) for training and evaluation.\n\n### Corpus\n\n**4,417 tasks across 1,037 families**, distributed across 5 difficulty tiers\n(synthesized using GLM-5.1 with `gpt-5-mini` for local-solver pass-rate gating):\n\n| tier | tasks | avg tools | avg gold steps | gpt-5-mini (local, k=20) | GLM-5.1-FP8 (rlm, k=50) |\n|---|---:|---:|---:|---:|---:|\n| t0 | 1,035 | 6.3 | 2.5 | 0.927 | 0.961 |\n| t1 |   978 | 9.0 | 8.7 | 0.757 | 0.928 |\n| t2 |   928 | 11.4 | 13.3 | 0.601 | 0.895 |\n| t3 |   803 | 13.4 | 17.2 | 0.407 | 0.863 |\n| t4 |   673 | 14.9 | 20.5 | 0.251 | 0.792 |\n| **all** | **4,417** | | | **0.627** (n=3,958) | **0.896** (n=4,417) |\n\nRun `uv run general-agent stats` for the live numbers and the per-tier\npass-rate distribution.\n\n### Overview\n\n- **Environment IDs**:\n  - `general-agent` (default → local solver)\n  - `general-agent-solver-local` — Local (in-process) solver, no sandbox\n  - `general-agent-solver-opencode` — OpenCode agent in a sandbox, tools via local MCP server\n  - `general-agent-solver-rlm` — RLM agent in a sandbox, programmatic tools via skills\n  - `general-agent-synth` — OpenCode agent creating new task families end-to-end\n- **Short description**: Agent environment for multi-turn tool-use tasks with DB-hash and `verify(db)` scoring.\n- **Tags**: `agent`, `multi-turn`, `tool-use`\n\n### Docs\n\n- **[CLI](docs/cli.md)** — `general-agent` subcommands: [`list`](docs/cli.md#list) · [`show`](docs/cli.md#show) · [`validate`](docs/cli.md#validate) · [`stats`](docs/cli.md#stats) · [`serve`](docs/cli.md#serve).\n- **[Reference](docs/reference.md)** — architecture and the synth↔solver loop: [Phase 1: Synthesize](docs/reference.md#phase-1-synthesize), [Phase 2: Solve](docs/reference.md#phase-2-solve), [Architecture diagrams](docs/reference.md#architecture).\n- **[Tests](docs/tests.md)** — test layout + fixtures.\n\n### Quickstart\n\n```bash\n# Local solver (fastest — no sandbox)\nuv run vf-eval general-agent -a '{\"task\":\"calendar_scheduling_t0\"}' -m gpt-5-mini\n\n# OpenCode solver (sandboxed MCP)\nuv run vf-eval general-agent-solver-opencode -a '{\"task\":\"calendar_scheduling_t0\"}' -m gpt-5-mini\n\n# RLM solver (sandboxed IPython kernel + skills)\nuv run vf-eval general-agent-solver-rlm \\\n  -a '{\"task\":\"calendar_scheduling_t0\",\"local_checkout\":\"~/rlm\"}' -m gpt-5-mini\n\n# Synthesize a new task family (~5-10 min)\nuv run vf-eval general-agent-synth -m gpt-5-mini\n```\n\n### CLI\n\n```bash\nuv run general-agent list                   # compact per-family summary\nuv run general-agent show                   # show a random task\nuv run general-agent show <task_or_family>  # show a specific task (or all tiers in a family)\nuv run general-agent validate               # replay gold + verify for every task (exit 1 on failure)\nuv run general-agent validate --fail-only   # just the failing task names, one per line\nuv run general-agent stats                  # corpus difficulty, complexity, distribution\nuv run general-agent serve <task>           # stdio MCP server for one task's tools\n```\n\n### Task layout\n\nEach task directory under `tasks/<name>/` contains:\n\n```\ntasks/calendar_scheduling_t0/\n├── task.toml            # metadata: tier, parent, description, difficulty_methods, [[pass_rates]]\n├── db.json              # initial database state\n├── tools.py             # TaskDB(DB), TaskTools(Tools) with @tool methods + verify(db) -> float\n├── gold.json            # canonical tool-call chain\n└── instruction.md       # agent-facing prompt\n```\n\nTiers 0-4 within a family are not required to share a schema. Higher tiers are typically a **superset** of the parent's — new tables, new `@tool` methods, richer `verify()` — plus constraints declared in `difficulty_methods` (e.g. `cross_entity_coupling`, `schema_extension`, `multi_step_reasoning`) that the higher-tier instruction forces the agent to navigate.\n\n### Environment Arguments\n\nSolver envs (`general-agent`, `general-agent-solver-*`):\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `task` | str \\| null | null | Exact task (`calendar_scheduling_t0`) or family (`calendar_scheduling`). |\n| `min_tier` / `max_tier` | int \\| null | null | Filter by tier difficulty (0-4). |\n| `min_pass_rate` / `max_pass_rate` | float | 0.0 / 1.0 | Filter tasks by recorded pass rate. Defaults are a no-op; narrowing excludes tasks lacking a measurement under `pass_rate_key`. |\n| `pass_rate_key` | (str, str) | `(\"openai/gpt-5-mini\", \"local\")` | (model, solver) tuple selecting which `[[pass_rates]]` entry to compare against. |\n| `tasks_dir` | path | bundled `tasks/` | Override the task corpus directory. |\n| `max_turns` | int | 100 | Hard stop for agent turns. |\n| `timeout_seconds` | float | 3600.0 | Per-rollout wall clock cap. |\n| `sandbox_labels` | list\\[str\\] \\| null | null | **OpenCode/RLM only.** Labels visible in the Prime sandbox dashboard. |\n| `local_checkout` | str \\| null | null | **RLM only.** Path to a local `rlm` checkout; avoids cloning from GitHub. |\n| `behavior_judge_model` | str \\| null | null | **RLM only.** Enables behavior-only reward shaping for solved rollouts when set. |\n| `behavior_judge_base_url` | str \\| null | `https://api.pinference.ai/api/v1` | **RLM only.** Behavior judge API base URL. |\n| `behavior_judge_api_key_var` | str \\| null | `PRIME_API_KEY` | **RLM only.** Env var containing the behavior judge API key. |\n| `behavior_judge_sampling_args` | dict \\| null | null | **RLM only.** Extra sampling args for the behavior judge request. |\n| `behavior_reward_alpha` | float | 1.0 | **RLM only.** Weight on behavior reward; `task_reward = max(db_hash, verify)`, `behavior_reward` is `0.0` unless `task_reward == 1.0`, and `final_reward = task_reward + alpha * behavior_reward`. |\n\nSynthesizer env (`general-agent-synth`):\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_tier` | int | 4 | Highest tier the synth agent should produce (0 = seed only). |\n| `synthesizer_model` | str | `gpt-5-mini` | Model for the synth OpenCode agent. |\n| `solver_model` | str | `openai/gpt-5-mini` | Model the synth uses for local-solver pass-rate gating. |\n| `solver_base_url` | str | `https://api.pinference.ai/api/v1` | API base URL for the solver model. |\n| `solver_api_key_var` | str | `PRIME_API_KEY` | Env var name holding the solver API key. |\n| `timeout_seconds` | float | 36000.0 | Per-rollout wall clock cap. |\n| `sandbox_labels` | list\\[str\\] \\| null | null | Labels visible in the Prime sandbox dashboard. |\n| `skip_extract` | bool | false | If true, don't pull generated tasks into the local `tasks/` dir (smoke tests). |\n\n### Metrics\n\nSolver envs (`general-agent`, `general-agent-solver-*`):\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` / `score` | `max(db_hash, verify)` — reward from whichever check passed. |\n| `db_hash` | 1.0 if the agent's final DB matches the gold-solution DB hash. |\n| `verify` | `verify(db) -> float` from the task's `tools.py` on the agent's final DB. |\n| `num_turns` | Agent turns in the rollout. |\n| `total_tool_calls` | Tool calls emitted by the agent (per-tool counts surfaced alongside). |\n| `sandbox_oom` / `sandbox_timeout` / `agent_timeout` / `agent_error` | Sandboxed solvers only (OpenCode / RLM). Binary failure flags. |\n| `rlm_*` | RLM only. Session stats — `rlm_turns_since_last_summarize`, `rlm_branch_output_tokens_mean`, etc. |\n\nSynthesizer env (`general-agent-synth`):\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` / `success` | Fraction of extracted tasks that passed `general-agent validate` — gated to 0.0 if the family has fewer than 5 unique `difficulty_methods`, or if `skip_extract=true`. |\n\n### Changelog\n\n#### v0.1.4\n- Render the behavior judge's user prompt as a plaintext\n  `[role]\\n<content>` conversation built from\n  `state[\"prompt\"] + state[\"completion\"]` instead of dumping the raw\n  trajectory JSON. Tool calls render as\n  `[tool_call: <name>]\\n<arguments>`. Reasoning fields\n  (`reasoning_content`, `thinking_blocks`) are omitted by construction —\n  behavior is judged on the agent's observable actions, not its private\n  chain-of-thought, and this keeps the 60k-char budget from being eaten\n  by verbose reasoning traces on reasoning-capable models.\n\n#### v0.1.3\n- Persist the behavior judge prompt to rollout state under\n  `behavior_judge_prompt` (`{\"system\", \"user\"}`). Useful for inspecting\n  exactly what the judge sees — e.g. confirming whether agent\n  `reasoning_content` makes it into the judged trajectory. Save it with\n  `vf-eval -C behavior_judge_prompt`.\n","encoding":"utf-8","truncated":false,"total_bytes":9559},"status":null}