{"data":{"kind":"file","path":"README.md","version_id":"y2llejwn8izm7eo7edvjcfw0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":18444,"modified_at":"2026-06-13T15:46:35.337000","content_hash":"a99e07f3d5e21cecec735e608ac953dde66bed5d2babec93d33da21560a4bf4c"},"entries":[],"content":"# lean-verifier-env\n\nLean 4 theorem-proving environment for Prime Intellect / `verifiers`, built around a\nstrict verifier reward. The model receives a theorem ending in `:= by`; the environment\nchecks the model's proof with either hosted AXLE or a local Lean REPL pool. Reward is\n`1.0` only when the submitted proof verifies.\n\n### Overview\n\n- **Environment ID**: `pradheep/lean-verifier-env`\n- **Version**: `0.9.4`\n- **Tags**: lean, theorem-proving, verification, axle, eval, train, rlm\n- **Primary use**: verifier-backed Lean RLM/RL experiments with a controlled action\n  space.\n- **Not the same niche as `opencode-lean`**: `opencode-lean` is a full sandboxed\n  OpenCode agent environment with bash/edit tools. This env is a verifier-first reward\n  lab: smaller action space, easier reward audits, and cheaper local iteration.\n\n### Quickstart\n\n```bash\n# from the repo root\nuv run --project environments/lean-verifier-env --extra dev python -m pytest tests\nuv run --project environments/lean-verifier-env vf-eval lean-verifier-env -n 5 -r 1\n```\n\nFor local Lean checking:\n\n```bash\nexport LEAN_CHECKER_BACKEND=local\nexport LEAN_REPL_CMD=/path/to/repl/.lake/build/bin/repl\nexport LEAN_PROJECT_DIR=/path/to/mathlib-project\nexport LEAN_POOL_SIZE=2\n```\n\nFor hosted AXLE checking:\n\n```bash\nexport AXLE_API_KEY=your-api-key\nexport LEAN_CHECKER_BACKEND=axle\n```\n\n#### Inference providers\n\nThe env is provider-agnostic — the policy model is supplied by the harness, so\npick any provider with `prime eval run -p <provider>`:\n\n```bash\n# Prime Intellect inference (uses your `prime login` account; no extra key needed)\n# Model ids come from the Prime Inference catalog (GET api.pinference.ai/api/v1/models).\nprime eval run pradheep/lean-verifier-env -m openai/gpt-oss-120b -p prime\n# ...or pin models to Prime Inference via the shipped registry:\nprime eval run pradheep/lean-verifier-env -m openai/gpt-oss-120b -e configs/endpoints.toml\n\n# OpenRouter (needs OPENROUTER_API_KEY)\nprime eval run pradheep/lean-verifier-env -m <model> -p openrouter\n```\n\n`-p prime` resolves `https://api.pinference.ai/api/v1` + `PRIME_API_KEY`\n(falling back to `~/.prime/config.json`); `-p openrouter` needs\n`OPENROUTER_API_KEY`. The in-process leaderboard takes the same choice via\n`--provider {prime,openrouter,openai}` (see Development).\n\n### Architecture\n\nThe package exposes `load_environment` and supports five interaction modes:\n\n| Mode | Env | Behavior | Best use |\n|------|-----|----------|----------|\n| `single` | `SingleTurnEnv` | Model submits one proof; env verifies once | evals, dataset checks, cheap baselines |\n| `repair` | `LeanRepairEnv` | Env returns Lean diagnostics after failed attempts | multi-turn proof repair experiments |\n| `agentic` | `LeanToolEnv` | Model calls `lean_check(proof)` during rollout | RLM/tool-use training with a tight tool schema |\n| `step` | `StepLeanEnv` | Model emits one tactic per turn; env returns the new proof state | LeanDojo-style tactic-level interaction (local only) |\n| `rlm` | `RLMEnv` (Prime) | Root model recurses (`llm_batch`) + verifies **in-sandbox** via a baked Lean+Mathlib image | recursive subgoal decomposition on hard theorems |\n\nAll modes share the same scoring path:\n\n1. Build the Lean file from imports + theorem + model proof.\n2. Reject `sorry`, `admit`, local axioms, `unsafe`, and metaprogramming before checking.\n3. Run the configured checker.\n4. Return `1.0` only when Lean verifies the proof.\n5. On the local backend, a verified proof is additionally **axiom-audited**\n   (`#print axioms`): rejected unless its axiom set ⊆ {`propext`, `Classical.choice`,\n   `Quot.sound`}, catching `sorry`/`native_decide`/user-axiom smuggling the regex misses.\n\n**Reward shaping.** `reward_mode` adds partial credit so weak models have a gradient before\nthey can fully prove (defeats the zero-advantage filter): `binary` (default), `shaped`\n(syntax/typecheck credit), or `process` (dense first-error progress).\n\n### RLM mode (recursive, in-sandbox verification)\n\n`mode=\"rlm\"` is the recursive-language-model path. Instead of asking one model to emit a\nsingle proof string, the environment wraps Prime's `RLMEnv` and gives the root model a\nsandboxed Python REPL. The root model can decompose the theorem into smaller Lean subgoals,\ncall `llm_batch()` to ask sub-model instances for those pieces, verify candidate Lean code,\nand then assemble one final proof for the original theorem.\n\nThe intended loop is:\n\n1. The root model receives the original theorem statement.\n2. It proposes `have`-style subgoals or helper lemmas.\n3. It dispatches those subgoals to sub-model calls through `llm_batch()`.\n4. It checks returned Lean snippets with `lean_check` inside the sandbox.\n5. It assembles a complete Lean file for the original theorem.\n6. The env rubric gives reward only if the final assembled proof verifies.\n\nVerification has two layers:\n\n- **In-sandbox process feedback**: with `in_sandbox_verify=True` (default), the sandbox\n  image contains Lean 4, Mathlib, and a `lean_check` CLI. The root model can run\n  `lean_check` during its reasoning loop without needing Lean in Prime's host env\n  container.\n- **Final reward scoring**: the rubric still re-checks the final answer through this env's\n  normal verifier path. For hosted Prime runs, use `scoring_backend=\"axle\"` because the\n  env container does not have local Lean installed. For local runs, scoring can use the\n  local backend if you provide `LEAN_REPL_CMD` and `LEAN_PROJECT_DIR`.\n\nThis makes RLM different from the other modes:\n\n| Mode | What the model controls | Feedback shape |\n|------|--------------------------|----------------|\n| `single` | One proof attempt | Final verifier reward only |\n| `repair` | One proof per turn | Lean diagnostics after failed attempts |\n| `agentic` | Calls `lean_check(proof)` directly | Whole-proof verifier tool feedback |\n| `step` | One tactic per turn | Live proof-state feedback |\n| `rlm` | Recursive subproblem calls + final assembly | Sub-model proofs, sandbox checks, final strict reward |\n\nUse RLM when the theorem is too hard for a direct proof-string attempt and you want the\nmodel to explicitly decompose the work. It is heavier than `single`, `repair`, or\n`agentic`, but it can expose useful process structure on MiniF2F/ProofNet/Putnam-style\ntasks.\n\nThe sandbox image is baked from `sandbox/Dockerfile` as `lean-rlm-sandbox`. This needs no\nLean on Prime's host/env container. Because the rubric's final score runs in the env\ncontainer, set `scoring_backend=\"axle\"` for hosted runs so scoring works independently of\nthe in-sandbox verification:\n\n```bash\n# -p prime for Prime Intellect inference, or -p openrouter (needs OPENROUTER_API_KEY)\nprime eval run pradheep/lean-verifier-env -m <root_model> -p prime \\\n  -a '{\"mode\":\"rlm\",\"sub_model\":\"<sub_model>\",\"scoring_backend\":\"axle\",\"dataset_name\":\"minif2f-test\"}' \\\n  -n 5 --hosted --allow-sandbox-access --allow-tunnel-access --follow\n```\n\nSee `sandbox/README.md` for the image build/push flow and `../../RLM_SANDBOX_ROADMAP.md`\nfor status + remaining upgrades (scoring resilience, anti-cheat parity, perf).\n\n### Verification backends\n\n| Backend | What it does | Strength | Limitation |\n|---------|--------------|----------|------------|\n| `axle` | Calls the hosted Axiom Lean Engine / AXLE API | No local Lean setup; useful for smoke tests and low-throughput evals | Hosted API latency and concurrency limits; not ideal as the main RL backend |\n| `local` | Runs Lean 4 + Mathlib through a persistent `repl` worker pool | Scales with CPU workers; no AXLE request cap; best for RL training | Requires Mathlib cache + `repl` binary; each worker needs significant RAM |\n\nAXLE is kept deliberately as a developer-convenience backend. It is good for quick\nproof-checking and debugging, but the local backend is the intended training path.\n\n### Dataset presets\n\nBuilt-in dataset names:\n\n| Name | Source | Use | Notes |\n|------|--------|-----|-------|\n| `toy` | hand-written smoke split | train/eval | tiny sanity-check set |\n| `lean-workbook` | cleaned Lean Workbook rows | **train** | algebra/tactic-heavy; good for early reward signal; clean (0 overlap with the benchmarks) |\n| `minif2f-valid` | cleaned MiniF2F validation | train/eval | formal olympiad-style validation split |\n| `minif2f-test` | cleaned MiniF2F test | **eval-only** | needs `eval_split=true`; standard MiniF2F test benchmark |\n| `proofnet` | cleaned ProofNet test rows | **eval-only** | needs `eval_split=true`; standard ProofNet benchmark |\n| `putnam` | cleaned Putnam rows | **eval-only** | needs `eval_split=true`; standard Putnam benchmark |\n\n**Dataset hygiene / contamination.** `minif2f-test`, `proofnet`, and `putnam`\n*are* the standard evaluation benchmarks, so they are held out for evaluation:\nloading one for training raises unless you pass `eval_split=true` (env arg). Train\non `lean-workbook` (verified zero statement overlap with miniF2F/ProofNet/Putnam)\nand/or `minif2f-valid`, and keep the `*-test`/`proofnet`/`putnam` presets for\nevaluation:\n\n```bash\n# evaluation on a benchmark (explicit opt-in):\nprime eval run pradheep/lean-verifier-env -m <model> -p prime \\\n  -a '{\"dataset_name\": \"minif2f-test\", \"eval_split\": true}'\n```\n\nYou can also pass a path to a generated/cleaned JSONL file:\n\n```bash\nvf-eval lean_verifier_env \\\n  -a '{\"dataset_name\": \"path/to/tasks.jsonl\", \"max_difficulty\": \"easy\"}'\n```\n\nEach row uses the shared schema:\n\n```json\n{\n  \"id\": \"nat_add_zero\",\n  \"source\": \"toy\",\n  \"imports\": [\"Mathlib\"],\n  \"formal_statement\": \"theorem nat_add_zero (n : Nat) : n + 0 = n := by\",\n  \"proof\": \"simp\",\n  \"difficulty\": \"toy\",\n  \"timeout_s\": 10,\n  \"environment\": \"lean-4.28.0\"\n}\n```\n\n### Evaluation results\n\nThe table below summarizes local-backend, single-turn evals from `outputs/evals/`.\nMetric is verified pass rate (`verified == 1.0`). These are small/medium diagnostic\nruns, not a leaderboard.\n\n| Model | Dataset | Examples | Verified | Run |\n|-------|---------|----------|----------|-----|\n| `openai/gpt-oss-120b` | `lean-workbook` | 25 | **100.0%** | `0a96060c` |\n| `openai/gpt-oss-120b` | `lean-workbook` | 29 | **93.1%** | `ab6656ff` |\n| `qwen/qwen3.5-9b` | `lean-workbook` | 25 | **92.0%** | `a7785432` |\n| `qwen/qwen3.5-9b` | `lean-workbook` | 29 | **89.7%** | `a22f4551` |\n| `deepseek/deepseek-v4-flash` | `lean-workbook` | 30 | **86.7%** | `c80dd9fa` |\n| `openai/gpt-oss-120b` | `minif2f-test` | 57 | **28.1%** | `fcd22677` |\n| `deepseek/deepseek-v4-flash` | `minif2f-test` | 44 | **38.6%** | `191821e7` |\n| `deepseek/deepseek-v4-flash` | `minif2f-test` | 30 | **23.3%** | `10f2a096` |\n| `qwen/qwen3.5-9b` | `minif2f-test` | 30 | **20.0%** | `00954a36` |\n| `openai/gpt-oss-120b` | `proofnet` | 30 | **13.3%** | `2a8318b0` |\n| `deepseek/deepseek-v4-flash` | `proofnet` | 30 | **0.0%** | `9c51fe09` |\n| `openai/gpt-oss-120b` | `putnam` | 30 | **0.0%** | `7756f065` |\n| `deepseek/deepseek-v4-flash` | `putnam` | 23 | **0.0%** | `30d75c9c` |\n\nTakeaways:\n\n- The environment is already useful for measuring proof-string success on easy and\n  medium algebraic Lean Workbook tasks.\n- MiniF2F has a nonzero signal but is much harder.\n- ProofNet and Putnam remain mostly unsolved in these single-turn runs, which makes\n  them better suited for curriculum, repair, or agentic modes.\n\n### Comparison with `opencode-lean`\n\n`primeintellect/opencode-lean` uses `ComposableEnv` with `LeanTaskSet` and the\nOpenCode harness. The model gets bash/edit tools inside a Prime sandbox and edits a\nproof file in a version-matched Mathlib image. It is the better fit for realistic\ncoding-agent theorem proving.\n\nThis env is intentionally narrower:\n\n- The model usually submits proof text, not arbitrary shell commands.\n- `agentic` mode exposes only `lean_check(proof)`; the env injects the real theorem\n  into each tool call.\n- Reward is easier to audit because it goes through a single checker path.\n- AXLE can be used for low-throughput smoke tests; local Lean is used for training.\n\nUse `opencode-lean` when you want repo-like agent behavior. Use `lean-verifier-env`\nwhen you want controlled verifier reward experiments, curriculum construction, and\nRLM training signal design.\n\n### Running modes\n\nSingle turn:\n\n```bash\nvf-eval lean_verifier_env \\\n  -a '{\"dataset_name\": \"minif2f-test\", \"mode\": \"single\", \"checker_backend\": \"local\"}'\n```\n\nRepair loop:\n\n```bash\nvf-eval lean_verifier_env \\\n  -a '{\"dataset_name\": \"minif2f-test\", \"mode\": \"repair\", \"max_turns\": 3, \"checker_backend\": \"local\"}'\n```\n\nAgentic tool-use:\n\n```bash\nvf-eval lean_verifier_env \\\n  -a '{\"dataset_name\": \"minif2f-test\", \"mode\": \"agentic\", \"max_turns\": 4, \"checker_backend\": \"local\"}'\n```\n\n### Training with prime-rl\n\nTraining configs live in `configs/rl/`. The local-backend configs assume a container\nwith Lean, Mathlib, and `repl` installed at the paths used in\n`environments/lean-verifier-env/Dockerfile`.\n\n```bash\nprime rl run configs/rl/lean_agentic_local.toml\nprime rl run configs/rl/lean_repair_local.toml\n```\n\nExecution choices:\n\n| Path | When to use |\n|------|-------------|\n| AXLE backend | smoke tests, quick evals, hosted proof checking without local Lean |\n| local backend | high-throughput RL, many rollouts, avoiding hosted checker caps |\n| OpenCode / sandbox env | long-horizon file-editing theorem proving |\n\n### Development\n\nFrom the repository root:\n\n```bash\nuv run --project environments/lean-verifier-env --extra dev python -m pytest tests\nuv build --project environments/lean-verifier-env\nprime env push --path environments/lean-verifier-env --plain\n```\n\nIn-process tiered leaderboard (pass@1 + pass@k, no `prime eval` subprocess).\n`--provider prime` uses Prime Intellect inference via your `prime login` account;\n`--provider openrouter` needs `OPENROUTER_API_KEY`:\n\n```bash\npython scripts/run_leaderboard.py --provider prime \\\n  --models openai/gpt-oss-120b,meta-llama/llama-3.3-70b-instruct \\\n  --datasets minif2f-test,proofnet,putnam \\\n  --num-examples 20 --rollouts-per-example 4 \\\n  --backend local --lean-repl-cmd /path/repl --lean-project-dir /path/mathlib\n```\n\n### Changelog\n\n#### v0.9.x\n\n- Current line: `0.9.7`.\n- `v0.9.7` — **Contamination guardrail**: the eval-only benchmark presets\n  (`minif2f-test`, `proofnet`, `putnam`) now require `eval_split=true` to load,\n  so a training run can't silently train on a test benchmark. Train on the clean\n  `lean-workbook` split (verified zero overlap with miniF2F/ProofNet/Putnam) or\n  `minif2f-valid`. Presets are labelled train vs eval-only in the README.\n- `v0.9.6` — **Prime Intellect inference support**: the in-process leaderboard\n  (`scripts/run_leaderboard.py`) now takes `--provider prime` (alongside\n  `openrouter`/`openai`), reusing verifiers' `setup_client`/`ClientConfig` so it\n  picks up `prime login` credentials from `~/.prime/config.json`. Added a shipped\n  `configs/endpoints.toml` registry routing models to Prime Inference, and\n  documented `prime eval run -p prime`.\n- `v0.9.5` — **Local Lean backend fix**: `LocalLeanChecker` now resolves the Lean\n  toolchain bin dir (via `shutil.which`, falling back to `~/.elan/bin`) onto the\n  subprocess PATH for both `lake env printenv` and the `repl` worker, so the\n  backend works when the running venv didn't inherit elan (previously failed with\n  a cryptic \"empty response from Lean REPL\").\n- `v0.9.4` — **Hosted-scoring decoupling**: `load_environment(mode=\"rlm\",\n  scoring_backend=\"axle\")` builds a separate scoring checker so the rubric scores via AXLE\n  in Prime's Lean-less env container while the model verifies in-sandbox. Added the\n  in-process leaderboard harness (`scripts/run_leaderboard.py`) with `pass@1` + `pass@k`\n  (`report.pass_at_k`), replacing the flaky `prime eval` subprocess path.\n- `v0.9.3` — **RLM sandbox image**: baked Lean 4 + Mathlib into a Prime-registry image\n  (`sandbox/Dockerfile`, `lean_check` CLI) so `mode=\"rlm\"` verifies **in-sandbox**\n  (`in_sandbox_verify=True` default); no Lean needed on the host container. `lean_check`\n  rejects `sorry`/`admit`. Validated on real Prime infra.\n- `v0.9.0`–`v0.9.2` — Added `step` mode (`StepLeanEnv`, tactic-by-tactic via the repl\n  proofState API), the `rlm` wrapper over Prime's `RLMEnv`, dense `process`/`shaped`\n  `reward_mode`s, post-verification axiom audit + expanded forbidden-construct filters\n  (`native_decide`, `ofReduceBool`, …), and the Mathlib-syntax modernizer for fair\n  ProofNet/Putnam parsing. Made multi-turn env signatures robust across `verifiers`\n  versions (0.1.5 ↔ 0.1.14).\n\n#### v0.8.x\n\n- Current line: `0.8.3`.\n- Added cleaned dataset presets for Lean Workbook, MiniF2F, ProofNet, and Putnam.\n- Added local-backend eval runs and result reporting.\n- Expanded RLM-oriented modes and configs.\n\n#### v0.7.0\n\n- Added `agentic` mode with a constrained `lean_check(proof)` tool.\n- Added local Lean Dockerfile and prime-rl configs.\n\n#### v0.6.0\n\n- Restructured the package into `environments/lean-verifier-env/` and published a clean\n  Hub source archive.\n\n#### v0.5.0\n\n- Made the environment Hub-ready with package metadata, README documentation, and\n  reproducible build commands.\n- Added the first evaluation report path for comparing model proof success under strict\n  Lean verification.\n- Kept prime-rl integration as a consumer of the stable environment API, not as the\n  source of reward logic.\n\n#### v0.4.0\n\n- Added curriculum-oriented dataset plumbing and task buckets.\n- Added task-generation support from Lean declarations via AXLE extraction tools.\n- Restricted AXLE proof-repair and declaration-extraction APIs to dataset generation\n  workflows, never final reward scoring.\n\n#### v0.3.0\n\n- Added the multi-turn repair environment where the model sees normalized Lean\n  diagnostics and can retry.\n- Preserved strict final scoring: only the model's submitted proof can receive reward.\n- Added tests around retry behavior and diagnostic feedback.\n\n#### v0.2.0\n\n- Added Lean diagnostic normalization and monitor metrics such as syntax/typecheck\n  status, proof length, timeout, and forbidden-construct flags.\n- Added task filtering for unsafe proof shortcuts, including `sorry`, `admit`, local\n  axioms, and unsafe bypasses.\n- Kept the main reward binary: `1.0` for verified proofs, `0.0` otherwise.\n\n#### v0.1.0\n\n- Established the minimal single-turn Lean proof-completion environment.\n- Added fixed-import theorem tasks and AXLE-backed proof checking through `AxleClient`.\n- Added the initial binary reward path using strict Lean verification.\n","encoding":"utf-8","truncated":false,"total_bytes":18444},"status":null}