{"data":{"kind":"file","path":"README.md","version_id":"mf4bg3aaadawovp5o2xw6pj0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3844,"modified_at":"2026-06-20T06:20:12.810000","content_hash":"033005192754758e8901127fbe1320c48dc9232e09e1e9ba25f64e818c823ef6"},"entries":[],"content":"# spec-faithfulness — a mechanically-checked spec-writing environment for `verifiers`\n\nA PrimeIntellect [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) environment where the model\nwrites a **formal specification** (Verus `requires`/`ensures` clauses) for a function, and the reward is a\n**differential check against a real [Verus](https://github.com/verus-lang/verus) verifier** — not an LLM\njudge, not a string match.\n\n## The reward: a faithful spec accepts the right impl and rejects the wrong ones\n```\nreward = 0                          if the spec rejects the reference (correct) implementation  (too strong / broken)\n       = (# buggy impls rejected) / (# buggy impls)   otherwise\n       = 1.0                        iff it also rejects every known-buggy implementation        (faithful)\n```\nThis is the mechanical core of **Verus-SpecGym** ([arXiv 2605.26457](https://arxiv.org/abs/2605.26457)): an\nunfaithful spec fails in one of two ways — **too strong** (rejects a correct program) or **too weak** (accepts\na buggy one). Both are caught here with **no expert gold spec and no LLM judge** (the paper reports LLM judges\nmiss ~26% of the failures a verifier catches). The graded reward (e.g. 1.0 / 0.67 / 0.0) is a clean RL signal.\n\n## Why it's hard to game\n- The model supplies **only the spec clauses**; the env fixes the signature and **all** implementations, so\n  the policy cannot prove an easier function.\n- Vacuous specs are **self-defeating**: `ensures true` / `requires false` let the buggy impls verify too, so\n  they reject nothing → low reward. `assume`/`admit`/`external_body`/`no_verify` are blocked outright (and\n  tracked by a `not_cheating` monitor).\n- The check is real Verus/SMT verification, so a plausible-but-unsound spec scores by what it actually rules out.\n\n## Use\n```bash\n# 1. a `verus` binary on PATH (https://github.com/verus-lang/verus — prebuilt container or `rustup`+toolchain)\nexport VERUS_CMD=verus\n# 2. install + eval (the policy needs an inference model; the env itself is GPU-free)\nprime env install spec-faithfulness\nprime eval run spec-faithfulness --model <your-model>\n```\n`load_environment(verus_cmd=None, timeout=60.0)` — `verus_cmd` defaults to `$VERUS_CMD` split (else `[\"verus\"]`).\n\n## Validate the reward without a policy model\n`test_local.py` feeds, for `max`, a faithful spec (→ 1.0), a too-weak spec (→ 0.67, lets a buggy impl pass),\na broken spec that rejects the correct impl (→ 0.0), and an `assume` cheat (→ 0.0) — straight at a real Verus:\n```bash\nVERUS_CMD=verus python test_local.py    # -> faithful 1.0 / too_weak 0.67 / broken 0.0 / cheat 0.0 ; PASS\n```\n\n## Status / limits (v0.1)\n- **Seed problem set (16): `max`/`min`/`max3`/`min3`/`clamp`, `is_even`/`is_odd`, `leq`/`geq`/`equal`/\n  `not_equal`/`in_range`, `logical_and`/`or`/`xor`/`implies`** — each with a correct impl + 3–4 buggy mutants\n  (56 total). Chosen to be clean in Verus (bool / comparison / min-max — no overflow), so a mutant fails for\n  the *spec* reason, not an incidental arithmetic error. Every problem is checked well-formed by `gold_check.py`\n  (a hand-written faithful spec scores exactly 1.0 — correct verifies + all mutants rejected). **Not yet the\n  full Verus-SpecBench (581 Codeforces tasks)** — that integration + auto-mutant-generation is the v0.2.\n- Single-turn (no spec-repair loop yet); a `MultiTurnEnv` that feeds the verifier's counterexample/error back\n  is the natural follow-up (same pattern as the sibling `lean-verifier` env).\n- Pins Verus only by the toolchain you install; record the Verus version for reproducible results.\n- Buggy mutants are hand-authored; some incorrect impls may fail for an incidental reason (e.g. arithmetic\n  overflow) rather than the spec — kept the seed mutants clean, but auto-generated mutants need that guard.\n","encoding":"utf-8","truncated":false,"total_bytes":3844},"status":null}