{"data":{"kind":"file","path":"README.md","version_id":"osfdc2c5x3a1fp014n4z41pz","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5363,"modified_at":"2026-05-29T14:50:23.213000","content_hash":"a389b3cfa38b11b3ab991c9402d255b5fd7a96dc0c599840073ab583d1e5fede"},"entries":[],"content":"# spec_rl — code RL on a DFlash-speculated endpoint\n\nA small [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) environment\nfor the combined hackathon thesis:\n\n> **Lossless DFlash speculative decoding makes RL post-training cheaper.**\n\n`spec_rl` is a HumanEval-style code-completion task. The policy model\n(Laguna XS.2) is given a function signature + docstring and must write the body.\nThe `@vf.reward` `code_reward` function executes that body against the problem's\nunit tests and returns the **fraction of assertions that pass** (a value in\n`[0,1]`) via `fraction_passing(problem, text)`. This is a *unit-test-grounded,\nverifiable, dense* reward — exactly the kind verifiers RL is built for. A\nfractional (rather than binary all-or-nothing) reward avoids GRPO all-zero-group\nadvantage collapse on hard prompts, where every rollout would otherwise score\n`0.0`. The reported pass@1 **eval** stays binary (`evals/humaneval_subset.py`):\nreward is the learning signal, eval is the scoreboard.\n\n## The point\n\n`verifiers` runs RL rollouts against an OpenAI-compatible endpoint declared in\n`./configs/endpoints.toml`. Point that endpoint at the **DFlash-speculated vLLM\nserver** instead of a plain one and you get the **same reward curve at higher\nrollout throughput**:\n\n- Speculative decoding is **lossless** under greedy decoding. The 0.6B DFlash\n  drafter proposes `num_speculative_tokens = 7` tokens; the target model\n  (Laguna XS.2) verifies them, so accepted text is **token-identical** to the\n  no-speculator baseline.\n- The reward depends only on the generated text, so an identical reward signal\n  is produced.\n- Only the **cost per rollout** drops (fewer target-model forward passes per\n  accepted token → higher tokens/sec → cheaper RL).\n\nThat is the measurable claim: feed the same env two endpoints (baseline vs\nDFlash), show one reward curve, two throughputs.\n\n## How the reward works\n\n1. The dataset carries each HumanEval problem's original `prompt` (signature +\n   docstring), `test` (the `check(candidate)` harness), and `entry_point` in\n   `info` — so the grader never depends on the model echoing the signature.\n2. The model's completion is trimmed at the first stop sequence\n   (`\\nclass `, `\\ndef `, `\\n#`, `\\nif __name__`) so a chatty model can't smuggle\n   a second definition past the grader. This matches `evals/humaneval_subset.py`.\n3. `spec_rl.fraction_passing()` assembles `prompt + completion + test +\n   check(entry_point)` and runs it in a **fresh `python` subprocess with an 8s\n   wall-clock timeout**, isolated from the rollout worker. It AST-instruments each\n   `assert` in the HumanEval `check()` (via `_AssertCounter`) so a failing assert\n   is **counted in the denominator instead of aborting on the first failure** —\n   this also makes loop-based checks fractional. The reward is `passed_asserts /\n   total_asserts`, a value in `[0,1]`. A crash, exception, or timeout before any\n   assertion runs → `0.0`; every assertion passing → `1.0`.\n\nThe execution + pass/fail logic is plain stdlib and importable without\n`verifiers` or a GPU, so it is unit-testable locally on Apple Silicon. A built-in\nsmoke test runs with:\n\n```bash\npython spec_rl.py   # checks passing / failing / timeout completions\n```\n\n> **Safety:** this executes model-generated code to grade it. Each candidate\n> runs in a short-lived, isolated subprocess. Run RL rollouts only in the\n> disposable venue sandbox, never against real data.\n\n## Layout\n\n```\nspec_rl/\n  spec_rl.py      # load_environment(num_examples=20) -> vf.Environment\n  pyproject.toml  # name = \"spec-rl\", depends on verifiers + datasets\n  README.md\n```\n\n`load_environment(num_examples=20)` builds a `vf.SingleTurnEnv` over the first\n`num_examples` HumanEval problems with a `vf.Rubric` wrapping the `@vf.reward`\n`code_reward` function (which scores via `fraction_passing`).\n\n## Run it\n\nInstall the env, then evaluate Laguna XS.2 through it:\n\n```bash\nprime env install spec_rl\nprime eval run spec_rl -m poolside/Laguna-XS.2 -n 20\nprime eval view\n```\n\n`-m poolside/Laguna-XS.2` resolves to whatever endpoint you alias in\n`./configs/endpoints.toml`. To show the cheaper-rollout result, define two\naliases pointing at the same model — one plain vLLM server, one DFlash-speculated\nserver — and run the eval against each:\n\n```toml\n# configs/endpoints.toml\n[[endpoint]]\nendpoint_id = \"laguna-baseline\"\nmodel = \"poolside/Laguna-XS.2\"\nurl = \"http://<baseline-vllm-host>:8000/v1\"\nkey = \"VLLM_API_KEY\"\ntype = \"openai_chat_completions\"\n\n[[endpoint]]\nendpoint_id = \"laguna-dflash\"\nmodel = \"poolside/Laguna-XS.2\"\nurl = \"http://<dflash-vllm-host>:8000/v1\"\nkey = \"VLLM_API_KEY\"\ntype = \"openai_chat_completions\"\n```\n\nThe DFlash server is launched with the speculator config:\n\n```bash\nVLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \\\n  --speculative-config '{\"model\":\"poolside/Laguna-XS.2-speculator.dflash\",\"num_speculative_tokens\":7,\"method\":\"dflash\"}'\n# vLLM >= 0.21.0, parsers poolside_v1; vLLM does NOT need --trust-remote-code.\n```\n\nThen:\n\n```bash\nprime eval run spec_rl -m laguna-baseline -n 20\nprime eval run spec_rl -m laguna-dflash   -n 20\n```\n\nIdentical reward, higher throughput on the DFlash run. Read realized acceptance\nlength (tau) and tokens/sec from the DFlash server's `/metrics` — these are\n**measured at the venue**, not quoted from any published figure.\n","encoding":"utf-8","truncated":false,"total_bytes":5363},"status":null}