{"data":{"kind":"file","path":"README.md","version_id":"oldxma120f0et33nsbqicf4q","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6008,"modified_at":"2026-06-17T20:30:52.004000","content_hash":"4a145c75d7cea8cb08c6f673dbcd8b8d94b289f4ce0a330360055032c4973491"},"entries":[],"content":"# concurrency-bench\n\nAn evaluation environment for **implementing stateful concurrency / distributed-systems\nprimitives** in Python from a precise written contract. Each task hands the model a documented stub\nand a behavioral contract; the model must return a complete, correct implementation. Scoring is\n**gated partial credit** over three hidden test tiers — not binary pass/fail — which separates a\nshallow \"happy-path\" solution from one that handles boundaries, monotonic-clock discipline, and\nexact semantics.\n\n## Why this environment\n\nRate limiters, circuit breakers, retry controllers, idempotency caches, quota trackers, lease\npools — these are the small stateful components every backend depends on, and they are exactly the\ncode where off-by-one boundaries, clock misuse, and missing hysteresis cause real production\nincidents. 
Most coding evals grade such tasks with a single pass/fail signal; concurrency-bench\ngrades the *quality* of the implementation across difficulty tiers, giving a smoother training/eval\nsignal.\n\n## Tasks (10)\n\n| id | primitive | what it exercises |\n|----|-----------|-------------------|\n| 01_token_bucket | token-bucket limiter | continuous refill, burst cap, fractional tokens |\n| 02_sliding_window | sliding-window limiter | timestamp eviction, strict window boundary |\n| 03_circuit_breaker | circuit breaker | closed/open/half-open, lazy timeout transition |\n| 04_retry_admission | retry controller | exponential backoff, cap, exhaustion ordering |\n| 05_idempotency_cache | TTL dedup cache | lazy expiry, overwrite-refresh, boundary |\n| 06_leaky_bucket | leaky-bucket limiter | steady drain, clamp at zero, exact capacity |\n| 07_bounded_queue | backpressure queue | watermark **hysteresis** (direction-dependent) |\n| 08_quota_tracker | multi-key quota | origin-aligned fixed windows, key independence |\n| 09_heartbeat_monitor | liveness monitor | expiry boundary, revival, remembered dead nodes |\n| 10_lease_manager | lease pool | bounded concurrency + TTL auto-reclaim |\n\nAll time-based tasks take an injected `now()` clock, so grading is fully deterministic — using\nwall-clock time inside the implementation is a graded failure.\n\n## Scoring\n\nEach task ships three hidden test tiers:\n\n- **pass_to_pass** (contract/API sanity) — a **gate**: every test must pass to earn any credit.\n- **fail_to_pass** (core behaviors) — weight `0.5`.\n- **edge** (boundaries, clock discipline, overflow, hysteresis) — weight `0.3`.\n\n```\nreward = 0.0                                  if any pass_to_pass test fails\n       = 0.2 + 0.5 * f2p_frac + 0.3 * edge_frac   otherwise        # range [0.2, 1.0]\n```\n\nThe `0.2` floor credits a working-but-shallow implementation; the `edge` tier is where strong\nmodels separate from weak ones. 
The environment also reports `gate_passed`, `f2p_fraction`, and\n`edge_fraction` as per-rollout metrics.\n\n## Usage\n\n```python\nimport verifiers as vf\nfrom openai import AsyncOpenAI\n\nenv = vf.load_environment(\"concurrency-bench\")            # all 10 tasks\n# env = vf.load_environment(\"concurrency-bench\", task_ids=[\"01_token_bucket\", \"03_circuit_breaker\"])\n\nclient = AsyncOpenAI()\nresults = env.evaluate(client=client, model=\"gpt-4.1-mini\", rollouts_per_example=1)\n```\n\nOr from the CLI:\n\n```bash\n# Any OpenAI-compatible endpoint works; grading is local and provider-agnostic.\nuv run vf-eval concurrency-bench \\\n  -m llama-3.3-70b-versatile \\\n  -b https://api.groq.com/openai/v1 -k GROQ_API_KEY \\\n  --api-client-type openai_chat_completions \\\n  -n 10 -r 1 --max-tokens 4096\n```\n\n## Calibration\n\nBaseline runs (all 10 tasks, 1 rollout each, `temperature` default), graded by the gated\npartial-credit rubric. Four models spanning a wide capability range:\n\n| Model | Size | Avg reward | Gate pass rate |\n|-------|------|-----------:|:--------------:|\n| `llama-3.1-8b-instant` | 8B | **0.54** | 6/10 |\n| `meta-llama/llama-4-scout-17b` | 17B MoE | **0.89** | 9/10 |\n| `llama-3.3-70b-versatile` | 70B | **0.98** | 10/10 |\n| `openai/gpt-oss-120b` | 120B | **1.00** | 10/10 |\n\nThe reward rises smoothly and monotonically with model capability (0.54 → 0.89 → 0.98 → 1.00),\nwhich is exactly the discrimination an eval should provide. 
The **gate** does real work at the\nlow end — the 8B model produces 4 implementations that don't even satisfy the API contract\n(reward 0.0) and the 17B model produces 1 — while the **edge** tier separates the stronger models on the\ntasks they all nominally pass.\n\nPer-task reward, `llama-3.1-8b-instant` (01→10):\n`1.0, 0.82, 0.94, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.66` — the four zeros are gate failures on\n`idempotency_cache`, `leaky_bucket`, `bounded_queue`, and `quota_tracker`.\n\n**Saturation note:** `gpt-oss-120b` clears every tier (1.00), so the current edge tier is not\nhard enough to separate *frontier-class* models from each other. The next iteration hardens the\nedge tier (tighter boundary/clock-discipline traps, more direction-dependent hysteresis) so the\nstrongest models land below the ceiling and the env yields useful gradient as a training signal.\n\n*(Runs produced via Groq's OpenAI-compatible endpoint; grading is local and provider-agnostic.)*\n\n## Design notes\n\n- **SingleTurnEnv**: the model returns the full module in one code block; the rubric writes it\n  beside the hidden tests and runs each tier in an isolated subprocess (60s timeout per tier).\n- **Reproducible**: deterministic clocks + subprocess grading mean scores are stable across runs\n  and machines; no remote sandbox or network is required to evaluate.\n- **Calibration target**: a weak model should usually clear the gate but stall on `edge`; a strong\n  model should score high but rarely perfect. (As a rule of thumb, ~90%+ for a small model means a\n  task is too easy; ~0% for a frontier model means it is too hard or broken.)\n\n## Roadmap\n\nAn agentic variant (`SandboxEnv`, multi-turn: edit files + run the public tests in a Prime\nsandbox before submitting) and a scale-up to 50+ primitives are planned for a training-grade\nrelease.\n","encoding":"utf-8","truncated":false,"total_bytes":6008},"status":null}