{"data":{"kind":"file","path":"README.md","version_id":"uo55ijqcww9bx3m486o0t0kz","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":5026,"modified_at":"2026-06-14T12:16:58.027000","content_hash":"8fb44b204866b221024575ca5497b361eaee71f0ed56f152751a100f62e7c2cf"},"entries":[],"content":"# sre-incident-bench\n\nA long-horizon SRE incident-response environment for\n[`verifiers`](https://github.com/PrimeIntellect-ai/verifiers). The agent is paged as\non-call SRE to a broken multi-service stack and must diagnose and fix the incident using\nshell tools. Rewards are programmatic, verifiable, and hack-resistant.\n\n```bash\npip install -e .\nvf-eval sre-incident-bench -m <model> --provider <provider>\n```\n\n## What a rollout looks like\n\n```\nSystem: You are an on-call SRE with shell access to an ops jumpbox...\nUser:   On-call page: checkout started failing under load after last night's deploy...\nAgent:  [bash] curl -s http://api:8000/healthz           -> {\"status\":\"ok\",\"orders\":0}\nAgent:  [bash] curl -s -o /dev/null -w '%{http_code}' -X POST http://api:8000/checkout\n                                                          -> 503\nAgent:  [read_file] /ops/app/config.yaml                  -> db_pool_max: 2 ...\nAgent:  [write_file] /ops/app/config.yaml                 -> db_pool_max: 20\nAgent:  [restart_service] api                             -> healthy=True\nAgent:  [bash] for i in $(seq 10); do curl ... done       -> 200 200 200 200 ...\nAgent:  DONE. Root cause: db_pool_max was set to 2...\n\nGrader: checkout_200=True pool_restored=True cert_valid=True\n        tests_not_deleted=True health_not_faked=True\n        Score: 1.0\n```\n\n## Stack\n\nA sandboxed 4-service stack, hermetic and parallelizable:\n\n| Service  | Role                     | How the agent reaches it                |\n|----------|--------------------------|-----------------------------------------|\n| **api**  | FastAPI app (healthz, checkout) | `curl http://api:8000/...`         |\n| **db**   | Postgres 16              | via the api's connection pool            |\n| **nginx**| TLS reverse proxy        | `curl https://nginx/...`, `openssl ...`  |\n| **worker**| Background job processor | `service_status`, `service_logs worker`  |\n\nEach rollout boots a fresh copy (~6 s). No state leaks between rollouts.\n\n## Tools\n\n| Tool | Purpose |\n|------|---------|\n| `bash(cmd)` | Shell on the ops jumpbox: curl, openssl, psql, sed, grep. No docker CLI. |\n| `read_file(path)` / `write_file(path, content)` | Inspect and edit bind-mounted service files under `/ops`. |\n| `restart_service(service)` | Restart a service to pick up a config/code change. |\n| `service_logs(service)` | Tail recent logs for a service. |\n| `service_status()` | Health and status of all services. |\n\nThe agent's jumpbox has **no docker socket** and cannot see other rollouts or the grader.\nService control is host-mediated; the grader observes from a fresh, agent-untouched\ncontainer — never through the agent's box.\n\n## Reward\n\nMilestone-weighted, in `[0, 1]`. Each root cause contributes a milestone, earned only when:\n- the **behavioral check** passes (e.g. `POST /checkout` returns 200 under concurrent load), AND\n- the **root-cause config** is restored (not just the symptom masked).\n\nAny **trajectory invariant violation** zeroes the entire score:\n- `tests_not_deleted` — hash-based integrity check on protected source files.\n- `endpoint_not_stubbed` — an active differential probe that stops the DB and re-probes\n  the API from a fresh container. A real endpoint flips to unhealthy; a hardcoded stub\n  stays 200. This defeats stubbing in a way no static scan can.\n\nResolve the incident legitimately → `1.0`. Game the grader → `0.0`.\n\n## Difficulty\n\nInstances are difficulty-laddered by fault count and composition:\n\n| Band   | Faults | Description |\n|--------|--------|-------------|\n| easy   | 1      | Single root cause with a clear symptom. |\n| medium | 2      | Two independent root causes; both must be fixed. |\n| hard   | 1–4    | Diagnosis-hard (DB runtime investigation, not config-file-readable) or composed faults with a masking chain (upstream must be fixed first). |\n\nThis public sample ships 15 curated instances across all three bands (4 easy, 4 medium,\n6 hard), plus a healthy-stack harness anchor that boots clean and grades `1.0`.\n\n**Calibration (June 2026).** Config-file faults are solved reliably by frontier models.\nThe hard-diagnosis tier (DB runtime investigation) is a different class — one current\nfrontier model (DeepSeek V3.2) correctly diagnosed the root cause but chose to mask the\nsymptom instead of fixing it, and the grader caught the masking (score 0.0).\n\n| Model | Config task (2 faults) | Hard task (DB runtime) |\n|---|---|---|\n| Gemini 3.5 Flash | 1.0 (33 steps) | 1.0 (42 steps) |\n| DeepSeek V3.2 | 1.0 (35 steps) | **0.0** (25 steps) — found cause, masked symptom |\n| Kimi K2.5 | 1.0 (18 steps) | 1.0 (25 steps) |\n\nThe procedural generator (not shipped) produces unlimited additional instances with\nrandomized parameters.\n\n## Prerequisites\n\n- A Docker daemon (e.g. `colima start` on macOS, native Docker on Linux).\n- Base images: `lhrl/web_api_pg-app:base` and `lhrl/ops:base`. On a fresh machine, build\n  them once with the development repo's `scripts/build_base.py web_api_pg`.\n- Python 3.10+, `verifiers >= 0.1.14`.\n","encoding":"utf-8","truncated":false,"total_bytes":5026},"status":null}