{"data":{"kind":"file","path":"README.md","version_id":"z4bgw8l97ptr9ysitmasumob","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4776,"modified_at":"2026-06-17T05:40:00.125000","content_hash":"1e2262f74582500bb02307fdf3374dc4142e0c5689d488d3edb17f6ab0b43ea9"},"entries":[],"content":"# contractbench\n\n### Overview\n- **Environment ID**: `contractbench`\n- **Short description**: Runs ContractBench contract-reliability tasks through Harbor in Prime Intellect sandboxes, scored by a deterministic per-task contract verifier with a paper-defined severity-graded reward.\n- **Tags**: harbor, contractbench, cli_agent, reliability\n\n### Provenance\n- **Tasks**: [SecurityLab-UCD/ContractBench](https://github.com/SecurityLab-UCD/ContractBench) — `harbor/cookbook-export/<task>/` (33 Harbor-native tasks, `task.toml` schema 1.3).\n- **Paper**: ContractBench (arXiv:2605.17281).\n- **Harbor**: tasks run under [laude-institute/harbor](https://github.com/laude-institute/harbor) (task config schema 1.3); the Harbor Terminus agent loop is reused, identical to the `terminus_harbor` environment.\n\n### Datasets\n- **Primary dataset(s)**: ContractBench cookbook-export tasks bundled under `tasks/`.\n- **Bundled subset**: `scheduled-maintenance`, `csrf-form-submit` (for verification). The environment is designed so all 33 tasks work — drop them into `tasks/` (or point `dataset_path` at a full checkout) and they load unchanged.\n\n### Task\n- **Type**: multiturn, cli_agent\n- Each task ships a FastAPI \"contract server\" (built from `environment/Dockerfile`) that enforces a temporal/integrity contract (token validity windows, byte-exact tokens, CSRF tokens, rate limits, etc.) and logs a per-request **failure label**. The agent must satisfy the contract and write the required output file; a pytest verifier (`tests/test.sh`) checks the result.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run contractbench\n```\n\nOr with the verifiers CLI:\n\n```bash\nvf-eval contractbench -m openai/gpt-4.1-mini -n 2 -r 3\n```\n\nRun a specific subset / model:\n\n```bash\nprime eval run contractbench \\\n  -m openai/gpt-4.1-mini -n 2 -r 3 \\\n  -a '{\"tasks\": [\"scheduled-maintenance\", \"csrf-form-submit\"]}'\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `tasks` | list[str] \\| null | `null` | Allowlist of task names; `null` runs all bundled tasks |\n| `dataset_path` | str | `tasks/` | Directory of Harbor tasks |\n| `docker_image` | str | `python:3.11-slim` | Base sandbox image; the contract server is layered on top |\n| `agent_name` | str | `terminus-2` | Harbor agent |\n| `model_name` | str | `openai/gpt-4.1-mini` | Model identifier |\n| `server_port` | int | `8080` | Port the ContractBench FastAPI server listens on |\n| `severity_reward` | bool | `true` | Map failure labels to the paper severity table; `false` returns raw binary reward |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward: binary verifier pass (`+1.0`) or, on failure, the severity-graded value below |\n\n## How It Works\n1. Creates a Prime sandbox per task from `docker_image`.\n2. In `post_sandbox_setup`, replicates the task's `environment/Dockerfile`: installs `curl`/`git`/`uv` (Harbor agent toolchain) and `fastapi`/`uvicorn`/`python-multipart`, uploads `server.py` + `shared/`, and launches `uvicorn server:app` on `:8080`. The contract server is healthchecked before the agent runs.\n3. Runs the Harbor Terminus agent loop on the task `instruction.md`, intercepting model API calls through the Prime tunnel.\n4. After the agent finishes, uploads `solution/` + `tests/` and runs `tests/test.sh` to compute the reward.\n\n## Reward\nThe contract verifier is fully deterministic.\n\n- **Base (binary)**: `tests/test.sh` writes `/logs/verifier/reward.txt` (`1.0` pass / `0.0` fail) and a rich `/logs/verifier/reward.diagnostics.json` carrying the per-episode failure-label taxonomy (`failure_labels`, `server_log_summary`).\n- **Severity layer** (`severity_reward=true`): on a non-pass, the dominant (most severe) failure label is read from the diagnostics file and mapped to the paper severity table:\n\n| Label | Reward |\n| ----- | ------ |\n| `SUCCESS` | `+1.0` |\n| `EXPIRED_BEFORE_USE` | `-1.0` |\n| `MUTATED_TOKEN` | `-1.0` |\n| `SHORTCUT_TAKEN` | `-0.9` |\n| `WRONG_VALUE` | `-0.8` |\n| `MISSING_CONSTRAINT` | `-0.7` |\n| `RATE_LIMITED` | `-0.5` |\n| `OTHER` (any other / unlabeled failure) | `-0.3` |\n\n## Requirements\n- **Prime Intellect sandbox infrastructure** (`prime-sandboxes`) — `HarborEnv` creates remote sandboxes and routes model calls through the Prime tunnel, so a Prime account / API key is required to run a real rollout (same as `terminus_harbor`).\n- Tasks need outbound network during sandbox setup (`apt`, `pip`, `uv`, Harbor git install).\n\n## Publishing\n```bash\nprime env push   # from environments/contractbench/, pushes to the Environments Hub\n```\nAfter push it appears on the Prime Intellect Environments Hub as `contractbench` and is runnable via `prime eval run contractbench`.\n","encoding":"utf-8","truncated":false,"total_bytes":4776},"status":null}