{"data":{"kind":"file","path":"README.md","version_id":"os3cifymhi3xhmqlu8g77g0g","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1821,"modified_at":"2026-01-26T00:55:17.401000","content_hash":"c58d8b22c7c84d0f9c4c561b1963c7f959d621e3e659c88a4f4e16b3f4815a69"},"entries":[],"content":"# web3-sec-bench\n\n### Overview\n\n- **Environment ID**: `web3-sec-bench`\n- **Short description**: Multi-turn environment for evaluating AI agents on smart contract security CTF challenges\n- **Tags**: security, web3, solidity, ctf, eval, multi-turn\n\n### Challenge Sources\n\n30 Solidity CTF challenges designed to be compatible with [Harbor](https://github.com/laude-institute/harbor). Full repo: [web3-sec-bench](https://github.com/0xToshii/web3-sec-bench).\n\n### Task Format\n\nEach task is a Foundry-based Solidity CTF challenge. The agent must:\n1. Analyze vulnerable contracts and test file\n2. Identify the vulnerability\n3. Write exploit code producing a `solution.txt` file\n\nExpected `solution.txt` format:\n```\n<imports>\nimport {Exploiter} from \"src/Exploiter.sol\";\n</imports>\n\n<solution>\nExploiter exploiter = new Exploiter();\nexploiter.attack(address(target));\n</solution>\n```\n\n### Quickstart\n\nRun an evaluation with default settings (downloads challenges automatically):\n\n```bash\nuv run vf-eval web3-sec-bench\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `challenges_dir` | str | `None` | Local path to challenges. If not provided, downloads from default repo. |\n| `max_turns` | int | `100` | Maximum agent turns per challenge |\n| `auto_build` | bool | `True` | Auto-build Docker images if not found |\n| `agent_timeout_sec` | float | `900.0` | Timeout in seconds (15 min default) |\n\n### Tools\n\nThe agent has access to:\n- `bash(command)` - Execute shell commands in the container\n- `file_edit(command, path, ...)` - View/create/edit files ('view', 'create', 'str_replace')\n- `submit()` - Signal task completion, triggers evaluation\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | 1.0 if challenge solved (tests pass), 0.0 otherwise |\n","encoding":"utf-8","truncated":false,"total_bytes":1821},"status":null}