{"data":{"kind":"file","path":"README.md","version_id":"q7x9wgiytjskrza4n7hpz2e6","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1479,"modified_at":"2025-11-23T20:04:40.683000","content_hash":"fa51cd919ce3dec8bdb5af65131cecd7e89c50b976658a0d634f12521b147fd6"},"entries":[],"content":"# OverTheWire\n\n### Overview\n- **Environment ID**: `overthewire`\n- **Short description**: Verifier for OverTheWire via SSH.\n- **Tags**: cybersecurity, ctf, multi-turn, eval\n\n### Datasets\n- **Primary dataset(s)**: [kyleavery/overthewire](https://huggingface.co/datasets/kyleavery/overthewire)\n- **Source links**: Original dataset from [HackSynth](https://github.com/aielte-research/HackSynth) based on [OverTheWire](https://overthewire.org/wargames/).\n- **Split sizes**: 46 challenges\n\n### Task\n- **Type**: multi-turn\n- **Parser**: ThinkParser when `use_think=True`, else a basic `Parser`.\n- **Rubric overview**: Checks if the gold password is contained within the model's parsed answer.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval overthewire\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval overthewire -m gpt-4.1-mini -c 5 -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `use_think` | bool | `false` | Whether to parse `<think>` tags |\n| `max_turns` | int | `10` | Maximum number of turns |\n| `game_filter` | str | `null` | Game to filter for (e.g., \"bandit\"). If null, loads all. |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | The final reward achieved by the model |\n| `bash_calls` | Number of times the `bash` tool was called |\n| `exact_match_reward` | 1.0 if the predicted answer contains the gold flag, else 0.0 |\n","encoding":"utf-8","truncated":false,"total_bytes":1479},"status":null}