{"data":{"kind":"file","path":"README.md","version_id":"b6ydalazki4ya9wdv45knn7v","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1736,"modified_at":"2025-11-02T22:17:46.668000","content_hash":"35f29486227330f8a71830c954ff92f0d631cc98d906c76fabd49575dab1ae12"},"entries":[],"content":"# PicoCTF\n\n### Overview\n- **Environment ID**: `picoctf`\n- **Short description**: Verifier for PicoCTF challenges.\n- **Tags**: cybersecurity, ctf, multi-turn, eval\n\n### Datasets\n- **Primary dataset(s)**: [kyleavery/picoctf](https://huggingface.co/datasets/kyleavery/picoctf)\n- **Source links**: Original dataset from [HackSynth](https://github.com/aielte-research/HackSynth/blob/main/picoctf_bench/benchmark_solved.json) based on [PicoCTF](https://picoctf.org/).\n- **Split sizes**: 120 CTF challenges\n\n### Task\n- **Type**: multi-turn\n- **Parser**: ThinkParser when `use_think=True`, else a basic `Parser` extracting the final boxed answer (`extract_boxed_answer`)\n- **Rubric overview**: Compares the final predicted flag against the gold flag for exact match.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval picoctf\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval picoctf -m gpt-4.1-mini -c 5 -n 20 -r 3 -t 1024 -T 0.7\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `use_think` | bool | `true` | Whether to parse `<think>` tags |\n| `max_turns` | int | `10` | Maximum number of turns |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | The final reward achieved by the model |\n| `total_tool_calls` | Total number of tool calls made by the model |\n| `bash_calls` | Number of times the `bash` tool was called |\n| `python_calls` | Number of times the `python` tool was called |\n| `get_hint_calls` | Number of times the `get_hint` tool was called |\n| `flag_exact_match_reward` | 1.0 if the final predicted flag exactly matches the gold flag, else 0.0 |\n| `total_turns` | Total number of turns taken in the episode |\n","encoding":"utf-8","truncated":false,"total_bytes":1736},"status":null}