{"data":{"kind":"file","path":"README.md","version_id":"lflqp1v18tfzbn4b22sxvo9t","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":1977,"modified_at":"2026-04-05T10:03:53.122000","content_hash":"0e4bf74de657acbfa6df49a91ea0e55e55106b3a958f4740911d174b0bcfc39c"},"entries":[],"content":"# tictactoe\n\n### Overview\n\n**For a detailed walkthrough of this environment and how to use it for training and evaluation, check out the [LLM RL Environments Lil Course](https://github.com/anakin87/llm-rl-environments-lil-course)**\n\n- **Environment ID**: `tictactoe`\n- **Short description**: Multi-turn Tic-Tac-Toe. The LLM plays as X against an opponent with configurable probability of moving randomly.\n- **Tags**: games, train, eval, tictactoe\n\n### Datasets\n- **Primary dataset(s)**: Procedurally generated game episodes\n\n### Task\n- **Type**: multi-turn\n- **Parser**: XMLParser\n- **Rubric overview**: Win/draw/loss reward, format compliance, invalid move penalty\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nprime eval run tictactoe\n```\n\nConfigure model and sampling:\n\n```bash\nprime eval run tictactoe   -m gpt-4.1-mini   -n 20 -r 3 -t 1024 -T 0.7   -a '{\"min_random_move_prob\": 0.0, \"max_random_move_prob\": 0.2}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `1000` | Number of game episodes to generate |\n| `min_random_move_prob` | float | `0.0` | Minimum probability opponent moves randomly (vs optimal minimax) |\n| `max_random_move_prob` | float | `1.0` | Maximum probability opponent moves randomly (vs optimal minimax) |\n| `max_turns` | int | `8` | Maximum number of turns before forced termination |\n| `num_groups` | int | `1` | Number of stratified difficulty buckets. For stable training, set this equal to the number of unique prompts per batch (`batch_size // group_size`). |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `win_reward_func` | 1.0 for win, 0.5 for draw, 0.0 for loss/timeout |\n| `format_reward_func` | Parser-driven format compliance (weight: 0.2) |\n| `invalid_move_penalty_func` | -0.1 flat penalty if any invalid move occurred |\n","encoding":"utf-8","truncated":false,"total_bytes":1977},"status":null}