{"data":{"kind":"file","path":"README.md","version_id":"svypawmftqktblwbllxh83bn","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4389,"modified_at":"2026-03-14T07:11:04.365000","content_hash":"9c0e4a5663061cd928a71d65f69c90733dfccd45714fb4eded062bc0932af107"},"entries":[],"content":"# tic-tac-toe\n\nA multi-turn tic-tac-toe environment where the model plays against a perfect (minimax) opponent.\n\n## How It Works\n\n### Game Setup\nEach game is randomized along two axes:\n- **Symbol assignment**: the model is randomly assigned X or O\n- **First mover**: randomly decided whether the model or the environment moves first\n\nThis means the model must learn to play well in all four configurations: X first, X second, O first, O second.\n\n### Board Format\nThe board is a numbered 3×3 grid. Empty cells show their number (1–9), occupied cells show X or O:\n```\n 1 | X | 3\n-----------\n O | 5 | 6\n-----------\n 7 | 8 | 9\n```\n\n### Turn Flow\n1. The environment sends the board state and prompts the model\n2. The model responds with a move (must include a cell number 1–9)\n3. The environment parses the move, updates the board, and plays its counter-move\n4. Repeat until win, loss, draw, or illegal move\n\n### System Prompt\nThe model receives a system message instructing it to always include a cell number in its response. It may reason freely about its move, but the number must be present.\n\n### Move Parsing\nThe parser extracts the model's chosen cell from its response using three strategies in priority order:\n1. **Keyword patterns** — matches phrases like \"play 5\", \"choose 5\", \"position 5\", \"take 5\", \"cell 5\", \"square 5\"\n2. **Standalone digit** — a bare digit (1–9) on its own line\n3. **First digit** — the first digit 1–9 found in the first 80 characters of the response\n\nIf no valid digit is found, or the chosen cell is already occupied, the move is ruled illegal and the game ends.\n\n### Environment Opponent\n- **Opening move** (when the environment goes first): randomly selected from cells 1–9 (all cells are equally optimal on an empty board)\n- **All subsequent moves**: selected by a minimax engine that plays perfectly. When multiple moves tie for the best score, one is chosen at random\n\nSince tic-tac-toe is a solved game, perfect play always results in a draw. The model cannot beat the minimax opponent — the best achievable outcome is a consistent draw.\n\n### Message Format\n\n**When the model goes first** (empty board):\n```\nLet's play tic-tac-toe. You are X.\n\n 1 | 2 | 3\n-----------\n 4 | 5 | 6\n-----------\n 7 | 8 | 9\n\nYou go first.\n```\n\n**When the environment goes first** (one move already placed):\n```\nLet's play tic-tac-toe. You are O.\n\nI play 5.\n\n 1 | 2 | 3\n-----------\n 4 | X | 6\n-----------\n 7 | 8 | 9\n\nYour turn.\n```\n\n**Environment responses during play** narrate the opponent's move and show the updated board:\n```\nI play 3.\n\n X | 2 | O\n-----------\n 4 | X | 6\n-----------\n 7 | 8 | 9\n\nYour turn.\n```\n\n**Terminal states** end with one of: `You win!`, `You lose!`, `Draw!`, or `Illegal move. Game over.`\n\n## Rubric\n\n| Outcome | Reward |\n|---------|--------|\n| Win | 1.0 |\n| Draw | 0.5 |\n| Loss | 0.0 |\n| Illegal move | 0.0 |\n\n## Dataset\n\n1,000 procedurally generated seeds. The actual game variety comes from the randomized setup (symbol + first mover + opponent randomness) and the model's own play. Effectively infinite replayability.\n\n## What to Expect During Training\n\n- **Early**: frequent illegal moves and quick losses → reward near 0\n- **Mid**: model learns legal moves and starts blocking threats → reward climbs\n- **Converged**: consistent draws against the perfect opponent → reward stabilizes around 0.5\n- **If reward exceeds 0.5**: something is wrong — you cannot beat minimax at tic-tac-toe\n\n## Baseline Results\n\n| Model | Games | Draws | Losses | Illegal | Errors | Avg Reward |\n|---|---|---|---|---|---|---|\n| Qwen3-30B-A3B Instruct | 20 | 3 | 17 | 0 | 0 | 0.075 |\n| Qwen3-8B (thinking) | 20 | 12 | 8 | 0 | 0 | 0.300 |\n| Trinity-mini (~3B) | 18 | 0 | 18 | 0 | 0 | 0.000 |\n\nNote: thinking models (Qwen3-8B) may produce empty `content` fields, causing parse failures. Instruct models are recommended.\n\n## Quickstart\n\n```bash\n# Install\nprime env install tars90percent/tic-tac-toe\n\n# Evaluate a model\nprime eval run tars90percent/tic-tac-toe -m qwen/qwen3-30b-a3b-instruct-2507 -n 20 -r 1\n\n# Train\nprime rl run configs/rl/tic-tac-toe.toml\n```\n\n## Example Training Config\n\n```toml\nmodel = \"Qwen/Qwen3-4B-Instruct-2507\"\nmax_steps = 100\nbatch_size = 128\nrollouts_per_example = 16\n\n[sampling]\nmax_tokens = 64\n\n[[env]]\nid = \"tars90percent/tic-tac-toe\"\n```\n\nShort `max_tokens` since moves are just a number with optional reasoning.\n","encoding":"utf-8","truncated":false,"total_bytes":4389},"status":null}