{"data":{"kind":"file","path":"README.md","version_id":"ame5rhyti9yhczaqkipc79b4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6417,"modified_at":"2025-09-08T23:29:47.447000","content_hash":"63a0a093ce57bc34cefb19c2f868789cd0b64f9070d32c48b548f3de58a41dfe"},"entries":[],"content":"# sudoku\n\n### Overview\n- **Environment ID**: `sudoku`\n- **Short description**: Multi-turn Sudoku puzzle solving environment with algorithmic generation and difficulty levels\n- **Tags**: multi-turn, reasoning, constraints, logic, train, eval, think\n\n### Datasets\n- **Primary dataset(s)**: Algorithmically generated Sudoku puzzles with configurable difficulty\n- **Source links**: Dynamic puzzle generation with unique solution validation\n- **Split sizes**: Configurable via `max_examples` parameter (default: 100 puzzles)\n\n### Task\n- **Type**: multi-turn (interactive solving)\n- **Parser**: `ThinkParser` (with thinking) or `Parser` (direct) with move extraction\n- **Rubric overview**: Progressive rewards for valid moves, completion bonuses, and efficiency scoring\n\n### Game Rules\nThe Sudoku environment presents models with:\n- **9x9 Grid**: Standard Sudoku board with algorithmically generated clues\n- **Difficulty Levels**: Easy (36-46 clues), Medium (28-35 clues), Hard (22-27 clues)\n- **Constraints**: Each row, column, and 3x3 box must contain digits 1-9 exactly once\n- **Single Move**: Make one move per turn in format `<move>A1=5</move>`\n- **Feedback**: Receive validation and board state after each move\n- **Goal**: Complete the puzzle with all constraints satisfied\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval sudoku\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval sudoku \\\n  -m gpt-4o-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"max_examples\": 50, \"use_think\": true, \"max_turns\": 30}'\n```\n\nTest different configurations:\n\n```bash\n# Direct reasoning (no <think> tags)\nuv run 
vf-eval sudoku -a '{\"use_think\": false}'\n\n# Medium difficulty puzzles\nuv run vf-eval sudoku -a '{\"difficulty\": \"medium\"}'\n\n# Hard difficulty with more turns\nuv run vf-eval sudoku -a '{\"difficulty\": \"hard\", \"max_turns\": 75}'\n\n# Different random seed\nuv run vf-eval sudoku -a '{\"seed\": 123}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_examples` | int | `100` | Number of Sudoku puzzles to include in dataset |\n| `use_think` | bool | `True` | Use thinking mode with `<think>` tags (`ThinkParser`) or direct reasoning (`Parser`) |\n| `max_turns` | int | `50` | Maximum number of moves allowed per puzzle |\n| `seed` | int | `42` | Random seed for reproducible puzzle selection |\n| `difficulty` | str | `\"easy\"` | Puzzle difficulty level: \"easy\" (36-46 clues), \"medium\" (28-35 clues), \"hard\" (22-27 clues) |\n\n### Scoring System\n\n**Progressive Reward System (Cumulative)**\n\nPer-move rewards:\n- **+0.1 points**: Each valid move\n- **-0.1 points**: Backtracking moves (clearing cells with 0)\n- **-0.2 points**: Invalid moves (constraint violations)\n\nFinal rewards:\n- **+5.0 points**: Successfully solving the puzzle\n- **+1.0 points**: Efficiency bonus based on turns used vs. max turns\n- **+1.0 points**: Completing puzzle (even if invalid)\n- **+2.0 points**: Partial completion based on percentage filled\n- **-0.1 points**: Penalty per constraint violation\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `sudoku_reward` | Cumulative reward based on moves and completion |\n\n### Example Interaction\n\n**Initial Puzzle:**\n```\nSudoku Puzzle:\n\n  1 2 3 | 4 5 6 | 7 8 9\nA 5 3 . | . 7 . | . . .\nB 6 . . | 1 9 5 | . . .\nC . 9 8 | . . . | . 6 .\n  ------+-------+------\nD 8 . . | . 6 . | . . 3\nE 4 . . | 8 . 3 | . . 1\nF 7 . . | . 2 . | . . 6\n  ------+-------+------\nG . 6 . | . . . | 2 8 .\nH . . . | 4 1 9 | . . 5\nI . . . | . 8 . | . 
7 9\n\nRules:\n- Fill each empty cell (.) with a number from 1-9\n- Each row must contain all numbers 1-9 exactly once\n- Each column must contain all numbers 1-9 exactly once\n- Each 3x3 box must contain all numbers 1-9 exactly once\n\nMake one move at a time using the format: <move>A1=5</move>\n```\n\n**Expected Response Format:**\n```\n<think>\nLooking at row A, column 3 (A3).\nRow A already has: 5, 3, 7\nRow A is missing: 1, 2, 4, 6, 8, 9\n\nColumn 3 already has: 8\nColumn 3 is missing: 1, 2, 3, 4, 5, 6, 7, 9\n\nBox 1 (top-left) already has: 5, 3, 6, 9, 8\nBox 1 is missing: 1, 2, 4, 7\n\nFor A3, the possible values are the intersection: 1, 2, 4\nNeed to check other constraints to narrow it down...\nLooking at other cells, A3 can only be 4.\n</think>\n\n<move>A3=4</move>\n```\n\n**Environment Response:**\n```\n✅ Valid move! 50 cells remaining.\n\nCurrent Board:\n  1 2 3 | 4 5 6 | 7 8 9\nA 5 3 4 | . 7 . | . . .\nB 6 . . | 1 9 5 | . . .\nC . 9 8 | . . . | . 6 .\n  ------+-------+------\nD 8 . . | . 6 . | . . 3\nE 4 . . | 8 . 3 | . . 1\nF 7 . . | . 2 . | . . 6\n  ------+-------+------\nG . 6 . | . . . | 2 8 .\nH . . . | 4 1 9 | . . 5\nI . . . | . 8 . | . 7 9\n\nProgress: 31/81 cells filled (38.3%)\nTurn: 1/50\n\nWhat's your next move?\n```\n\n### Move Format\n\n**Required Format**: `<move>RowColumn=Number</move>`\n\n- **Row**: A-I (A=top row, I=bottom row)\n- **Column**: 1-9 (1=leftmost, 9=rightmost)\n- **Number**: 1-9 (digit to place), or 0 to clear the cell\n\n**Valid Examples**:\n- `<move>A1=5</move>` - Place 5 in top-left corner\n- `<move>E5=7</move>` - Place 7 in center cell\n- `<move>I9=3</move>` - Place 3 in bottom-right corner\n- `<move>A1=0</move>` - Clear cell A1 (backtrack)\n\n**Invalid Examples**:\n- `<move>A1 = 5</move>` - Extra spaces\n- `<move>A10=5</move>` - Column 10 doesn't exist\n- `<move>J1=5</move>` - Row J doesn't exist\n\n### Game Flow\n\n1. **Initialization**: Present puzzle with clues and empty cells\n2. 
**Turn Loop**:\n   - Model analyzes board and makes one move\n   - Environment validates move and updates board\n   - Environment provides feedback and updated board state\n   - Continue until the puzzle is solved, `max_turns` is reached, or an error occurs\n3. **Completion**: Calculate final reward based on success and efficiency\n\n### Features\n\n- **Algorithmic Generation**: Dynamic puzzle creation with unique solution validation\n- **Multiple Difficulty Levels**: Easy (36-46 clues), medium (28-35 clues), and hard (22-27 clues) puzzles\n- **Backtracking Support**: Models can clear cells and try different approaches\n- **Penalty-Based Learning**: Backtracking incurs a small penalty (-0.1 per backtracking move) to encourage efficiency\n- **Constraint Validation**: Real-time checking of Sudoku rules\n- **Progressive Feedback**: Detailed messages for valid and invalid moves, including backtracks\n- **Scalable Dataset**: Generate unlimited unique puzzles for training\n- **Efficiency Scoring**: Rewards for solving in fewer moves\n- **Error Recovery**: Continues game after invalid moves with feedback","encoding":"utf-8","truncated":false,"total_bytes":6417},"status":null}