{"data":{"kind":"file","path":"README.md","version_id":"v2rheicoebmwkugimd3hk284","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4886,"modified_at":"2025-11-07T08:39:00.207000","content_hash":"628ae1cde3547bfd02ae1fc0cfcfd30b414f054a03e9b445d7fbbd16110ee4f9"},"entries":[],"content":"# cuber-rl\n\n### Overview\n- **Environment ID**: `cuber-rl`\n- **Short description**: Multi-turn Rubik's cube solving environment with progressive reward shaping\n- **Tags**: rubiks-cube, multi-turn, puzzle-solving, reinforcement-learning\n\n### Datasets\n- **Primary dataset(s)**: Procedurally generated scrambled cubes with configurable difficulty\n- **Source links**: N/A (synthetic generation via `magiccube` and random scrambling)\n- **Split sizes**: 1000 episodes (default), fully procedural so can be extended arbitrarily\n\n### Task\n- **Type**: Multi-turn\n- **Parser**: XML tag parser (extracts moves from `<move>...</move>` tags)\n- **Rubric overview**: \n  - Solving reward: 1.0 for solving + efficiency bonus $\\min(1.0, \\frac{d}{t})$ where $d$ is initial distance and $t$ is turns used\n  - Progress reward: $\\frac{\\max(0, d - d')}{d}$ where $d'$ is distance after moves\n  - Format penalty: 0.0 for invalid responses\n\nThe agent receives a scrambled Rubik's cube and must provide sequences of moves in Singmaster notation (U, D, L, R, F, B with optional ', 2 modifiers) to solve it. Each turn allows up to `max_moves_per_turn` moves. 
Rewards are given for progress toward the solved state, with a bonus for efficiency when the cube is solved.\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval cuber-rl\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval cuber-rl \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"scramble_ranges\": [[4, 8], [9, 14]], \"max_moves_per_turn\": 3, \"max_episode_steps\": 20}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- Easier scrambles (1-3 moves) are good for initial testing; harder scrambles (9-14 moves) require more strategic solving.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `scramble_ranges` | List[Tuple[int, int]] | `[[4, 8], [9, 14]]` | List of (min, max) move ranges for scrambling. One range is randomly selected per episode. |\n| `max_moves_per_turn` | int | `3` | Maximum number of moves the agent can execute per turn. |\n| `max_episode_steps` | int | `20` | Maximum number of moves before the episode terminates (the actual maximum number of turns is `ceil(max_episode_steps / max_moves_per_turn)`). |\n\n### Mechanics\n\n**State Representation**: Cubes are displayed as unfolded nets showing all 6 faces (U=Top/White, L=Left/Orange, F=Front/Green, R=Right/Red, B=Back/Blue, D=Bottom/Yellow).\n\n**Move Notation**: Standard Singmaster notation where single letters rotate clockwise 90°, apostrophe (') rotates counterclockwise 90°, and 2 rotates 180°.\n\n**Response Format**: Agent must wrap moves in XML tags: `<move>U R' F2 D</move>`. 
Empty tags `<move></move>` indicate no moves (useful when the cube is already solved).\n\n**Reward Structure**:\n- **Solving**: 1.0 base + efficiency bonus of $\\min(1.0, \\frac{\\text{initial\\_distance}}{\\text{turns\\_used}})$ \n- **Progress**: $\\frac{\\text{distance\\_reduced}}{\\text{initial\\_distance}}$ for each turn (only positive progress counts)\n- **Invalid format**: 0.0 reward for that turn\n\n**Distance Metric**: Uses Kociemba's algorithm to compute optimal move count to the solved state (cached for efficiency).\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Per-turn reward (progress or solving bonus) |\n| `total_reward` | Cumulative reward across entire episode |\n| `initial_dist` | Optimal moves needed from scrambled state |\n| `final_distance` | Moves remaining to solve at episode end |\n| `solved` | Boolean indicating if cube was solved |\n\n### Example Interaction\n\n```\nInitial State (5 moves from solved):\n        W W W\n        W R W\n        W W W\n        \nO O G   G G G   R O R   B B B\nO O O   G W G   R R R   B Y B\nO O O   G G G   R R R   B B B\n\n        Y Y Y\n        Y Y Y\n        Y Y O\n\nAgent: <move>U R' F</move>\nReward: 0.4 | Distance: 3\n```\n\n---\n\n### Evaluation Reports\n\n#### Model Performance (Pass@5, 10 puzzle episodes, Difficulty = 1 Move from Solved, No format reward)\n\n| Model | Avg Reward / 2.0 | Solves / 50 | Equiv. 
% |\n|-------|------------------|-------------|----------|\n| GPT-5 | 1.76 | 44 | 88% |\n| Claude Sonnet 4.5 | 0.60 | 15 | 30% |\n| Claude Opus 4 | 0.36 | 9 | 18% |\n| Gemini 2.5 Flash | 0.20 | 5 | 10% |\n| Kimi k2 | 0.04 | 1 | 2% |\n| Qwen-235B | 0.00 | 0 | 0% |\n\n#### Performance by Scramble Difficulty (No Format Reward)\n\n| Moves from Solved | GPT-5-nano | GPT-5-mini | GPT-5 |\n|-------------------|------------|------------|-------|\n| 1 | 0.920 | 1.120 | 1.200 |\n| 2 | 0.020 | 0.120 | 1.220 |\n| 3 | 0.033 | 0.000 | 1.125 |\n| 4 | 0.010 | 0.000 | 0.820 |\n| 5 | 0.000 | 0.000 | 0.500 |\n\nGPT-5 stays above 1.1 through 3-move scrambles and then declines to 0.820 at 4 moves and 0.500 at 5 moves, while GPT-5-mini and GPT-5-nano drop to near zero beyond 1-move scrambles, suggesting limited spatial reasoning capabilities for multi-step cube solving.","encoding":"utf-8","truncated":false,"total_bytes":4886},"status":null}