{"data":{"kind":"file","path":"README.md","version_id":"au515bbtxz1aj35z71ea82y1","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7035,"modified_at":"2026-01-06T19:01:40.933000","content_hash":"1e4b0554746a4401179b10da225132e3eb8993d2fc73b79a5f104a188cb570d6"},"entries":[],"content":"# frozen-lake\n\n### Overview\n- **Environment ID**: `frozen-lake`\n- **Short description**: Multi-turn RL environment where language models navigate a randomly generated FrozenLake grid from start to goal while avoiding holes\n- **Tags**: rl, navigation, grid-world, multi-turn, spatial-reasoning\n\n### Datasets\n- **Primary dataset(s)**: Synthetically generated FrozenLake scenarios using Gymnasium's random map generator\n- **Source links**: [Gymnasium FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)\n- **Split sizes**: Configurable via `num_scenarios` (default: 16 scenarios)\n\n### Task\n- **Type**: Multi-turn\n- **Parser**: Custom regex-based extraction for `>>> MOVE: <direction>` format\n- **Rubric overview**:\n  - `success_reward` (1.0 weight): Binary reward for reaching goal\n  - `progress_reward` (0.3 weight): Normalized Manhattan distance to goal\n  - `step_penalty` (0.1 weight): Efficiency penalty based on step count\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval -s frozen-lake\n```\n\nConfigure model and sampling:\n\n```bash\n# Basic configuration\nuv run vf-eval -s frozen-lake -m gpt-4.1-mini -n 20 -r 3\n\n# Enable neighbor info and move memory\nuv run vf-eval -s frozen-lake -m gpt-4.1 -n 10 -r 5 \\\n  -a '{\"provide_neighbor_info\": true, \"use_memory\": 5}'\n\n# Larger grid with chess notation\nuv run vf-eval -s frozen-lake -m gpt-4.1 -n 10 -r 3 \\\n  -a '{\"grid_size\": 10, \"use_chess_notation\": true}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object\n- Increase rollouts (`-r`) to test model consistency on the same 
scenarios\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_scenarios` | int | `16` | Number of unique FrozenLake scenarios to generate |\n| `grid_size` | int | `4` | Size of the square grid (e.g., 4 creates a 4x4 grid) |\n| `max_steps` | int | `10 * grid_size` | Maximum steps allowed before truncation |\n| `seed` | int | `42` | Random seed for reproducible map generation |\n| `provide_neighbor_info` | bool | `false` | Whether to show the value of neighboring cells in each direction |\n| `use_memory` | int or bool | `None` | Number of recent moves to display (use `5` for default, `None` to disable). Boolean `true` converts to `5` |\n| `use_chess_notation` | bool | `false` | Use spreadsheet-style notation (A1, B2, etc.) instead of array indices |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of all reward functions (success + progress - penalty) |\n| `success_reward` | Binary indicator (1.0 if goal reached, 0.0 otherwise) |\n| `progress_reward` | Normalized progress toward goal based on Manhattan distance (0.0-1.0) |\n| `step_penalty` | Efficiency penalty proportional to steps taken (negative value) |\n\n### Game Rules\n\nThe agent navigates a grid-based frozen lake with the following cells:\n- **S** (Start): Initial position\n- **F** (Frozen): Safe cells to walk on\n- **H** (Hole): Hazardous cells - stepping on one ends the episode with failure\n- **G** (Goal): Target cell at the bottom-right corner\n\n**Objective**: Navigate from start to goal using only frozen tiles, avoiding all holes.\n\n**Actions**: Four cardinal directions (LEFT, DOWN, RIGHT, UP)\n\n**Response Format**: Models must end each response with:\n```\n>>> MOVE: <direction>\n```\nwhere `<direction>` is one of: LEFT, DOWN, RIGHT, UP\n\n### Visualization\n\nThe environment includes powerful visualization tools for analyzing agent behavior:\n\n#### Rendering Episode GIFs\n\nGenerate animated 
GIFs showing agent trajectories:\n\n```bash\n# Render specific episodes with custom grid layout\npython render_gif.py <path-to-results.jsonl> --episodes 0 1 2 --max-cols 5\n\n# Render all episodes\npython render_gif.py <path-to-results.jsonl>\n\n# Custom frame duration\npython render_gif.py <path-to-results.jsonl> --duration 800 --episodes 5\n```\n\n**Output**: Creates animated GIFs in `<results_dir>/gifs/` showing:\n- Move-by-move visualization with overlay information\n- Step number, action taken, rewards\n- Success/failure indicators\n- Multiple rollouts displayed in grid layout\n- Chess notation labels (if enabled)\n\n#### Plotting Results\n\nAnalyze aggregate statistics across evaluations:\n\n```bash\npython plot_results.py <path-to-results.jsonl>\n```\n\n**Output**: Generates plots showing success rates, step counts, and reward distributions.\n\n#### Additional Utilities\n\n- `visualize_board.py`: Display individual board configurations\n- `inspect_render.py`: Examine Gymnasium rendering output\n- `find_failed_rollouts.py`: Identify and analyze failure cases\n\n### Features\n\n**Spatial Reasoning**: Models must interpret 2D grid layouts and plan paths through obstacles\n\n**Memory Support**: Optional move history helps models maintain directional awareness and avoid revisiting positions\n\n**Chess Notation**: Spreadsheet-style cell labels (A1, B2, etc.) 
can improve spatial communication for some models\n\n**Neighbor Information**: Optional feature that explicitly lists adjacent cell values to reduce parsing errors\n\n**Retry Logic**: Automatic retry with exponential backoff for rate limits and server errors (5xx)\n\n**Format Validation**: Up to 3 retry attempts for invalid action formats with clear error messages\n\n**Reproducibility**: Seeded random map generation ensures consistent evaluation across runs\n\n### Dependencies\n\n- `verifiers>=0.1.2.post1` - Evaluation framework\n- `gymnasium>=0.29.0` - FrozenLake environment\n- `openai>=1.0.0` - LLM client\n- `tenacity>=8.0.0` - Retry logic\n- Additional visualization dependencies: `pillow`, `numpy` (included in verifiers)\n\n### Benchmark Results\n\nSelected results for configurations with all helper options enabled:\n\n| Model | Grid Size | Config | Success Rate | Avg Reward | Details |\n|-------|-----------|--------|--------------|------------|---------|\n| `moonshotai/kimi-k2-instruct-0905` | 10x10 | neighbor_info + memory(10) + chess_notation | **98%** | 1.278 | 10 examples × 10 rollouts |\n| `moonshotai/kimi-k2-instruct-0905` | 10x10 | neighbor_info + memory(5) + chess_notation | **97%** | 1.268 | 10 examples × 10 rollouts |\n\n**Key Findings**:\n- Enabling `provide_neighbor_info`, `use_chess_notation`, and `use_memory` dramatically improves success rates on larger grids\n- Memory size (5 vs 10) has minimal impact on performance for this model (97% vs 98% success)\n- Progress reward is consistently near its maximum (0.994-0.995), indicating efficient pathfinding\n- Near-perfect performance demonstrates that the environment is solvable given adequate spatial reasoning capabilities\n\n### Notes\n\n- The environment uses `is_slippery=False` for deterministic movement (a valid action always moves the agent in the chosen direction)\n- Map generation uses Gymnasium's `generate_random_map()`, which guarantees a solvable path from start to goal\n- Move history is saved to 
`info['move_history']` in results for post-hoc visualization\n- All trajectories are logged with board state, positions, actions, and rewards for detailed analysis\n","encoding":"utf-8","truncated":false,"total_bytes":7035},"status":null}