{"data":{"kind":"file","path":"README.md","version_id":"wwvd13harclyqkdlgalcrnst","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4145,"modified_at":"2025-11-29T02:04:44.100000","content_hash":"96fde1973bf03f6a1b916a61fca3155171872dff20ab606aaac6de54d7b328c2"},"entries":[],"content":"# cliff_walking\n\n### Overview\n- **Environment ID**: `cliff_walking`\n- **Short description**: Multi-turn Cliff Walking grid navigation game where an LLM agent must navigate from start to goal while avoiding the cliff.\n- **Tags**: games, multi-turn, navigation, reasoning, xml\n\n### Datasets\n- **Primary dataset(s)**: Self-generated episodes (no external dataset required)\n- **Source links**: Classic RL environment (OpenAI Gym / Gymnasium)\n- **Split sizes**: Number of episodes controlled via args\n\n### Task\n- **Type**: multi-turn (game interaction)\n- **Parser**: `XMLParser` with `action` field\n- **Rubric overview**: Goal completion reward, efficiency bonus, cliff penalty, and format check\n\n### Game Description\n\nThe agent navigates a 4x12 cliff walking grid:\n- **S**: Start position (bottom-left corner)\n- **G**: Goal (bottom-right corner - reach this to win)\n- **C**: Cliff (bottom row between S and G - AVOID!)\n- **o**: Safe cell (you can walk here safely)\n- **A**: Agent's current position\n\nThe agent can move in 4 directions: `UP`, `RIGHT`, `DOWN`, `LEFT`.\n\n**Rewards**:\n- Each step: **-1** (encourages finding the shortest path)\n- Falling off cliff: **-100** (and agent is sent back to start)\n- Reaching goal: **0** (episode ends)\n\n**The Challenge**: The shortest path runs along the cliff edge, one row above the cliff (13 steps), but it's risky: a single stray `DOWN` costs -100! 
A more cautious path goes up to the top row, across, and back down (17 steps), keeping the agent well away from the cliff.\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval cliff_walking\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval cliff_walking \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"num_train_examples\": 1000, \"num_eval_examples\": 20, \"max_steps\": 100}'\n```\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_train_examples` | int | `1000` | Number of training episodes |\n| `num_eval_examples` | int | `20` | Number of evaluation episodes |\n| `max_steps` | int | `100` | Maximum steps per episode |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `_goal_reward_func` | 1.0 if agent reaches the goal, else 0.0 |\n| `_efficiency_reward_func` | Bonus (up to 0.5) for reaching goal efficiently (optimal is 13 steps) |\n| `_cliff_penalty_func` | Penalty (-0.1 per fall, max -0.5) for falling off the cliff |\n| `format_reward` | Adherence to expected XML format (weight 0.1) |\n\n### Example Interaction\n\n**System prompt** instructs the agent on rules and format.\n\n**Initial state**:\n```\n=== Cliff Walking ===\nGrid (A=Agent, S=Start, G=Goal, C=Cliff, o=Safe):\no o o o o o o o o o o o\no o o o o o o o o o o o\no o o o o o o o o o o o\nA C C C C C C C C C C G\n\nPosition: Row 3, Column 0\nSteps taken: 0\nTotal reward: 0\n=====================\n```\n\n**Agent response**:\n```\nI'm at position (3, 0) which is the start. 
The goal is at (3, 11).\nGoing right would put me on the cliff with -100 penalty.\nThe safer approach is to go up first, then traverse right across the top.\n<action>UP</action>\n```\n\n**Environment response**:\n```\nAction: UP (Reward: -1)\n\n=== Cliff Walking ===\nGrid (A=Agent, S=Start, G=Goal, C=Cliff, o=Safe):\no o o o o o o o o o o o\no o o o o o o o o o o o\nA o o o o o o o o o o o\nS C C C C C C C C C C G\n\nPosition: Row 2, Column 0\nSteps taken: 1\nTotal reward: -1\n=====================\n\nChoose your next action: UP, RIGHT, DOWN, or LEFT\n```\n\n### Grid Layout\n\nThe grid is 4 rows × 12 columns:\n\n```\no  o  o  o  o  o  o  o  o  o  o  o    (row 0)\no  o  o  o  o  o  o  o  o  o  o  o    (row 1)\no  o  o  o  o  o  o  o  o  o  o  o    (row 2)\nS  C  C  C  C  C  C  C  C  C  C  G    (row 3)\n```\n\n- **Start (S)**: Position (3, 0) - bottom-left\n- **Goal (G)**: Position (3, 11) - bottom-right\n- **Cliff (C)**: Positions (3, 1) through (3, 10) - the dangerous zone!\n\n### Optimal Paths\n\n1. **Shortest Path (13 steps)**: UP → 11×RIGHT → DOWN (along row 2, directly above the cliff)\n   - Total reward: -13 if successful, but any stray DOWN = -100 penalty + restart\n\n2. **Cautious Path (17 steps)**: 3×UP → 11×RIGHT → 3×DOWN (along row 0)\n   - Total reward: -17, with two rows of margin from the cliff\n\nMoving RIGHT from the start lands on the cliff at (3, 1), so no 11-step route exists.\n\nThis environment is a classic example for comparing exploration strategies in reinforcement learning!\n","encoding":"utf-8","truncated":false,"total_bytes":4145},"status":null}