{"data":{"kind":"file","path":"README.md","version_id":"a51o77ci0wlref9ngrknb9i9","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":7318,"modified_at":"2025-12-04T08:05:21.450000","content_hash":"92b09c7b58c9e9f1bccb819582c5cd5fb5088c6fe9473e70cc7419287670a0ed"},"entries":[],"content":"# minigrid_adapted Adapter\n\n### Overview\n- **Environment ID**: `minigrid_adapted`\n- **Short description**: Multi-turn, multi-action verifier adapter for Farama MiniGrid with coordinate-based observations and game-specific milestone shaping rewards for GRPO training.\n- **Tags**: `rl`, `minigrid`, `gymnasium`, `verifiers`, `grpo`, `multi-action`\n\nThis environment wraps Farama [MiniGrid](https://minigrid.farama.org/) environments as verifiers-compatible tasks for [Prime RL](https://github.com/PrimeIntellect-ai/prime-rl). Key features:\n\n- **Multi-action generation**: Each LLM turn can specify up to 20 actions executed sequentially, enabling better temporal abstraction and credit assignment\n- **Coordinate-based observations**: Global (x, y) coordinates with origin at bottom-left, X→right, Y→up\n- **Game-specific milestone rewards**: Tailored shaping for LockedRoom, ObstructedMaze, and LavaGap environments\n- **Curriculum support**: Progressive difficulty ramping across environments\n\n### Datasets\n- **Primary dataset**: Synthetic prompts sampled from MiniGrid resets across supported environments\n- **Default environments**: `MiniGrid-ObstructedMaze-Full-v1`, `MiniGrid-LavaGapS7-v0`, `MiniGrid-LockedRoom-v0`\n- **Source links**: [MiniGrid Docs](https://minigrid.farama.org/)\n- **Split sizes**: Configurable via `num_examples` (default 12)\n\n### Task\n- **Type**: Multi-turn control loop with multi-action generation\n- **Parser**: `vf.XMLParser(fields=[\"think\", \"actions\"], answer_field=\"actions\")`\n- **Rubric overview**:\n  - `minigrid_native_reward_func` (weight 7.5): Native MiniGrid reward (0.1–1.0 for wins, efficiency-based)\n  - `minigrid_milestone_reward_func` (weight 0.5, optional): Game-specific shaping rewards\n  - Parser format reward (weight 0.2): Enforces `<THINK>`/`<ACTIONS>` XML structure\n\n### Multi-Action Format\n\nThe agent outputs multiple actions per turn in an `<ACTIONS>` block:\n\n```xml\n<THINK>Plan the sequence of actions</THINK>\n<ACTIONS>\nrotate(clockwise, 2)\nforward\nforward\npickup\ntoggle\n</ACTIONS>\n```\n\n**Available actions**:\n- `rotate(clockwise, N)` / `rotate(anticlockwise, N)`: Rotate N times (90° each)\n- `forward`: Move one tile in facing direction\n- `pickup`: Pick up object in front tile\n- `drop`: Drop held object in front tile\n- `toggle`: Open/close doors or boxes\n\n### Supported Environments\n\n| Environment | Objective | Milestone Rewards (max) |\n| ----------- | --------- | ----------------------- |\n| `MiniGrid-LockedRoom-v0` | Get specified key, unlock specified door, reach goal | 1.6 (mission-aware) |\n| `MiniGrid-ObstructedMaze-Full-v1` | Find and pick up blue ball in maze | 2.5 (box/key/door/ball) |\n| `MiniGrid-LavaGapS5-v0` | Navigate through lava gap to goal | Distance shaping + gap bonus |\n| `MiniGrid-LavaGapS6-v0` | Navigate through lava gap to goal | Distance shaping + gap bonus |\n| `MiniGrid-LavaGapS7-v0` | Navigate through lava gap to goal | Distance shaping + gap bonus |\n\n### Quickstart\n\n**Run with defaults** (all three default environments):\n\n```bash\nuv run vf-eval minigrid_adapted\n```\n\n**Run a single environment**:\n\n```bash\n# LavaGap only (easiest)\nuv run vf-eval minigrid_adapted -a '{\"env_ids\": [\"MiniGrid-LavaGapS7-v0\"]}'\n\n# LockedRoom only\nuv run vf-eval minigrid_adapted -a '{\"env_ids\": [\"MiniGrid-LockedRoom-v0\"]}'\n\n# ObstructedMaze only (hardest)\nuv run vf-eval minigrid_adapted -a '{\"env_ids\": [\"MiniGrid-ObstructedMaze-Full-v1\"]}'\n```\n\n**Mix and match environments**:\n\n```bash\n# Just the two simpler environments\nuv run vf-eval minigrid_adapted \\\n  -a '{\"env_ids\": [\"MiniGrid-LavaGapS7-v0\", \"MiniGrid-LockedRoom-v0\"]}'\n\n# All LavaGap sizes\nuv run vf-eval minigrid_adapted \\\n  -a '{\"env_ids\": [\"MiniGrid-LavaGapS5-v0\", \"MiniGrid-LavaGapS6-v0\", \"MiniGrid-LavaGapS7-v0\"]}'\n\n# Custom combination with milestone rewards\nuv run vf-eval minigrid_adapted \\\n  -m gpt-4.1-mini \\\n  -n 32 -r 3 \\\n  -a '{\n    \"env_ids\": [\"MiniGrid-LavaGapS5-v0\", \"MiniGrid-LockedRoom-v0\"],\n    \"use_milestone_rewards\": true\n  }'\n```\n\n**Progressive curriculum with LavaGap** (trains on easier sizes first, gradually adds harder):\n\n```bash\n# Start with S5 (5x5), progressively add S6, then S7 (7x7)\nuv run vf-eval minigrid_adapted \\\n  -m gpt-4.1-mini \\\n  -n 50 -r 5 -t 2048 -T 0.7 \\\n  -a '{\n    \"env_ids\": [\"MiniGrid-LavaGapS5-v0\", \"MiniGrid-LavaGapS6-v0\", \"MiniGrid-LavaGapS7-v0\"],\n    \"curriculum_mode\": \"progressive\",\n    \"curriculum_warmup_episodes\": 100,\n    \"use_milestone_rewards\": true\n  }'\n```\n\nIn progressive mode, `env_ids` order matters: the first environment is used exclusively at the start, then additional environments are gradually introduced over `curriculum_warmup_episodes`.\n\nNotes:\n- Use `-a/--env-args` for JSON kwargs forwarded to `load_environment()`\n- Reports land in `./environments/minigrid_adapted/reports/`\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `env_ids` | list[str] | `[\"MiniGrid-ObstructedMaze-Full-v1\", \"MiniGrid-LavaGapS7-v0\", \"MiniGrid-LockedRoom-v0\"]` | MiniGrid environment IDs. Order matters for progressive curriculum (easy→hard). |\n| `max_turns` | int | `20` | Maximum LLM generations before timeout. Total step budget = max_turns × 20. |\n| `num_examples` | int | `12` | Number of prompt seeds in the dataset. |\n| `seed` | int \\| null | `null` | RNG seed for reproducibility. |\n| `curriculum_mode` | str | `\"uniform\"` | `\"uniform\"` (random) or `\"progressive\"` (start with first env, gradually add more). |\n| `curriculum_warmup_episodes` | int | `100` | Episodes before all envs available (progressive mode only). |\n| `use_milestone_rewards` | bool | `false` | Enable game-specific milestone shaping. Recommended for smaller models (≤7B). |\n| `milestone_reward_weight` | float | `0.5` | Weight for milestone reward function. |\n| `native_reward_weight` | float | `7.5` | Weight for native MiniGrid reward. |\n\n### Observation Format\n\nEach observation includes:\n\n```\n<OBS id=N>\nGrid: WxH (origin at bottom-left, X→right, Y→up)\nYou: (x,y) facing direction, holding [item] or hands empty\nIn front (x,y): [object description] or empty or BLOCKED\nMission: [mission text if applicable]\nVisible objects:\n  - [color] [type] at (x,y)\n  - ...\n</OBS>\n```\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `native_reward` | MiniGrid's built-in reward (0.1–1.0 for wins, higher = faster completion) |\n| `milestone_reward_total` | Cumulative game-specific shaping reward |\n| `result` | Final outcome: `success`, `timeout`, `failure`, or `invalid` |\n| `steps` | Total MiniGrid environment steps executed |\n| `turns` | Total LLM generations used |\n\n### Milestone Reward Details\n\n**LockedRoom** (max 1.6):\n- Enter key room: +0.4\n- Pick up correct key: +0.6\n- Unlock correct door: +0.6\n\n**ObstructedMaze** (max 2.5):\n- Open first box: +0.5\n- Pick up first key: +0.5\n- Open first door: +0.5\n- Pick up blue ball (goal): +1.0\n\n**LavaGap** (potential-based):\n- Distance reduction to goal: +0.15 per Manhattan distance\n- Distance increase: -0.1 (mild backtrack penalty)\n- Pass through lava gap: +0.4\n- Exploration: +0.03 per new tile\n\n## Evaluation Reports\n\n<!-- Do not edit below this line. Content is auto-generated. -->\n<!-- vf:begin:reports -->\n<p>No reports found. Run <code>uv run vf-eval minigrid_adapted</code> to generate one.</p>\n<!-- vf:end:reports -->\n","encoding":"utf-8","truncated":false,"total_bytes":7318},"status":null}