{"data":{"kind":"file","path":"README.md","version_id":"o9k5rsk83efmrhzqk6l8g6v3","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3760,"modified_at":"2025-10-02T06:16:27.911000","content_hash":"9f4a0dde4e6d866d2bf0009f2908b36a9bd4cae407e0b8755a6d8557eb1115c7"},"entries":[],"content":"# boxoban\n\n### Overview\n- **Environment ID**: `boxoban`\n- **Short description**: Interactive Sokoban (DeepMind Boxoban levels) as a MultiTurn Verifiers environment.\n- **Tags**: rl, planning, sokoban, boxoban, multiturn, evaluation\n\n### Datasets\n- **Primary dataset**: DeepMind Boxoban Levels (10x10 puzzles; 4 boxes/goals)\n- **Source**: https://github.com/deepmind/boxoban-levels\n- **Splits**:\n  - `unfiltered`: train 900k, valid 100k, test 1k\n  - `medium`: train 450k, valid 50k\n  - `hard`: 3332 total\n\nThe environment auto-downloads the selected split(s) on first use and caches under `data/boxoban-levels/` in this package.\n\n### Task\n- **Type**: multi-turn\n- **Parser**: `ThinkParser` — agent may think in `<think>...</think>`; must output an action letter inside `<final>...</final>`\n- **Actions**: one per turn from `{U,D,L,R}` (up/down/left/right)\n- **Objective**: push all boxes `$` onto goals `.`; boxes on goals render as `*` and player-on-goal as `+`.\n\n### Quickstart\nRun an evaluation with default settings (downloads the full ~170MB dataset on first run):\n\n```bash\nuv run vf-eval boxoban\n```\n\nConfigure model, sampling, and env args:\n\n```bash\nuv run vf-eval boxoban \\\n  -m gpt-4.1-mini -n 16 -r 1 -t 1024 -T 0.7 \\\n  -a '{\n        \"difficulty\": \"unfiltered\",\n        \"split\": \"test\",\n        \"max_files\": 10,\n        \"num_examples\": 256,\n        \"max_turns\": 100,\n        \"download\": true\n      }'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- Large splits will take time to fetch on first run. 
Use `max_files` to subset for faster iteration.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `difficulty` | str | `\"hard\"` | One of `unfiltered`, `medium`, `hard` |\n| `split` | str? | `null` | For `unfiltered`/`medium`: `train`, `valid`, or `test` (`test` is listed only for `unfiltered`), or `null` for all |\n| `max_files` | int | `-1` | Limit number of source files per split (each contains 1000 levels) |\n| `num_examples` | int | `256` | Number of levels to sample for the dataset |\n| `seed` | int | `0` | Shuffle seed for level sampling |\n| `download` | bool | `true` | Auto-download selected files from the DeepMind repo |\n| `prefetch_all` | bool | `false` | If true, pre-download all difficulties/splits before sampling |\n| `max_turns` | int | `100` | Max agent turns per episode |\n| `console_log` | bool | `false` | Print step logs to stdout (`action`, `outcome`, `steps`, `solved`) |\n| `trace_in_state` | bool | `false` | Record a per-step trace with grids in `state.trace` |\n| `w_solved` | float | `1.0` | Reward weight for solved episodes |\n| `w_efficiency` | float | `0.2` | Reward weight for fewer steps (if solved) |\n| `w_validity` | float | `0.1` | Reward weight for valid-action ratio |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum of rubric components |\n| `solved_reward` | 1.0 if all boxes are on goals; else 0.0 |\n| `efficiency_reward` | Higher when solved in fewer steps (normalized by `max_turns`) |\n| `validity_reward` | Fraction of valid moves among attempted moves |\n| `format_reward` | Small bonus (+0.05) for adhering to `<think>...</think>` and `<final>...</final>` framing |\n\n### Determinism & Sampling\n- Default dataset: `difficulty=\"hard\"`, `seed=0`, `num_examples=256`. 
Levels are shuffled with `seed` and the first `num_examples` are used to build the dataset; `vf-eval -n` then evaluates the first `n` rows.\n- Reproducibility: Given the same `difficulty`, `split`, `seed`, and cached data, selection is deterministic. Concurrency does not affect which levels are chosen.\n- Exact selection example (3 fixed levels):\n  - `uv run vf-eval boxoban -m <model> -n 3 -r 1 -a '{\"difficulty\":\"hard\",\"seed\":0,\"num_examples\":3}'`\n","encoding":"utf-8","truncated":false,"total_bytes":3760},"status":null}