{"data":{"kind":"file","path":"README.md","version_id":"uk49az5l0m6ztdkav8qlmaug","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2132,"modified_at":"2025-12-17T19:56:49.635000","content_hash":"94df9763e36837ac6f123f0aece2b2c1666f01750d85d69315e1f99c9eef7382"},"entries":[],"content":"# pacman-image\n\n### Overview\n- **Environment ID**: `pacman-image`\n- **Short description**: Multi-turn, image-only Pacman where an LLM receives screenshots and responds with `<action>` tags.\n- **Tags**: pacman, multi-turn, vision, game, eval, train\n\n### Datasets\n- **Primary dataset(s)**: Synthetic rollouts; each example spins up a fresh Pacman game seeded from code defaults.\n- **Source links**: Generated in-code via `datasets.Dataset.from_dict`.\n- **Split sizes**: Controlled by `num_examples` (default 10) for both eval and training-style runs.\n\n### Task\n- **Type**: multi-turn\n- **Parser**: `XMLParser(fields=[\"action\"], answer_field=\"action\")`\n- **Rubric overview**: Win bonus (1.0), normalized score (0.4 weight), survival time (0.1), and a small format reward (0.1) to enforce `<action>` tags.\n\n### Quickstart\nRun an evaluation with defaults:\n\n```bash\nuv run vf-eval pacman-image\n```\n\nCustomize model/sampling and env args (example: fewer games, slower control loop):\n\n```bash\nuv run vf-eval pacman-image \\\n  -m gpt-4.1-mini \\\n  -n 5 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"num_examples\": 5, \"max_turns\": 75, \"action_repeat\": 2}'\n```\n\nNotes:\n- `-a` / `--env-args` passes JSON straight into `load_environment`.\n- Rendering is headless-friendly (`SDL_VIDEODRIVER=dummy`) but still saves PNGs to `screenshot_dir`.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `10` | Number of independent Pacman games to evaluate. |\n| `max_turns` | int | `200` | Max LLM decisions per game (overrides CLI default of 50 if set). |\n| `action_repeat` | int | `1` | How many game frames to repeat each chosen action. |\n| `screenshot_dir` | str | `\"screenshots\"` | Where PNG frames are stored per rollout/task. |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted rubric score (win + score + survival + format). |\n| `format_reward` | Parser-conformance bonus for well-formed `<action>` tags. |\n| `score_normalized` | Game score normalized to ~[0,1] for comparison across runs. |\n| `win` | Binary indicator for clearing all pellets without dying. |\n","encoding":"utf-8","truncated":false,"total_bytes":2132},"status":null}