{"data":{"kind":"file","path":"README.md","version_id":"e8ov75zgclzrydrjchnp61d0","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2901,"modified_at":"2025-09-03T00:06:59.151000","content_hash":"97ea4e0fe4e0b25d1de0a9316c3b3faa73c097e9804177239af7d1e3339d1ec7"},"entries":[],"content":"# hud-browser-2048\n\n### Overview\n- **Environment ID**: `hud-browser-2048`\n- **Short description**: Browser-based 2048 game for evaluating agents using visual observations and keyboard actions\n- **Tags**: `game`, `browser`, `multimodal`, `CUA`, `2048`\n\n### Datasets\n- **Primary dataset(s)**: Built-in tasks with varying difficulty (64 to 256 tiles)\n- **Source links**: Generated programmatically in environment loader\n- **Split sizes**: 3 tasks (expandable)\n\n### Task\n- **Type**: multi-turn, tool use\n- **Parser**: ToolXMLParser with action validation\n- **Rubric overview**: Task completion (80%), format compliance (10%), tool execution (10%)\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval hud-browser-2048\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval hud-browser-2048 \\\n  -m gpt-4.1-mini \\\n  -n 1 -r 3 \\\n  -a '{\"max_turns\": 150}'  # env-specific args as JSON\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_turns` | int | `100` | Maximum moves allowed per game |\n| `system_prompt` | str | (see config) | Override agent instructions |\n| `config_path` | str | `./config.yaml` | Path to alternative config file |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar reward (weighted sum of task completion, format compliance, tool execution) |\n| `task_completion` | Logarithmic scale based on highest tile vs target (min(1.0, log(highest)/log(target))) |\n| `tool_execution` | Ratio of successful tool calls |\n| `format_compliance` | Score for correct XML formatting and action syntax |\n\n### How It Works\n\nThis environment uses `hud-vf-gym` to connect the browser-based 2048 game with the Verifiers RL framework:\n\n1. **Browser Control**:\n   - Uses `hudevals/hud-browser` Docker image with Playwright\n   - Takes screenshots to observe game state\n   - Sends keyboard inputs (arrow keys) to play\n\n2. **Action Mapping**:\n   - Agent calls `screenshot()` to see the game\n   - Agent calls `up()`, `down()`, `left()`, `right()` for moves\n   - Maps to environment's `computer` tool with appropriate key presses\n\n3. **Task Flow**:\n   - Setup: Launches 2048 web app in browser\n   - Play: Agent uses screenshots and arrow keys\n   - Evaluate: Checks if target tile was reached\n\nThe browser environment demonstrates HUD MCP's ability to wrap web applications as RL environments, enabling agents to learn from visual observations and keyboard interactions.\n\n### Relevant Links\n\n- [HUD Documentation](https://docs.hud.so)\n- [Build Your Own Environment](https://docs.hud.so/build-environments)\n- [Train Agents with Verifiers](https://docs.hud.so/train-agents)\n- [HUD Python SDK](https://github.com/hud-evals/hud-python)\n- [HUD VF Gym Adapter](https://github.com/hud-evals/hud-vf-gym)","encoding":"utf-8","truncated":false,"total_bytes":2901},"status":null}