{"data":{"kind":"file","path":"README.md","version_id":"r18kih4hu44hgchjh1q4rh8u","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":11766,"modified_at":"2025-08-28T18:48:20.743000","content_hash":"429a3a9f17313904b754ca8c9145171863f95012af30478574af97a23128e6ad"},"entries":[],"content":"# balrog-prime\r\n\r\nSource implementation: https://github.com/balrog-ai/BALROG.git\r\n\r\nUnified adapter that exposes BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games) environments (NLE, MiniHack, BabyAI, TextWorld, Babaisai, Crafter) as a verifiers MultiTurnEnv while preserving BALROG-style agent interaction.\r\n\r\n### Overview\r\n- Environment ID: `balrog-prime`\r\n- Short description: Adapter to run BALROG RL environments through verifiers using a multi-turn chat loop that mirrors BALROG’s agent→env protocol.\r\n- Tags: multi-turn, balrog, NLE, MiniHack, BabyAI, TextWorld, Babaisai, Crafter, eval, VLM, interactive, long-horizon, Reasoning, Game, Agentic\r\n\r\n### Datasets\r\n- Primary dataset(s): Synthetic episodes built on-the-fly from BALROG’s config (envs+tasks). 
Each row represents an episode with initial observation/instruction captured by pre-resetting the underlying BALROG env.\r\n- Source: BALROG is installed from upstream git as a package dependency (no local checkout required)\r\n- Split sizes: By default, train and eval are the same constructed rows; `num_eval_samples` controls how many rows per task are produced.\r\n\r\n### Task\r\n- Type: multi-turn\r\n- Parser: Permissive free-form parser that:\r\n  - Extracts `<action>...</action>` if present, else\r\n  - Falls back to the last non-empty line of the assistant’s message.\r\n  - Numeric values are optionally mapped to an action by index if an action vocabulary is available.\r\n  - Action validity is enforced by BALROG’s `EnvWrapper.check_action_validity` exactly like BALROG’s evaluator.\r\n- Rubric overview (defaults can be tuned via weights):\r\n  - `success_reward`: 1.0 if episode ends with true termination (not just time truncation); else 0.0.\r\n  - `progress_reward`: normalized episode return as a small shaping signal (off by default).\r\n  - `efficiency_reward`: higher when solved in fewer steps (off by default).\r\n  - `format_reward`: presence of a parsable action (off by default).\r\n\r\n### Quickstart\r\nRun an evaluation with default settings (uses BALROG config to pick tasks):\r\n```bash\r\nuv run vf-eval balrog-prime\r\n```\r\n\r\nSpecify environment and tasks explicitly:\r\n```bash\r\nuv run vf-eval balrog-prime \\\r\n  -a '{\"env_name\":\"textworld\",\"tasks\":[\"treasure_hunter\"],\"num_eval_samples\":2}' \\\r\n  -n 2\r\n```\r\nNote: set `-n = len(tasks) × num_eval_samples`.\r\n\r\nApply BALROG config overrides (OmegaConf dot-keys):\r\n```bash\r\nuv run vf-eval balrog-prime \\\r\n  -a '{\r\n        \"env_name\": \"nle\",\r\n        \"tasks\": [\"NetHackChallenge-v0\"],\r\n        \"overrides\": {\"eval.max_steps_per_episode\": 60, \"envs.nle_kwargs.skip_more\": true}\r\n      }'\r\n```\r\n\r\nConfigure model/sampling:\r\n```bash\r\nuv run vf-eval 
balrog-prime \\\r\n  -m gpt-4.1-mini -n 10 -r 1 -t 2048 -T 0.7\r\n```\r\n\r\n### Smell test\r\n\r\nThe smell test consists of the following command:\r\n\r\n```bash\r\nexport ENVLIST=\"nle minihack babyai textworld babaisai crafter\"\r\nexport MODEL=\"gpt-4o\"\r\nexport BASE=\"https://api.openai.com/v1\"\r\nexport KEY_VAR=\"OPENAI_API_KEY\"  # name of the env var that holds the API key, as passed to -k elsewhere in this README\r\nfor ENV in $ENVLIST; do   echo \"== $ENV :: 1 task × 1 episode ==\";   uv run vf-eval -s balrog-prime     -m \"$MODEL\" -b \"$BASE\" -k \"$KEY_VAR\"     -n 1     -a \"{\\\"env_name\\\":\\\"$ENV\\\",\\\"num_eval_samples\\\":1,\\\"include_images\\\":true,\\\"image_transport\\\":\\\"structured\\\",\\\"image_max_history\\\":1,\\\"overrides\\\":{\\\"eval.max_steps_per_episode\\\":50}}\"; done\r\n```\r\n\r\nAll outputs can be found in the `outputs/evals` directory.\r\n\r\n### VLM example\r\nIf your chosen BALROG sub-environment emits frames (e.g., NLE under certain configs), you can enable multimodal prompts:\r\n```bash\r\nuv run vf-eval balrog-prime \\\r\n  -m gpt-4.1-mini \\\r\n  -n 4 \\\r\n  -a '{\"env_name\":\"nle\",\"num_eval_samples\":2,\"include_images\":true,\"image_transport\":\"structured\",\"image_max_history\":1}'\r\n```\r\nNotes:\r\n- `image_transport=\"structured\"` is recommended (OpenAI-style content parts). `\"data_url\"` inlines base64 into text for debugging only.\r\n- For image emission, BALROG requires `agent.max_image_history > 0`. 
The adapter sets this automatically if `include_images=true`.\r\n\r\n### Task inventory and choosing -n\r\nSince eval creates K=num_eval_samples episodes per task, the total eval rows is:\r\n- total_rows = len(tasks) × K\r\n- Set -n ≤ total_rows\r\n\r\nList tasks for an env (reads BALROG’s installed config):\r\n```bash\r\npython - <<'PY'\r\nfrom importlib import resources\r\nfrom omegaconf import OmegaConf\r\nenv = \"babyai\"  # change to: nle | minihack | babyai | textworld | babaisai | crafter\r\ncfg = OmegaConf.load(resources.files(\"balrog\") / \"config\" / \"config.yaml\")\r\ntasks = list(getattr(cfg.tasks, f\"{env}_tasks\"))\r\nprint(f\"env={env}, num_tasks={len(tasks)}\")\r\nfor t in tasks:\r\n    print(\"-\", t)\r\nPY\r\n```\r\n\r\nCompute -n automatically for “all tasks”:\r\n```bash\r\nENV=babyai      # change as needed\r\nK=10            # num_eval_samples\r\npython - <<PY\r\nfrom importlib import resources\r\nfrom omegaconf import OmegaConf\r\nenv = \"$ENV\"\r\nk = int(\"$K\")\r\ncfg = OmegaConf.load(resources.files(\"balrog\") / \"config\" / \"config.yaml\")\r\ntasks = list(getattr(cfg.tasks, f\"{env}_tasks\"))\r\nprint(len(tasks) * k)\r\nPY\r\n```\r\n\r\nYou can embed that value directly into your run, for example (WSL bash):\r\n```bash\r\nENV=babyai\r\nK=10\r\nN=$(python - <<PY\r\nfrom importlib import resources\r\nfrom omegaconf import OmegaConf\r\nenv = \"$ENV\"; k = int(\"$K\")\r\ncfg = OmegaConf.load(resources.files(\"balrog\") / \"config\" / \"config.yaml\")\r\ntasks = list(getattr(cfg.tasks, f\"{env}_tasks\"))\r\nprint(len(tasks) * k)\r\nPY\r\n)\r\nuv run vf-eval balrog-prime -m gpt-4.1 -b https://api.openai.com/v1 -k OPENAI_API_KEY \\\r\n  -n \"$N\" -a \"{\\\"env_name\\\":\\\"$ENV\\\",\\\"num_eval_samples\\\":$K}\"\r\n```\r\n\r\nNotes:\r\n- BALROG is installed from git (declared in pyproject). 
No local checkout or sys.path tweaking is required.\r\n- You must have the dependencies for the selected BALROG env installed (e.g., NLE/MiniHack/TextWorld assets).\r\n\r\n### Assets and auto-download\r\nSome BALROG sub-environments require external assets that are not bundled inside Python wheels (size/licensing). This adapter bootstraps them on first use (can be disabled).\r\n\r\n- TextWorld\r\n  - Needs: Pre-generated game bundles (tw_games).\r\n  - Behavior: Automatically downloaded into the installed `balrog` package directory on first use.\r\n  - Disable auto-download:\r\n    - Add to env args: `{\"auto_download_assets\": false}`\r\n    - Place games manually under: `$(python -c \"import importlib.resources as r, pathlib; print(pathlib.Path(r.files('balrog')).parent / 'tw_games')\")`\r\n  - System prereqs (Ubuntu): `sudo apt update && sudo apt install -y build-essential libffi-dev python3-dev curl git`\r\n\r\n- MiniHack (Boxoban tasks only)\r\n  - Needs: Boxoban level packs for tasks like `MiniHack-Boxoban-*`.\r\n  - Behavior: If Boxoban maps are missing and `auto_download_assets=true` (default), we invoke the official downloader via Python (`minihack.scripts.download_boxoban_levels`). 
If maps remain unavailable, Boxoban tasks are skipped with a warning while other MiniHack tasks run.\r\n  - Disable auto-download:\r\n    - Add to env args: `{\"auto_download_assets\": false}`\r\n    - Manual fetch (in the same venv used by vf-eval):  \r\n      `uv run python -m minihack.scripts.download_boxoban_levels`\r\n\r\n- NLE, BabyAI (Text wrapper), Crafter, Babaisai\r\n  - No extra asset downloads required beyond the Python packages.\r\n\r\nDisable auto-download globally for a run:\r\n```bash\r\nuv run vf-eval balrog-prime -m gpt-4.1 -b https://api.openai.com/v1 -k OPENAI_API_KEY \\\r\n  -n 6 -a '{\"env_name\":\"minihack\",\"num_eval_samples\":1,\"auto_download_assets\":false}'\r\n```\r\n\r\n### Environment Arguments\r\nThese are passed via `-a/--env-args` as JSON and map directly to `load_environment(...)`.\r\n\r\n| Arg | Type | Default | Description |\r\n| --- | ---- | ------- | ----------- |\r\n| `env_name` | str | `\"nle\"` | One of: `nle`, `minihack`, `babyai`, `textworld`, `babaisai`, `crafter`. |\r\n| `tasks` | list[str] | from BALROG config | Tasks for the chosen env; if omitted, uses `config.tasks[{env}_tasks]`. |\r\n| `num_eval_samples` | int | `5` | Per-task episode rows to construct. |\r\n| `balrog_config_path` | str | auto-detected | Resolved inside the installed package at `balrog/config/config.yaml`. |\r\n| `overrides` | dict | `{}` | OmegaConf deep updates, e.g. `{\"eval.max_steps_per_episode\": 200}`. |\r\n| `include_action_list` | bool | `true` | Include a concise action vocabulary list in messages (truncated to 30). |\r\n| `invalid_parse_strikes` | int | `2` | Number of parse failures tolerated before stronger warnings. |\r\n| `base_seed` | int or null | `null` | Base seed for episode construction (for reproducibility). |\r\n| `rubric_weights` | dict | `{\"success\":1.0,\"progress\":0.0,\"efficiency\":0.0,\"format\":0.0}` | Weights for reward components. 
|\r\n| `auto_download_assets` | bool | `true` | Auto-fetch TextWorld games and MiniHack Boxoban maps when missing (skips Boxoban tasks if unavailable). |\r\n| `reward_mode` | str | `\"return\"` | One of `\"return\"`, `\"success\"`, `\"progress\"`, `\"hybrid\"`. Selects primary reward composition. |\r\n| `include_images` | bool | `false` | Enable VLM image delivery from BALROG wrappers (requires env support). |\r\n| `image_transport` | str | `\"structured\"` | How to deliver images: `\"structured\"` (recommended, uses content parts with `image_url`), `\"data_url\"` (embed into text; debug only), or `\"none\"`. |\r\n| `image_format` | str | `\"png\"` | Image encoding format for data URLs and saved files. |\r\n| `image_max_history` | int | `1` | Number of recent images to include per turn when images are enabled. |\r\n| `provider` | str | `\"openai\"` | Provider hint for structured formatting (OpenAI-style content parts). |\r\n| `log_multimodal_payload` | bool | `false` | When true, writes the exact ChatMessage payloads to `outputs/balrog_prime_payloads/...` with base64 truncated. |\r\n| `save_images_debug` | bool | `false` | Save per-step frames to disk for inspection when available. |\r\n| `on_invalid_parse` | str | `\"warn\"` | Behavior once `invalid_parse_strikes` is reached: `\"warn\"` (default), `\"show_actions\"` (inject allowed actions once), or `\"truncate\"` (end the episode early). |\r\n- Validity is enforced by BALROG’s `check_action_validity`.\r\n\r\n### Invalid parse handling\r\nThis environment counts consecutive parse failures (i.e., when the model output doesn’t contain a usable action). 
Control behavior with:\r\n- `invalid_parse_strikes` (default: 2): threshold for escalation\r\n- `on_invalid_parse`:\r\n  - `\"warn\"`: keep reminding the model to output a single action (default)\r\n  - `\"show_actions\"`: inject a concise list of allowed actions (truncated to 30) after the threshold\r\n  - `\"truncate\"`: mark the episode as truncated and terminate to conserve tokens\r\n\r\nExample:\r\n```bash\r\nuv run vf-eval balrog-prime \\\r\n  -a '{\"env_name\":\"nle\",\"invalid_parse_strikes\":2,\"on_invalid_parse\":\"show_actions\"}'\r\n```\r\n\r\n### Metrics\r\n- `reward`: Weighted sum of rubric components.\r\n- `success_reward`: 1.0 when the episode ends with true termination (not just timeout).\r\n- `progress_reward`: Normalized episode return proxy (off by default).\r\n- `efficiency_reward`: Higher when solved in fewer steps (off by default).\r\n- `format_reward`: Positive when an action can be parsed (off by default).\r\n\r\n### Requirements and setup\r\n- Ensure the required external dependencies for the selected BALROG environment are installed (e.g., NLE, MiniHack, TextWorld, etc.) and assets are available.\r\n- BALROG is installed from git and imported as a normal package; config is located via importlib.resources.\r\n\r\n### Implementation notes\r\n- Episodes are pre-initialized to capture the initial question (instruction + observation), then the live env session is maintained across turns via an in-memory session manager keyed by `episode_id`.\r\n- State is stored as JSON (step, done flags, returns, etc.) similar to other verifiers envs; env objects themselves are kept in memory (not serialized).\r\n","encoding":"utf-8","truncated":false,"total_bytes":11766},"status":null}