{"data":{"kind":"file","path":"README.md","version_id":"pbjrrk5sj7lx8hfgwna1v8uw","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":12115,"modified_at":"2026-06-13T07:09:26.155000","content_hash":"bc5e645ec7f521fdce6ff316e48c12c0433de940d5a9239821a2450b3dc6ca54"},"entries":[],"content":"# nethack\n\nA Prime Intellect verifiers environment for training and evaluating language-model agents on NetHack.\n\nThis is **layer 2** — a thin wrapper around the interface-agnostic `nethack_core` substrate. See `../../docs/design.md` for the full architecture and feature roadmap.\n\n## Quickstart\n\n```bash\n# from the repo root: fetch + build the NetHack fork engine first\n# (needs cmake/bison/flex/libbz2-dev). nle/minihack are no longer deps.\ngit submodule update --init --recursive\nbash nethack_core/build_engine.sh   # -> third_party/NetHack/src/build/libnethack.so\n\n# install the workspace (--all-packages pulls numpy/gymnasium, formerly\n# transitive via nle)\nuv sync --all-packages\n\n# smoke test against an OpenAI-compatible endpoint\nuv run vf-eval nethack -m gpt-4.1-mini -n 3 -r 1 -a '{\"tier\": \"mini_dungeon\"}'\n```\n\nSee [`../../docs/engine-layer.md`](../../docs/engine-layer.md) for the engine\nAPI (snapshot/branch, level blobs, state modification, difficulty knobs).\n\n## Arguments\n\n`load_environment(...)` accepts:\n\n| arg                | type             | default              | meaning                                              |\n|--------------------|------------------|----------------------|------------------------------------------------------|\n| `tier`             | str or None      | `\"corridor_explore\"` | Curriculum tier name; None = uniform across all      |\n| `n_examples`       | int              | 256                  | Dataset size                                         |\n| `seed`             | int              | 0                    | RNG seed for dataset construction                    |\n| `max_turns`        | int              | 200                  | Per-rollout LM turn cap                              |\n| `interface`        | str              | `\"skill\"`            | `\"skill\"` (one tool per skill) or `\"code\"` (sandboxed Python with `nh` namespace) |\n| `sub_lm`           | SubLM or None    | None                 | Backend for `nh.summarize/plan/recall_lm`. Default at rollout time: `OfflineSubLM` |\n| `subgoal_proposer` | Proposer or None | None                 | Backend for the `dynamic_subgoal` tier. Default: `OfflineSubgoalProposer` |\n| `variant`          | str              | `\"B1\"`               | Observation/skill preset (see [Observation variants](#observation-variants)). |\n| `compact_obs`      | bool             | True                 | Glyph-run encoding, blank-row strip, inventory diff. Token lever, not a capability lever. |\n| `skill_set`        | str              | `\"full\"`             | `\"full\"`, `\"dir8\"`, `\"move\"`, or a CSV whitelist of skills (NetPlay uses a curated CSV with no low-level `move`). |\n| `trace_dir`        | str or None      | None                 | If set, writes per-turn NDJSON (raw grid + rendered obs + assistant msg + tool calls + reward) for offline replay. |\n| `continual`        | bool             | False                | Auto-reseed NLE on death and carry the journal/belief state across lives. |\n| `continual_lives`  | int              | 5                    | Max lives when `continual=True`. |\n\n### CLI gotcha: `-a` vs `-x`\n\nOverride env args from the CLI with `-a` (env-args, baked at construction), NOT\n`-x` (extra-env-kwargs, applied via `env.set_kwargs()` AFTER construction):\n\n```bash\nprime eval jonathanliu/nethack -m Qwen/Qwen3.5-9B -n 1 -r 1 \\\n  -a '{\"tier\": \"dynamic_subgoal\", \"interface\": \"code\", \"max_turns\": 30}'\n```\n\n`interface` (skill vs code) bakes the tool list at construction time, so passing\nit via `-x` is silently ignored. The hosted-eval writeup for Qwen3.5-9B v0.0.14\nhit exactly this: `-x '{\"max_turns\": 30}'` had no effect and the rollout ran to\nthe default cap of 200 turns. **Always pass env config through `-a`.** See\n`docs/EVAL_RECIPES.md`.\n\n## Observation variants\n\nThe `variant` kwarg selects a per-turn observation/skill preset. These let you\nA/B the observation surface without touching env internals; each is a single\n`load_environment(variant=...)` setting. They are wired up and swept by\n`experiments/exp16_obs_variants.py`; see `experiment_log.md` for findings.\n\n| code | source | what it changes |\n|------|--------|-----------------|\n| `B1` | current default | Standing baseline: ASCII grid + compaction + journal. |\n| `B0` | calibration | All compaction off (raw rendering). Isolates whether compaction is load-bearing. |\n| `G`  | Glyphbox (Wang, 2026) | ASCII + adjacency + hostile-list + code-mode tool surface. |\n| `B`  | BALROG (Paglieri et al., ICLR 2025) | No ASCII grid; natural-language scene description only. |\n| `N`  | NetPlay (Jeurissen, CoG 2024) | Skill-only action surface (no low-level `move(direction=…)`). |\n| `R`  | CPP/GPP | Belief state every 25 turns + hard-drop history before the last checkpoint. |\n| `P`  | Continual Harness (arXiv:2605.09998) | Periodic self-refinement directive (update journal objective / record a lesson). |\n| `CH` | Continual Harness (full) | Teacher \"Refiner\" model edits prompt + sub-agents + skill macros + memory. |\n| `ND` | this repo | NetPlay skill set + a persistent `=== DESCENT STATUS ===` salience block. |\n| `FD` | this repo | `find_and_descend` autopilot skill surface + descent salience block. |\n| `E1` | this repo (Wave-3 C) | Surfaces `find_frontiers` output: `=== FRONTIERS ===` (top-5 nearest, with bearing + tile kind), `=== EXPLORATION ===` (coverage + per-turn scout delta), `=== SPATIAL BELIEF ===` (bearings + known stairs coords). Replaces the legacy descent-salience exhortation with pure spatial information. Skill-only + compacted obs (same as `N`). |\n\n**Findings so far** (preliminary, Qwen3.5-9B, seeds 22–26, 200-turn budget):\nthe ASCII grid is load-bearing — `B` (no grid) collapses capability. Compaction\n(`B0` vs `B1`) is a token/cost lever, not a capability lever. The descent\nbottleneck (reaching dungeon level 2) is the dominant failure mode: agents\nexplore but starve or die while looping on the first level. Skill-only surfaces\n(`N`) and the `v0.0.65` deadlock-breaker are the levers under active study;\nsee `experiment_log.md` for the live numbers.\n\n## Tiers\n\nAll tiers now run on the NetHack fork engine. The former MiniHack synthetic\ntiers have been **retired** in the engine migration; a `nle_task` containing\n`\"MiniHack\"` raises at construction. Synthetic levels are now produced via the\nengine's level-blob load path instead (`save_level`/`load_level`,\n[`../../docs/engine-layer.md`](../../docs/engine-layer.md)).\n\n### Native NetHack tiers\n\n| tier               | nle_task          | max_steps | success milestone                | description                                            |\n|--------------------|-------------------|-----------|----------------------------------|--------------------------------------------------------|\n| `corridor_explore` | `NetHackScore-v0` | 2,000     | `reach_dlvl(2)`                  | **Default.** Real NetHack; reach dungeon level 2.      |\n| `mini_dungeon`     | `NetHackScore-v0` | 4,000     | `reach_dlvl(3)`                  | Reach dungeon level 3.                                 |\n| `mines_to_minetown`| `NetHackScore-v0` | 8,000     | `mine_town_milestone`            | Find the Gnomish Mines branch; reach Mine Town.        |\n| `sokoban_complete` | `NetHackScore-v0` | 10,000    | `sokoban_complete_milestone`     | Solve the Sokoban puzzle branch.                       |\n| `oracle_consult`   | `NetHackScore-v0` | 8,000     | `oracle_consult_milestone`       | Find and pay the Oracle of Delphi.                     |\n| `full_dungeon_easy`| `NetHackScore-v0` | 10,000    | `reach_dlvl(6)`                  | Standard NetHack with reduced max depth.               |\n| `full_nle`         | `NetHackScore-v0` | 100,000   | none (ascension via tty markers) | The full game. Ascend.                                 |\n| `dynamic_subgoal`  | `NetHackScore-v0` | 4,000     | per-rollout (LLM-proposed)       | Proposer LLM emits an objective + termination_check; the env compiles it into a Milestone. |\n\n### MiniHack synthetic (retired)\n\nThe old `empty_room` / `solo_combat` / `multi_combat` MiniHack tiers (formerly\n`MiniHack-Skill-Custom-v0`, gated behind `pip install nethack[minihack]`) have\nbeen removed. `minihack` is no longer a dependency. Selecting a `\"MiniHack\"`\ntask now raises at `NetHackCoreEnv` construction. The replacement for fixed\nsynthetic levels is the engine's concrete level-blob path (generate a floor,\n`save_level` it to an asset, `load_level` it at reset).\n\n## Rewards\n\nThe rubric is built from four `@vf.reward(weight=...)` functions in\n`nethack.py`:\n\n| reward             | weight | fires on                                                                 |\n|--------------------|--------|--------------------------------------------------------------------------|\n| `scout_reward`     | 1.0    | Per-step `scout_delta / 1000.0` — newly-revealed dungeon tiles this step. |\n| `descent_reward`   | 10.0   | +1 (× weight) the first time the agent reaches a new max dungeon level.   |\n| `success_reward`   | 100.0  | +1 (× weight) when the tier's `success_milestone` fires.                  |\n| `ascension_reward` | 1000.0 | +1 (× weight) when `_detect_terminal_outcome` finds an ascension marker.  |\n\nWe deliberately do **not** use NetHack's in-game score as a training signal —\nit's gameable. See design doc §3.4. The four shaped rewards form an\nexponentially-spaced ladder (1 → 10 → 100 → 1000) so the gradient always\npoints at the deepest unlocked rung.\n\n### Reading the reward signal\n\n`avg_score` reported by `prime eval` is the **unweighted sum** of the four\nraw reward-function values, *not* the rubric-weighted total. Decompose it\nwith `prime eval samples <id> -o json` — each sample carries `scout_reward`,\n`descent_reward`, `success_reward`, and `ascension_reward` directly. A score\nof `2.155`, for example, is `scout 0.155 + descent 1 + success 1` — a rollout\nthat explored, descended to dlvl 2, and fired the `corridor_explore`\nmilestone. Real Qwen3.5-9B rollouts reach this; scout reward accumulates\ncorrectly across the trajectory.\n\nTwo things to keep in mind when interpreting short evals:\n\n1. **Sparse by design.** `descent_reward`/`success_reward`/`ascension_reward`\n   only fire on milestones. For a non-fine-tuned LM, only `scout_reward` is\n   expected to be nonzero until the agent actually descends.\n2. **Per-step averaging hides scout reward.** If you look at verifiers'\n   per-step `avg_metrics` rather than the trajectory sum, `scout_reward`\n   (≤ ~0.05/step, exactly 0 on steps that reveal no new tiles) rounds to 0.0\n   in a two-decimal display. Sum across the trajectory, or read\n   `state[\"scout_tiles_seen\"]`, to see it accumulating.\n\nImplementation notes for anyone extending the rubric: scout tiles are keyed\nby `(max_dlvl_reached, x, y)`, and `max_dlvl_reached` is bumped at the end of\n`env_response`, so the first step on a new dlvl attributes its tiles to the\nprevious dlvl. Journal-op skills deliberately zero `scout_delta` and return\nbefore stepping, so a journal-heavy agent shows `scout_reward: 0` for those\nturns regardless of what's on screen.\n\n### Replaying rollouts\n\n`tools/render_rollout_video.py` renders an animated GIF/MP4 of a rollout\n(ASCII map + status + per-turn tool call) from either a hosted eval\n(`--eval-id`) or a local `trace_dir` NDJSON (`--ndjson`). `tools/dashboard.py`\nis a browseable web dashboard over all evals: per-variant reward decomposition\nplus a turn-by-turn replay view.\n\n## Status\n\nLive on the Hub at [`jonathanliu/nethack`](https://app.primeintellect.ai/dashboard/environments/jonathanliu/nethack).\nPublished: **v0.0.64** (hosted eval pins the latest published version, not\nlocal code). Verified end-to-end against Qwen3.5-9B in hosted eval across the\nobservation variants above — no crashes, both `skill` and `code` interfaces.\nRollouts reach descent + the `corridor_explore` success milestone (e.g. the\nNetPlay `N` variant on seeds 22–23). The descent-reliability work in\n`v0.0.65` (deadlock-breaker + descent-salience obs) is under validation; see\n`experiment_log.md` and `experiments/results/` for the live numbers.\n","encoding":"utf-8","truncated":false,"total_bytes":12115},"status":null}