{"data":{"kind":"file","path":"README.md","version_id":"enqzv6w3nhkp4fnqz69n67x4","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":20395,"modified_at":"2026-05-14T10:14:19.726000","content_hash":"211885667955a2218f5364842b52e6379f54cba7a24fd38941ecba2221a732d7"},"entries":[],"content":"# megaminx-solver\n\n[![Prime Hub](https://img.shields.io/badge/Prime%20Hub-setrf%2Fmegaminx--solver-blue)](https://app.primeintellect.ai/dashboard/environments/setrf/megaminx-solver)\n[![Latest env](https://img.shields.io/badge/env-v0.2.57-0f766e)](https://app.primeintellect.ai/dashboard/environments/setrf/megaminx-solver)\n[![Breakthrough eval](https://img.shields.io/badge/v0.2.22%2035B-0.929-2563eb)](https://app.primeintellect.ai/dashboard/evaluations/vse6uoo8c9y156svyyv41qll)\n[![GitHub](https://img.shields.io/badge/GitHub-setrf%2Fmegaminx--world--model--bench-black)](https://github.com/setrf/megaminx-world-model-bench)\n\n`megaminx-solver` is a trainable Prime Intellect / Verifiers environment for\nstudying whether language models can learn a physical puzzle world through\nstate, tool use, reward, and repeated rollouts.\n\nThe environment simulates a symbolic Megaminx, a twelve-faced twisty puzzle.\nModels observe compact facelet or sensor text, act through native tools, and\nreceive deterministic rewards from a persistent simulator. The current release\nfocuses on a hardened depth-2 candidate-path curriculum for native Hosted\nTraining and an oracle trajectory exporter for SFT warm starts.\n\n## What This Environment Tests\n\nThis is not a trivia benchmark. 
The model must interact with a world whose\nstate changes after every action.\n\nThe environment tests:\n\n- **Physical state tracking:** stickers move when a face turns, and future\n  observations depend on earlier actions.\n- **Local geometry:** each face turn affects one pentagonal face and a ring of\n  adjacent strips.\n- **Tool grounding:** the answer is not a text string; the model must call the\n  right tool with valid arguments.\n- **Planning horizon:** depth-1 tasks test first-action grounding; depth-2\n  tasks require acting, reading the refreshed world state, and acting again.\n- **Protocol discipline:** native tool calls, text fallback actions, malformed\n  calls, and protocol violations are measured separately.\n- **Shortcut resistance:** hidden scramble metadata is never printed, and later\n  releases remove visible masks, row ids, fixed slots, and other accidental\n  hints found during audits.\n\n## Rollout Contract\n\nEvery rollout follows this contract:\n\n1. A deterministic scramble is generated from `split`, `seed`, and example\n   index.\n2. The simulator applies the scramble to a solved Megaminx.\n3. The inverse solution is stored in task metadata for scoring, tests, and\n   metrics, but not shown to the model.\n4. The prompt prints a compact observation or staged sensor/candidate view.\n5. The model acts through native tools or, only for configured JSON prompt\n   styles, a text action fallback.\n6. The environment mutates the same puzzle state after each action.\n7. The rubric scores the final trajectory and emits diagnostic metrics.\n\nThe implementation class is `MegaminxEnv(vf.StatefulToolEnv)`, so rollout state\nis persistent rather than reconstructed from text.\n\n## Observations\n\nThe base observation includes:\n\n- solved/not-solved status\n- sticker accuracy\n- piece accuracy\n- move budget\n- last move\n- compact face lines for faces `A` through `L`\n\nA solved face contains only its own label. 
Centers are fixed and identify the\ntarget color for a face. Edge and corner strings show the sticker labels\ncurrently visible on that face.\n\nStaged prompt styles may replace or augment the full net with sensor tables,\ndirection-flow evidence, candidate faces, candidate-relative flow tokens, or a\nrefreshed candidate table after the first action.\n\nPrompts intentionally do not expose:\n\n- scramble moves\n- inverse solution\n- hidden answer\n- row-derived ids in the current v0.2.56 candidate-path lane\n- hidden candidate-slot seeds\n\n## Tools\n\nGeneral puzzle tools:\n\n| Tool | Arguments | Meaning |\n| --- | --- | --- |\n| `rotate` | `face: A-L`, `direction: cw\\|ccw` | Apply one legal Megaminx face turn. |\n| `inspect` | `face: A-L\\|all` | Return one face or the full compact net without changing state. |\n| `finish` | none | End the rollout and report the final state. |\n\nStaged tools used by research lanes:\n\n| Tool | Arguments | Meaning |\n| --- | --- | --- |\n| `select_candidate` | `index: 1-4`, `direction: cw\\|ccw` | Rotate the face shown in a visible candidate slot. Current depth-2 lanes call this twice. |\n| `select_candidate_index` | `index: 1-4` | Select a candidate face without choosing direction. Used for face-discovery ablations. |\n| `predict_rotate` | `face`, `direction`, `predicted_after` | Predict five local post-move strips, then apply the move. |\n\nCandidate prompt styles expose only their matching candidate tool by default.\nFor example, `stage_candidate_relative_flow_rule_solve2_native_tool` exposes\n`select_candidate` rather than the full general tool set.\n\nImportant schema detail: `MegaminxEnv` implements all six tools. With\n`exposed_tool_names=None`, ordinary non-candidate prompt styles expose all six\nimplemented tools even though the prompt asks the model to use the general\n`rotate`/`inspect`/`finish` loop. 
Candidate prompt styles automatically narrow\nthe schema to `select_candidate` or `select_candidate_index`.\n\n## `load_environment` API\n\n```python\nfrom megaminx_solver import load_environment\n\nenv = load_environment(\n    split=\"train\",\n    min_depth=1,\n    max_depth=8,\n    num_examples=200,\n    seed=42,\n    max_turns=None,\n    move_budget=None,\n    reward_style=\"dense\",\n    prompt_style=\"default\",\n    allow_text_tool_actions=None,\n    exposed_tool_names=None,\n)\n```\n\n| Argument | Default | Description |\n| --- | --- | --- |\n| `split` | `\"train\"` | Named curriculum split. |\n| `min_depth` | `1` | Minimum scramble depth for custom splits. |\n| `max_depth` | `8` | Maximum scramble depth for custom splits. |\n| `num_examples` | `200` | Deterministic dataset size. |\n| `seed` | `42` | Dataset seed. |\n| `max_turns` | `None` | Global Verifiers turn cap. |\n| `move_budget` | `None` | Puzzle move budget shown to the model. |\n| `reward_style` | `\"dense\"` | Rubric/reward family. |\n| `prompt_style` | `\"default\"` | Observation and tool protocol family. |\n| `allow_text_tool_actions` | `None` | Enables JSON/private-text action parsing for JSON prompt styles. |\n| `exposed_tool_names` | `None` | Optional explicit tool subset. |\n\nBy default, `allow_text_tool_actions` is true only for JSON prompt styles and\nfalse for native Hosted Training styles.\n\nPublic package exports:\n\n```text\nMegaminxEnv\nbuild_dataset\nload_environment\nMegaminxPuzzle\nMegaminxTopology\nFACES\nPOSITIONS_PER_FACE\nSTICKERS_PER_PUZZLE\nEDGE_COUNT\nCORNER_COUNT\ngenerate_scramble\ninverse_moves\n```\n\n## Splits And Curricula\n\n| Split | Depths | Purpose |\n| --- | ---: | --- |\n| `depth1`, `train_depth1`, `eval_depth1` | 1 | One-turn action grounding. |\n| `easy`, `train_easy`, `eval_easy` | 1-3 | Short-scramble generalization. |\n| `medium`, `train_medium`, `eval_medium` | 4-6 | Mid-horizon evaluation. |\n| `hard`, `train_hard`, `eval_hard` | 7-10 | Hard evaluation. 
|\n| `eval` | 1-10 | Broad mixed-depth evaluation. |\n| `train` | custom, default 1-8 | General curriculum. |\n| `train_candidate_relative_flow_rule_tail_solve_depth2` | 2 | Current v0.2.56 oracle/SFT lane. |\n\nFor ordinary rollouts, the default move budget is `min(32, 2 * depth + 4)`.\nStaged one-action and two-action curricula often override this with a smaller\n`move_budget` to make the protocol exact.\n\n## Simulator Guarantees\n\nThe simulator tracks the Megaminx as facelet state:\n\n- 12 faces: `A` through `L`\n- 132 visible stickers\n- 12 fixed centers\n- 30 edge pieces\n- 20 corner pieces\n- 24 legal moves: each face clockwise or counterclockwise\n- five clockwise turns of any face return to identity\n- clockwise followed by counterclockwise returns to identity\n- every move preserves the sticker multiset\n- generated scrambles plus their inverse solve the puzzle\n\nThe dodecahedron topology is represented through face-neighbor rings and\nprogrammatic side strips. The tests assert the topology and move invariants so\nreward bugs cannot silently become puzzle-physics bugs.\n\n## Reward Styles\n\nThe default dense reward is:\n\n```text\n0.60 * solved\n+ 0.25 * sticker_accuracy\n+ 0.10 * piece_accuracy\n+ 0.05 * efficiency_if_solved\n```\n\nRL-facing rewards are action-gated so tool-free reasoning does not score well.\nThe important families are:\n\n| Reward style | Use |\n| --- | --- |\n| `dense` | General multi-turn simulator reward. |\n| `action_gated_dense` | Dense reward capped at zero until an action is taken. |\n| `action_gated_binary_direction` | Strict depth-1 eval: exactly one clean inverse `rotate` scores `1.0`. |\n| `action_gated_strict_shaped_direction` | Depth-1 training reward with partial credit for learnable wrong-but-informative actions. |\n| `action_gated_candidate_geometry_frontier` | Candidate selection using visible affected geometry. 
|\n| `action_gated_candidate_strict_frontier` | Candidate-relative flow reward for one-turn-frontier progress. |\n| `action_gated_candidate_path_solve` | Two-call candidate path that rewards solving after refreshed observation. |\n| `action_gated_candidate_path_tail_solve` | Current v0.2.56 hardened two-call reward. |\n\nv0.2.56 caps non-solving second-step tail reward below `0.50`, so a rollout\nmust actually solve to receive a high score.\n\nFull implemented reward-style registry:\n\n```text\ndense\naction_gated_dense\naction_gated_curriculum\naction_gated_overlap\naction_gated_direction\naction_gated_exact_direction\naction_gated_binary_direction\naction_gated_strict_shaped_direction\naction_gated_overlap_strict_shaped_direction\naction_gated_mask_overlap_strict_shaped_direction\naction_gated_counterfactual_frontier_strict\naction_gated_counterfactual_frontier_value_strict\naction_gated_predict_rotate_value_strict\naction_gated_predict_rotate_transition\naction_gated_face_discovery\naction_gated_face_tournament\naction_gated_candidate_tournament\naction_gated_candidate_index\naction_gated_candidate_mask_index_rank\naction_gated_candidate_mask_frontier_equivalence\naction_gated_candidate_geometry_frontier\naction_gated_candidate_strict_frontier\naction_gated_candidate_path_solve\naction_gated_candidate_path_tail_solve\n```\n\n## Prompt Styles\n\nRepresentative prompt styles:\n\n| Prompt style | Tool surface | Description |\n| --- | --- | --- |\n| `default` | `rotate`, `inspect`, `finish` | General text Megaminx puzzle. |\n| `action_first` | `rotate`, `inspect`, `finish` | General puzzle with stronger instruction to act before explanation. |\n| `stage_solve_direction_flow_json_action` | JSON/text `rotate` fallback | Served eval compatibility lane. |\n| `stage_solve_direction_flow_native_tool` | native `rotate` | Depth-1 Hosted Training native tool lane. |\n| `stage_solve_direction_flow_native_tool_v2` | native `rotate` | Depth 1-2 all-candidate solve-flow lane. 
|\n| `stage_candidate_geometry_frontier_native_tool` | native `select_candidate` | Clean candidate lane built from visible affected geometry. |\n| `stage_candidate_relative_flow_rule_frontier_native_tool` | native `select_candidate` | Candidate-relative flow with explicit counting rule. |\n| `stage_candidate_relative_flow_rule_solve2_native_tool` | native `select_candidate` twice | Current depth-2 path: choose candidate, observe refreshed table, choose again. |\n| `stage_predict_rotate_native_tool` | native `predict_rotate` | Predict local strips before rotating. |\n\nOlder prompt styles remain available for ablations and reproducibility, but the\ncurrent release should be evaluated through the v0.2.56 candidate-path lane.\n\nFull implemented prompt-style registry:\n\n```text\ndefault\naction_first\ndirect_json_action\nchoice_json_action\ntopology_choice_json_action\nsensor_choice_json_action\nsensor_match_json_action\nsensor_indexed_match_json_action\nsensor_candidate_strips_json_action\nstage_face_hint_direction_json_action\nstage_direction_flow_json_action\nstage_direction_flow_reasoned_json_action\nstage_solve_direction_flow_json_action\nnative_action\ntopology_native_tool\nsensor_native_tool\nsensor_match_native_tool\nstage_direction_flow_native_tool\nstage_solve_direction_flow_native_tool\nstage_solve_direction_flow_native_tool_v2\nstage_solve_action_table_native_tool\nstage_solve_action_mask_native_tool\nstage_frontier_sensor_native_tool\nstage_frontier_sensor_compact_native_tool\nstage_predict_rotate_native_tool\nstage_predict_transition_native_tool\nstage_face_discovery_native_tool\nstage_face_tournament_native_tool\nstage_candidate_tournament_native_tool\nstage_candidate_index_native_tool\nstage_candidate_scorecard_native_tool\nstage_candidate_scorecard_no_frontier_native_tool\nstage_candidate_scorecard_mask_native_tool\nstage_candidate_scorecard_mask_index_native_tool\nstage_candidate_scorecard_mask_frontier_equivalence_native_tool\nstage_candidate_geomet
ry_frontier_native_tool\nstage_candidate_relative_flow_frontier_native_tool\nstage_candidate_relative_flow_rule_frontier_native_tool\nstage_candidate_relative_flow_rule_solve2_native_tool\n```\n\n## Protocol Lanes\n\nThere are two distinct protocol lanes:\n\n| Lane | Use | Tool handling |\n| --- | --- | --- |\n| Native Hosted Training | Credible RL measurement | Model emits native `tool_calls`; text before a tool call is a protocol violation. |\n| Served eval compatibility | Historical/scaling evals | JSON/private-text fallback can be parsed into tool actions when native calls are unreliable. |\n\nThese lanes should not be mixed when making claims. Native tool-call metrics\nare the trusted RL surface.\n\n## Metrics\n\nThe environment reports outcome, action, protocol, and diagnostic metrics:\n\n| Group | Examples |\n| --- | --- |\n| Outcome | `solved_rate`, `sticker_accuracy`, `piece_accuracy`, `move_count`, `scramble_depth` |\n| Protocol | `native_tool_call_count`, `text_tool_action_count`, `private_text_action_count`, `tool_parse_error_count`, `tool_call_error_count`, `protocol_violation_count` |\n| Tool counts | `rotate_call_count`, `candidate_select_call_count`, `predict_rotate_call_count`, `inspect_call_count`, `finish_call_count` |\n| First action | `first_rotate_correct`, `first_rotate_face_correct`, `first_rotate_direction_correct`, `first_rotate_neighbor_overlap` |\n| Candidate path | `target_face_in_candidate_set`, `target_candidate_index`, `second_target_candidate_index`, `candidate_path_completed` |\n| Relative flow | `first_candidate_relative_flow_count`, `first_candidate_relative_flow_margin`, `first_candidate_relative_flow_is_candidate_max`, second-step variants |\n| Prediction | `first_prediction_strip_accuracy`, `first_prediction_exact_strip_count`, `first_prediction_char_accuracy`, `first_prediction_valid` |\n\nThese metrics make failures debuggable. 
For example, a run can have nonzero\nreward but fail because the model called text JSON instead of native tools,\nchose the right face with the wrong direction, selected a visible shortcut, or\nimproved online reward without improving heldout solves.\n\nThe rubric currently includes the primary reward plus sticker, piece, and\nefficiency functions, followed by 70 zero-weight diagnostic metric functions.\nThe exact metric registry lives in `_build_rubric(...)` in\n`megaminx_solver.py`; the groups above are the stable public reading guide.\n\n## Determinism And Hidden Metadata\n\nTask metadata stores:\n\n- scramble\n- inverse solution\n- split\n- example id\n- reward style\n- prompt style\n- hidden candidate seeds\n\nThis metadata is used by tests, rewards, oracle export, and metrics. It is not\nprinted in the current prompts. v0.2.56 specifically removes row-id leakage\nfrom the visible prompt and derives refreshed second-step candidate slots from\nhidden scramble/action metadata.\n\nThe v0.2.56 1,024-row oracle export is byte-stable when re-run with the same\narguments. Recorded SHA256:\n\n```text\n1038afa6958030832c028840dafc22fc3724206461608e2ed809c90fa9695e7b\n```\n\n## Leakage And Shortcut Controls\n\nThis environment went through multiple shortcut audits. 
Important fixes:\n\n- removed direct answer, scramble, inverse solution, winner, and scalar support\n  fields from prompts\n- separated native Hosted Training from served JSON fallback evals\n- removed public `cw_mask` / `ccw_mask` reward proxies from the clean geometry\n  lane\n- stopped seeding candidate slots directly from hidden solution metadata in the\n  clean lane\n- balanced second-step candidate target slots\n- removed visible example-id dependence from refreshed second-step slots\n- capped high reward for non-solving second-step tail actions\n- added oracle and SFT JSONL validators to check for forbidden payload fields\n\nHistorical scaffolded runs are kept for reproducibility, but v0.2.56 is the\ncurrent hardened baseline.\n\n## Current Evidence\n\n| Claim | Evidence |\n| --- | --- |\n| Package pushed | `setrf/megaminx-solver@0.2.57`, pending Hub push from this docs refresh |\n| Local tests | `uv run pytest -q` recorded `129 passed in 34.88s` |\n| Prompt breakthrough | v0.2.22 35B eval solved `0.929` on 240 heldout examples: <https://app.primeintellect.ai/dashboard/evaluations/vse6uoo8c9y156svyyv41qll> |\n| Clean native RL lane | v0.2.54 two-call candidate-path run reached online reward `0.7335`, solved `0.6615`, two native calls, zero env errors |\n| Best matched hosted heldout gain | v0.2.54 checkpoint improved solved from `0.5625` to `0.6048` on one heldout seed |\n| Local SFT warm-start signal | `Qwen/Qwen3.5-0.8B` local adapter improved from `0/32` to `18/32` heldout solves |\n| Not yet solved | Original `+30pp` hosted RL target was not reached before Prime billing/auth blocked further runs |\n\nThe full evidence trail is in the GitHub report:\n<https://github.com/setrf/megaminx-world-model-bench/blob/main/reports/megaminx-rl-report.md>.\n\n## Usage Recipes\n\nInstall the environment:\n\n```bash\nprime env install setrf/megaminx-solver@0.2.57 --plain\n```\n\nNo environment variables are required to import or load the environment. 
Hosted\nevals and training still require the normal Prime/model-provider credentials.\n\nFrom a checkout of the GitHub repository, smoke local tests:\n\n```bash\nuv run pytest -q\n```\n\nFrom a checkout of the GitHub repository, create the current oracle dataset:\n\n```bash\nuv run python scripts/export_oracle_trajectories.py \\\n  --num-examples 1024 \\\n  --seed 64 \\\n  --split train_candidate_relative_flow_rule_tail_solve_depth2 \\\n  --output /tmp/megaminx-oracle-v056-1024.jsonl\n```\n\nConvert and validate SFT data:\n\n```bash\nuv run python scripts/convert_oracle_to_sft_jsonl.py \\\n  /tmp/megaminx-oracle-v056-1024.jsonl \\\n  --output /tmp/megaminx-oracle-v056-1024-sft.jsonl\n\nuv run python scripts/validate_sft_jsonl.py \\\n  /tmp/megaminx-oracle-v056-1024-sft.jsonl\n```\n\nCheck readiness before hosted probes:\n\n```bash\nuv run python scripts/check_next_run_readiness.py\n```\n\nThe next hosted probe sequence is documented at:\n<https://github.com/setrf/megaminx-world-model-bench/blob/main/reports/megaminx-next-run-runbook.md>.\n\n## Known Caveats\n\n- The environment is text-first. It models physical structure symbolically; it\n  does not yet render images or vision observations.\n- Some historical runs used scaffolded prompts to discover a learnable\n  curriculum. Use v0.2.56 for the current hardened lane.\n- Native Hosted Training and served evals measure different protocol surfaces.\n- Prime visibility checks still reported `PRIVATE` after public push attempts,\n  despite owner-auth installs and Hub package pushes succeeding.\n- Hosted run creation later hit billing/auth limits, so the prepared v0.2.56\n  matched heldout probes remain to be run.\n\n## Version Notes\n\n| Version | Main change |\n| --- | --- |\n| v0.2.21 | Direction-flow depth-1 eval exposed the face/direction inversion trap. |\n| v0.2.22 | Candidate solving-action framing reached `0.929` solved with 35B on heldout depth-1. 
|\n| v0.2.28 | Native Hosted Training/TITO lane produced clean native tool-call probes. |\n| v0.2.47 | Candidate mask frontier-equivalence lane showed native checkpoint movement but exposed scaffold risk. |\n| v0.2.50 | Clean geometry candidate lane removed public mask reward proxies. |\n| v0.2.52 | Candidate-relative flow tokens replaced heavier scorecards. |\n| v0.2.53 | Rule-flow prompt added an explicit fair counting rule. |\n| v0.2.54 | Two-call depth-2 candidate path solved after refreshed observations. |\n| v0.2.55 | Tail-solve reward hardening and balanced second-step targets. |\n| v0.2.56 | Visible-id shortcut fix, second-step reward cap, deterministic oracle export, SFT warm-start path. |\n| v0.2.57 | Documentation and package metadata refresh for GitHub and Prime Hub. |\n","encoding":"utf-8","truncated":false,"total_bytes":20395},"status":null}