{"data":{"kind":"file","path":"README.md","version_id":"qangqxlvila6ohp28dxvarjm","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3133,"modified_at":"2025-12-18T23:28:01.617000","content_hash":"1942462e90213995eef25e8b913691fdf55864e7c3697997c51353ad40add6b7"},"entries":[],"content":"# Hanabi Mycroft+ Offline Environment\n\nThis variant packages a 4-player Mycroft-style Hanabi dataset for offline training. It keeps the original dynamic environment as a fallback but is optimized for SingleTurn RLVR on the bundled JSONL with move ratings and programmatic deduction supervision.\n\n## Quick Start\n\n```bash\n# 1) Install (from this repo checkout)\nuv sync\nuv run prime env install environments/hanabi_mycroft_plus\n\n# 2) Launch RL (uses the included configs)\nuv run rl \\\n  --trainer @ environments/hanabi_mycroft_plus/examples/prime_rl/train.toml \\\n  --orchestrator @ environments/hanabi_mycroft_plus/examples/prime_rl/orch.toml \\\n  --inference @ environments/hanabi_mycroft_plus/examples/prime_rl/infer.toml \\\n  --wandb.project hanabi_mycroft_plus \\\n  --wandb.name mycroft_plus_run\n```\n\nThe orchestrator already points to `environment.id = \"hanabi_mycroft_plus\"` with `mode = \"mycroft_plus\"` and `use_dataset = true`.\n\n## Dataset (bundled)\n\n`data/hanabi_training_data_4p.jsonl` (585 rows, ~18 MB). Each line has four fields:\n- `user_prompt_training`: full Mycroft-style prompt (including history and legal moves)\n- `response`: reference model answer (kept for inspection; not used for reward)\n- `move_ratings`: map from move index to rating in `[-1, 1]`\n- `programmatic_deduction`: ground-truth deduction block (per player, per card)\n\n## Rewards\n\nOffline runs are SingleTurn. 
The rubric combines:\n1) **Move rating**: the chosen action’s rating, transformed by `dataset_reward_mode`  \n   - `clip` (default): `max(0, rating)`  \n   - `raw`: unchanged `[-1, 1]`  \n   - `normalize`: `(rating + 1) / 2`  \n2) **Deduction accuracy**: per-card string match between the model’s `deduction` block and `programmatic_deduction` (whitespace-insensitive).\n\nFinal reward: `(1 - deduction_weight) * move_reward + deduction_weight * deduction_accuracy` with `deduction_weight` defaulting to `0.5`.\n\n## Environment Args\n\n| Argument | Default | Description |\n|---|---|---|\n| `mode` | `\"mycroft_plus\"` | Mycroft-style prompts; aliases to `\"mycroft\"` in dynamic mode |\n| `use_dataset` | `true` | Toggle offline dataset vs. dynamic games |\n| `dataset_path` | packaged JSONL | Override to point at another file with the same columns |\n| `dataset_reward_mode` | `\"clip\"` | Transform move ratings (`clip`, `raw`, `normalize`) |\n| `deduction_weight` | `0.5` | Weight on deduction accuracy (0–1) |\n| `num_games`, `seeds`, `max_turns`, `judge_model`, `engine_random_start_player` | | Dynamic mode only |\n\n`dataset_model_names` is accepted for API compatibility but ignored for this dataset.\n\n## Dynamic Mode (optional)\n\nSet `use_dataset=false` to fall back to the original multi-turn Hanabi loop (Watson/Sherlock/Mycroft prompts, live judge support). `mode=\"mycroft_plus\"` is treated the same as `mode=\"mycroft\"` there.\n\n## What to Watch\n\n- The deduction reward expects a complete `deduction` JSON block; missing cards/players lower the score.\n- Move ratings are keyed by move index; ensure the parser extracts `action` correctly.\n- The bundled dataset is 4-player; dynamic runs can use any player count, but offline prompts assume four players.\n","encoding":"utf-8","truncated":false,"total_bytes":3133},"status":null}