{"data":{"kind":"file","path":"README.md","version_id":"diw2huyvnmxducbu08f6nb6t","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":8562,"modified_at":"2025-09-16T22:32:56.771000","content_hash":"bf8180baa8931207e7086c1935c994952f9615e2c3210946e2d9437da70f3066"},"entries":[],"content":"# Mastermind\n\n**Source Code Repository:** [https://github.com/peterN908/mastermind](https://github.com/peterN908/mastermind)\n\nA Mastermind game simulator. It enables evaluating models in multi-step (full playthrough) and single-step (single move) environments. The model goes back and forth guessing the code and receiving feedback on how many digits are present or correct.\n\n### The concept\n\nI was interested in this initially because it's easy to tune the complexity of the task by adjusting the Code length (L), Alphabet size (K) and history.\n\nDifferent setups are available, but the main idea is that we reward the model based on how close its move is to the \"perfect\" decision (based on information gain) in a single-step environment.\n\nFinally we evaluate on full multi-step playthroughs. \n\n\n### Results\n\nThis turned out to be a very interesting way to discriminate model capabilities. Qwen Next 80B A3B was shockingly good with a 54% success rate - twice as high as the next best (also Qwen).\n\nOn Qwen 2.5-7B the best way to improve performance seems to be to train on the same difficulty you are assessing on. After ~300 steps of GRPO the success rate increases from 0.005% to 2% in our standard L=4, K=6 multi-step eval.  \n\nLooking at how performance varies with Code size (L), Alphabet size (K) and History (H), Code size appears to be the only knob that predictably increases the difficulty (as measured by relative information gain). This is because when the Alphabet size is large, or the history is short, it's very easy to make a perfect information-gaining move by just guessing a code that hasn't been seen yet. 
\n\n \n### Overview\n- **Environment ID**: `mastermind`\n- **Short description**: Single‑turn and multi‑turn Mastermind. Single‑turn rewards closeness to the best information‑gain move; multi‑turn evaluates speed to solve.\n- **Tags**: mastermind, games, single-turn, multi-turn, eval, rl\n\n### Datasets\n- **Primary dataset**: Programmatically generated prompts with optional prior history (guess/feedback pairs). Histories are derived from a hidden code to ensure consistency.\n- **Size**: Controlled by `max_examples` (defaults to 200 if unspecified). Randomness is seeded via `seed`.\n\n### Task\n- **Types**: single‑turn (chat) and multi‑turn solve (chat)\n- **Parser**: `XMLParser` expecting `<answer>GUESS: a b c d</answer>`\n- **Rubric overview**:\n  - Single‑turn: ig_relative (default) or ig/elim; plus a small format bonus\n  - Solve: speed to solve (normalized), plus solved/turns metrics\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval mastermind\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval mastermind \\\n  -m gpt-4.1-nano \\\n  -n 50 -r 1 -t 2048 -s \\\n  -a '{\"mode\":\"solve\",\"L\":4,\"K\":6,\"allow_repeats\":true,\"seed\":99,\"curriculum\":false}'\n```\n\nNotes:\n- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n- The model must reply with the guess inside `<answer>...</answer>` as `GUESS: a b c d`.\n\n### Multi‑turn Solve Mode (default)\nRun a full Mastermind game until solved (or turn cap):\n\n```bash\nuv run vf-eval mastermind -a '{\"mode\":\"solve\", \"L\":4, \"K\":6, \"allow_repeats\":true, \"solve_max_turns\":12}'\n```\n\nMetrics in solve mode:\n- `reward` (speed): normalized in (0,1] when solved (faster → higher), else 0.\n- `solved_metric`: 1 if solved, 0 otherwise.\n- `turns_metric`: number of valid guesses taken to solve.\n\nNotes:\n- Default mode is `solve`. 
For single-turn scoring, use `-a '{\"mode\":\"single\", ...}'`.\n- Default single-turn reward is ig_relative (closeness to best IG). Use `-a '{\"reward_mode\":\"ig\"}'` for raw IG.\n\n### Random Curriculum\nEnable per‑example randomization of difficulty in single‑turn mode:\n\n```bash\nuv run vf-eval mastermind \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 1 -t 512 -T 0.5 -s \\\n  -a '{\"mode\":\"single\", \"curriculum\": {\"mode\": \"random\"}}'\n```\n\nDetails:\n- Randomizes `(L, K, history_len)` per example within bounds.\n- Optional bounds: `{L_min,L_max,K_min,K_max,H_min,H_max}`. If `k_equals_l_plus_2=true`, tie `K` to `L+2` within bounds.\n\nCurriculum modes (single‑turn):\n- `random`: Per‑example randomize L/K/H within bounds; optional `k_equals_l_plus_2=true` ties K to L.\n\nPerformance caps for `ig_relative` (to avoid O(|S_H|^2) blowups):\n- `relative_sh_cap` (default 5000): cap the consistent set `S_H` used for scoring.\n- `relative_candidate_cap` (default 1000): cap the number of candidate guesses used to find the best IG in the denominator.\n- Tip: lowering these caps speeds up runs but makes ig_relative noisier; raise for fidelity if compute allows.\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `L` | int | `4` | Code length |\n| `K` | int | `6` | Alphabet size (symbols `0..K-1`) |\n| `allow_repeats` | bool | `true` | Allow repeated symbols in codes/guesses |\n| `reward_mode` | str | `\"ig_relative\"` | `\"ig_relative\"` (ratio to best IG), `\"ig\"` (information gain, bits), or `\"elim\"` (normalized elimination) |\n| `max_space_enum` | int | `200000` | Enumerate all codes if `K^L <= max_space_enum`, else sample |\n| `sample_n` | int | `10000` | Sample size when approximating large code spaces |\n| `history_len` | int | `0` | Number of random (guess, feedback) items to include in the prompt |\n| `history` | list | `null` | If provided, a fixed history overrides `history_len` (list of 
`{guess:[...], feedback:[b,w]}`) |\n| `max_examples` | int | `200` | Number of examples to generate |\n| `seed` | int | `null` | Seed for reproducibility |\n| `use_think` | bool | `true` | If true, allow `<think>...</think>` before `<answer>` in prompt guidance |\n| `mode` | str | `\"solve\"` | `\"single\"` (one guess, scored) or `\"solve\"` (multi‑turn to solution) |\n| `solve_max_turns` | int | `12` | Max valid guesses in solve mode |\n| `relative_pool` | str | `\"consistent\"` | Candidate pool for `ig_relative`: `consistent` or `all` (sampled if large) |\n| `relative_sh_cap` | int | `5000` | Cap on consistent set size for `ig_relative` scoring |\n| `relative_candidate_cap` | int | `1000` | Cap on number of candidate guesses for best-IG search |\n| `curriculum` | bool/dict | `false` | Enable per‑example randomization in single‑turn mode; dict accepts `{mode:\"random\",L_min,L_max,K_min,K_max,H_min,H_max,k_equals_l_plus_2}` |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Main scalar (ig_relative/ig/elim or speed) |\n| `format_reward` | Format bonus for correct `<answer>` tag usage |\n| `sh_size_metric` | Size of the consistent set `|S_H|` (logged, weight 0) |\n\nReward computation (`reward_mode`):\n- `elim`: `1 - sum_y p_y^2`, where `p_y` is the fraction of consistent codes that would yield feedback `y` for the guess.\n- `ig`: `-sum_y p_y log2 p_y` (bits). For very large spaces, both use an empirical estimate over `sample_n` draws.\n- `ig_relative`: `IG(guess) / max_g IG(g)` over a candidate pool. Defaults to the consistent set; change with `relative_pool`.\n\nAdditional args for `ig_relative`:\n- `relative_pool`: `\"consistent\"` (default) to compare against the best consistent code; or `\"all\"` to compare against the best code in the entire space (sampled when large).\n\n### Prompt Format\nEach prompt succinctly states the alphabet, code length, whether repeats are allowed, and any prior (guess,feedback) history. 
The model must respond with:\n\n```\n<think>Optional short reasoning</think>\n<answer>GUESS: a b c ...</answer>\n```\n\nWhere each of `a`, `b`, `c`, ... is an integer in `0..K-1` and the number of symbols equals `L`.\n\n### Example Prompt (L=4, K=6)\n\n```\nTASK: Mastermind (single-turn)\nAlphabet: 0..5 (K=6), Code length: 4\nHistory count: 1\nHistory:\n- Guess 0 1 2 3 -> feedback b=1, w=1\nCONSTRAINTS: repeats allowed\nACTION FORMAT: Put the guess inside <answer>...</answer> exactly as 'GUESS: a b c d'.\n(The example adapts to L — for L=3 it will show 'GUESS: a b c'.)\n```\n\n### Evaluation Reports\nSaved reports generated by `vf-eval` will auto-render here when published to a static site.\n\n\n### Scripts\n\n- CLI play (single-turn scoring):\n  - `mastermind-play`: interactive prompt with optional top‑K suggestions by reward.\n  - Example: `uv run mastermind-play --L 4 --K 6 --history-len 1 --suggest-top 5`\n\n- Grid analysis (ig_relative heatmaps and raw completions):\n  - `python environments/mastermind/analysis/mastermind_grid_eval.py -m gpt-4.1-nano -n 10 -r 1 --L-min 3 --L-max 7 --K-min 3 --K-max 9 --H-min 0 --H-max 5 --save-completions`\n  - Flags: `--rel-sh-cap`, `--rel-cand-cap` to bound compute; `--no-plot` to skip heatmaps.\n","encoding":"utf-8","truncated":false,"total_bytes":8562},"status":null}