{"data":{"kind":"file","path":"README.md","version_id":"dgxtdstzx1iovkovp9xkcc9x","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":10381,"modified_at":"2026-03-16T08:41:21.845000","content_hash":"9609bc05aa4d2e55638ff7541629157bedde31140bb6152f7f55179d8f5face5"},"entries":[],"content":"# hangman_agent\n\nMinimal multi-turn Hangman for Prime/Verifiers. Each rollout starts from a fully hidden English word, shows the model a compact text board, and requires exactly one guess per turn via an OpenAI-style `suggest_letter(letter: str)` tool call.\n\n## Overview\n\n- Environment id: `hangman_agent`\n- Package path: `environments/hangman_agent/`\n- Base class: custom `vf.ToolEnv`\n- Local data: bundled 10000-word TSV lexicon tagged with `easy` / `medium` / `hard` (`hangman_agent/data/lexicon.tsv`), rebuilt with `scripts/build_lexicon.py`\n- Default dataset size: 128 train examples and 128 eval examples per resolved config\n- Default difficulty: `easy` for development-focused iteration\n- Package version: `0.2.11`\n\nRebuild the lexicon with:\n\n```bash\nuv run --with wordfreq python environments/hangman_agent/scripts/build_lexicon.py\n```\n\n## Task Contract\n\nThe assistant may emit optional reasoning text, but it must call `suggest_letter(letter: str)` exactly once per turn. 
The system prompt asks for minimal reasoning, and the first user message contains only the board state.\n\nRules:\n\n- `letter` must be exactly one ASCII alphabetic character\n- missing tool calls, malformed tool arguments, and non-letter payloads receive a small deterministic penalty\n- invalid tool usage leaves the board unchanged and re-sends it with explicit feedback\n- repeated guesses are accepted but count as wasted wrong guesses; the feedback says whether that letter was already known correct or wrong\n- reward never depends on assistant free-text content\n\nThe model sees a plain-text board like:\n\n```text\nword: _ P P _ E\nwrong letters: B, C, D, I, M\nhanged: 42%\n```\n\nThe hidden word is only revealed after termination.\n\n## Termination And Reward\n\nThe episode ends on the first of these conditions:\n\n- the word is fully revealed\n- the `hanged` percentage reaches 100%\n- too many invalid tool actions occur in one rollout\n\nTurn reward has four components:\n\n- fresh valid letter guess: `0.0` by default\n- progress reward: fraction of initially hidden positions revealed by the current guess\n- solved reward: `1.0` if the word is fully uncovered at termination, else `0.0`\n- invalid action penalty: `-0.05` for a malformed or missing tool call on a turn\n\nWith the defaults, a solved rollout therefore totals `2.0` before penalties: `1.0` cumulative progress plus `1.0` solved reward. Repeated guesses add no bonus, and each invalid tool action incurs the `-0.05` penalty above.\n\nPer-turn reward components are attached to trajectory extras for lightweight debugging.\n\n## Generation\n\nTasks are generated deterministically from a curated local English lexicon with an explicit difficulty tag on each word. 
Every game starts from a fully hidden word with no pre-filled correct or wrong letters.\n\nGeneration controls:\n\n- difficulty tags (`easy`, `medium`, `hard`)\n- fixed wrong-guess budget of `12` per rollout\n- optional `difficulty_mix` weights in `easy,medium,hard` order for mixed-difficulty datasets\n- deterministic `seed`\n\nFor mixed datasets, the generator allocates exact per-difficulty counts from the requested mix, samples directly from the tagged lexicon pools, and shuffles the merged records.\n\nEach generated state is validated to ensure the board is not already solved.\n\n## Quickstart\n\nInstall the local environment:\n\n```bash\nprime env install hangman_agent\n```\n\nRun a local smoke load:\n\n```bash\nuv run python -c \"from hangman_agent import load_environment; env = load_environment(difficulty='easy', seed=1, num_examples=2); print(type(env).__name__)\"\n```\n\n## Running Evals\n\n### Hosted eval with `gpt-4.1-mini`\n\nFrom the workspace root, load your keys and run a small smoke eval against OpenAI:\n\n```bash\nset -a; source secrets.env >/dev/null 2>&1\n\nprime eval run hangman_agent \\\n  -m gpt-4.1-mini \\\n  -b https://api.openai.com/v1 \\\n  -k OPENAI_API_KEY \\\n  -n 6 \\\n  -r 2 \\\n  -a '{\"difficulty_mix\":[0.3,0.4,0.3]}' \\\n  -C 'termination_reason,last_outcome,total_reward,rollout_trace' \\\n  -s \\\n  --skip-upload \\\n  -d\n```\n\nThis uses the mixed-difficulty generator added in `0.2.x` and saves full rollout traces locally.\n\nIf you prefer a single preset, swap the env args for something like `-a '{\"difficulty\":\"easy\"}'`.\n\n### Eval with a locally hosted model\n\nFor local testing, use the helper script. 
It starts the model server, waits for it to become ready, runs `prime eval run`, and then shuts the server down.\n\nRecommended smoke eval with `mlx-lm`:\n\n```bash\nLOCAL_LLM_API_KEY=dummy \\\nuv run python -m hangman.local_eval \\\n  --backend mlx-lm \\\n  --model mlx-community/Qwen3.5-0.8B-MLX-4bit \\\n  --difficulty easy \\\n  --num-examples 6 \\\n  --rollouts-per-example 2 \\\n  --max-concurrent 1\n```\n\nNotes:\n\n- `LOCAL_LLM_API_KEY=dummy` is only there because the Prime CLI expects an API key variable even for local servers.\n- The first `mlx-lm` run can take a while because it may need to download model weights before `/v1/models` is ready.\n- To test a mixed dataset instead of a single preset, replace `--difficulty easy` with `--difficulty-mix '[0.3, 0.4, 0.3]'`.\n\nIf you already have an OpenAI-compatible local server running, use the regular eval command instead.\n\nExample with a local vLLM server:\n\n```bash\nexport LOCAL_VLLM_MODEL=\"Qwen/Qwen3-30B-A3B-Instruct-2507\"\nexport LOCAL_VLLM_BASE_URL=\"http://127.0.0.1:8000\"\nexport LOCAL_VLLM_API_KEY=\"token\"\n\nprime eval run configs/eval/hangman-vllm.toml\n```\n\nThat config reads `configs/endpoints.vllm.py`, which normalizes a bare host like `http://127.0.0.1:8000` to `/v1` for you.\n\nIf you want the fully explicit one-off command instead of the config file:\n\n```bash\nprime eval run hangman_agent \\\n  -m \"$LOCAL_VLLM_MODEL\" \\\n  -b \"http://127.0.0.1:8000/v1\" \\\n  -k LOCAL_VLLM_API_KEY \\\n  -n 6 \\\n  -r 2 \\\n  -a '{\"difficulty_mix\":[0.3,0.4,0.3]}'\n```\n\n`mlx-lm` is the easiest local backend in this workspace because it is already included in the root project dependencies. 
The helper also supports `--backend vllm` if the `vllm` CLI is installed in your environment.\n\n### Full eval to Hub\n\n```bash\nset -a; source secrets.env >/dev/null 2>&1\n\nuv run prime eval run hangman_agent \\\n  --model gpt-4.1-mini \\\n  --api-base-url https://api.openai.com/v1 \\\n  --api-key-var OPENAI_API_KEY \\\n  --num-examples 25 \\\n  --rollouts-per-example 4 \\\n  --env-args '{\"difficulty_mix\":[0.3,0.4,0.3]}' \\\n  --state-columns 'termination_reason,last_outcome,total_reward,rollout_trace' \\\n  --save-results \\\n  --tui\n\nRUN_DIR=\"$(ls -td environments/hangman_agent/outputs/evals/hangman_agent--gpt-4.1-mini/* | head -n1)\"\nprime eval push \"$RUN_DIR\"\n\n# push env again to hub\n# prime env push --path environments/hangman_agent --visibility PRIVATE\n```\n\n## Inspecting Rollouts\n\nSaved eval outputs include the full prompt/completion conversation in `results.jsonl`, including assistant `tool_calls`, tool responses, and the board updates for each turn. The most useful state columns are `termination_reason`, `last_outcome`, `total_reward`, and `rollout_trace`.\n\nAfter a saved eval, inspect the newest output directory:\n\n```bash\nfind environments/hangman_agent/outputs/evals \\( -name results.jsonl -o -name metadata.json \\) | tail\n```\n\nOpen the rollout file directly:\n\n```bash\nless environments/hangman_agent/outputs/evals/.../results.jsonl\n```\n\nOr extract a compact summary with `jq`:\n\n```bash\njq '{reward, termination_reason, last_outcome, total_reward}' environments/hangman_agent/outputs/evals/.../results.jsonl\n```\n\nIf you want the richest traces, make sure your eval command includes:\n\n```bash\n-C 'termination_reason,last_outcome,total_reward,rollout_trace' -s\n```\n\nThat adds rollout-level state fields to the saved records so you can inspect how the board evolved turn by turn.\n\n## Environment Arguments\n\n`load_environment(...)` accepts these knobs:\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | 
----------- |\n| `difficulty` | `str` | `\"easy\"` | High-level preset for generation constraints. |\n| `difficulty_mix` | `Sequence[float] \\| str \\| None` | `None` | Optional easy/medium/hard mixture weights. A value like `[0.3, 0.4, 0.3]` yields a dataset with that proportional split, normalized if needed. This cannot be combined with manual generation-range overrides. |\n| `seed` | `int` | `0` | Base seed for deterministic train/eval task generation. |\n| `num_examples` | `int` | `128` | Number of examples to generate per split. |\n| `word_length_min` / `word_length_max` | `int \\| None` | preset | Override word-length range. |\n| `frequency_tiers` | `Sequence[str] \\| str \\| None` | preset | Allowed lexicon tiers. |\n| `repeat_density_min` / `repeat_density_max` | `float \\| None` | preset | Override repeated-letter density range. |\n| `allowed_attempts_min` / `allowed_attempts_max` | `int \\| None` | preset | Override the wrong-guess budget before the hang reaches 100%. |\n\n## Metrics\n\nThe rubric emits:\n\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Total rollout reward, computed from the accumulated turn rewards. |\n| `solved_metric` | `1.0` when the puzzle is solved, else `0.0`. |\n| `invalid_outputs_metric` | Count of invalid or missing tool actions. |\n| `repeated_guesses_metric` | Count of repeated guesses. |\n| `positions_revealed_metric` | Total number of positions revealed over the rollout. 
|\n\n## Local Validation\n\nCompleted on 2026-03-10:\n\n```bash\nprime env install hangman_agent\nuv run python -m unittest discover -s environments/hangman_agent/tests -v\nset -a; source secrets.env >/dev/null 2>&1; prime eval run hangman_agent -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 6 -r 2 -a '{\"difficulty\":\"easy\"}' -C 'termination_reason,last_outcome,total_reward' -s --skip-upload -d\nset -a; source secrets.env >/dev/null 2>&1; prime eval run hangman_agent -m gpt-5-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 6 -r 2 -a '{\"difficulty\":\"hard\"}' -C 'termination_reason,last_outcome,total_reward' -s --skip-upload -d\n```\n\nNote: `prime eval run ... -e configs/endpoints.toml -m <endpoint_id>` did not resolve the OpenAI endpoint aliases in this session and instead fell back to the default Pinference base URL. The successful smoke evals therefore passed `-b https://api.openai.com/v1 -k OPENAI_API_KEY` explicitly.\n\nThe local vLLM helper added in `configs/endpoints.vllm.py` was verified by loading the endpoint registry directly, but no local vLLM server was available in this session to run a live eval against `http://127.0.0.1` or another self-hosted endpoint.\n","encoding":"utf-8","truncated":false,"total_bytes":10381},"status":null}