{"data":{"kind":"file","path":"README.md","version_id":"bqhq9hw5ixr9qxglhgrummhi","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3558,"modified_at":"2026-03-04T16:15:54.021000","content_hash":"a2fa2c38e0e5db07dd9286865593f5c0b34a14a20b74d5160bdce56e5ea9bb18"},"entries":[],"content":"# twenty-questions\n\n### Overview\n- **Environment ID**: `twenty-questions`\n- **Description**: A multi-turn 20 Questions game. An LLM agent identifies a secret English noun by asking up to 20 yes/no questions, answered by an LLM oracle. Wrong guesses count as a turn. Correct guesses end the episode with a reward scaled by efficiency.\n\n### Dataset\n- **Pool**: 8,155 English nouns from WordNet, filtered by Brysbaert concreteness ratings and Zipf word frequency\n- **Tiers** (percentile-based on difficulty, ascending — tier 1 = easiest):\n  - Tier 1 — easiest 1%: ~81 words — baby, eyes, card, road, door, daughter, neck, engine, bill, moon\n  - Tier 2 — 1st–5th pct: ~326 words — mother, branch, cotton, diamond, horn, tower, oven, chick, knight, pope\n  - Tier 3 — 5th–10th pct: ~408 words — fossil, shorts, flood, canyon, elevator, bulb, coconut, vaccine\n  - Tier 4 — 10th–20th pct: ~816 words — rhino, condo, cologne, dairy, electronics, lasagna, swimsuit\n  - Tier 5 — rest (> 20th pct): ~6,524 words — machete, psychiatrist, wasabi, alternator, underpass, featherweight\n\n### Task\n- **Parser**: `XMLParser` with `<question>` and `<guess>` fields\n- **Oracle**: separate LLM that knows the secret word and answers each question with `Yes | No | Sometimes | Unclear`\n- **Episode end**: correct guess OR 20 turns exhausted\n\n### Reward\n```\ncorrect guess at turn t:  1.0 + 0.5 * (20 - t) / 19   →  [1.0, 1.5]\nwrong guess / timeout:    0.0\n```\nReward is **sparse and episode-level only** — no per-turn shaping. 
This is intentional: the only training signal is the final outcome, so the policy must work out which questions helped, which makes credit assignment across turns harder.\n\n### Quickstart\n```bash\nprime eval run twenty-questions\n```\n\nPlayer and oracle are **independently configurable** — use any model for each:\n```bash\n# Same model for both (OPENAI_API_KEY used for player and oracle)\nprime eval run twenty-questions \\\n  -m gpt-4.1-mini \\\n  -a '{\"tier\": 1, \"oracle_model\": \"gpt-4.1-mini\"}'\n\n# Different models: strong oracle, weaker player being trained\nprime eval run twenty-questions \\\n  -m gpt-4o-mini \\\n  -a '{\"tier\": 1, \"oracle_model\": \"gpt-4.1\"}'\n\n# Local player (vLLM/LM Studio), hosted oracle\nprime eval run twenty-questions \\\n  -m my-local-model \\\n  -b \"http://localhost:8000/v1\" \\\n  -a '{\"tier\": 2, \"oracle_model\": \"gpt-4.1-mini\", \"oracle_base_url\": \"https://api.openai.com/v1\"}'\n```\n\n### Environment Arguments\n| Argument | Type | Default | Description |\n|---|---|---|---|\n| `tier` | int | `1` | Word difficulty tier (1–5): 1=easiest (~81 words), 5=hardest (~6,524 words) |\n| `oracle_model` | str | `\"gpt-4.1-mini\"` | Oracle LLM — answers yes/no questions. Independent of player model (`-m`). 
|\n| `oracle_base_url` | str | `\"https://api.openai.com/v1\"` | Oracle API base URL (use any OpenAI-compatible endpoint) |\n| `oracle_api_key_var` | str | `\"OPENAI_API_KEY\"` | Env var holding the oracle API key |\n| `num_train_examples` | int | `2000` | Words sampled for training |\n| `num_eval_examples` | int | `50` | Words sampled for evaluation |\n| `system_prompt` | str | `DEFAULT_SYSTEM_PROMPT` | System prompt for the player agent |\n| `seed` | int | `0` | Random seed |\n\n### Metrics\n| Metric | Meaning |\n|---|---|\n| `reward` | Episode reward (0.0 or 1.0–1.5) |\n| `win_rate` | Fraction of episodes that end with a correct guess |\n| `avg_questions` | Average turns used per episode |\n| `efficiency_bonus` | Average `reward - 1.0` for wins (0 = correct guess on turn 20, 0.5 = correct guess on turn 1) |\n\n### See Also\n[MiniMax plays 20 Questions](https://huggingface.co/spaces/echoboi/minimax2-1-plays-20-questions)\n","encoding":"utf-8","truncated":false,"total_bytes":3558},"status":null}