{"data":{"kind":"file","path":"README.md","version_id":"scgsxlzrpd6ayh7t9u0vdr2j","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3890,"modified_at":"2025-09-07T20:04:25.648000","content_hash":"72b51067c7b76ef5e296afcbd192ffb2dd1066a2d0dc8456fc4fd373766f03c8"},"entries":[],"content":"# nyt_spelling_bee\n\n### Overview\n- **Environment ID**: `nyt_spelling_bee`\n- **Short description**: NYT Spelling Bee word discovery game where models find valid words using 7 letters with one required center letter\n- **Tags**: games, single-turn, reasoning, word-game, puzzles, eval, text, nlp, qa\n\n### Datasets\n- **Primary dataset(s)**: Open source NYT Spelling Bee puzzles (3,377 puzzles)\n- **Source links**: [spelling-bee repository](https://github.com/ConorSheehan1/spelling-bee) \n- **Split sizes**: Configurable via `max_examples` parameter (default: 100 puzzles)\n\n### Task\n- **Type**: single-turn\n- **Parser**: `ThinkParser` (with thinking) or `Parser` (direct) with custom word list extraction\n- **Rubric overview**: Word discovery rate, pangram bonuses, accuracy (precision), and format compliance\n\n### Game Rules\nThe NYT Spelling Bee presents players with:\n- **7 letters**: 6 outer letters + 1 center letter (required in every word)\n- **Minimum word length**: 4 letters\n- **Letter reuse**: Allowed (e.g., \"deed\" uses \"e\" twice)\n- **Goal**: Find as many valid words as possible\n- **Bonus**: Pangrams (words using all 7 letters) earn extra points\n\n### Quickstart\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval nyt-spelling-bee\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval nyt-spelling-bee \\\n  -m gpt-4.1-mini \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"max_examples\": 500, \"use_think\": true}'\n```\n\nTest different configurations:\n\n```bash\n# Direct reasoning (no <think> tags)\nuv run vf-eval nyt-spelling-bee -a '{\"use_think\": false}'\n\n# Fixed puzzle set (no randomization)\nuv run vf-eval 
nyt-spelling-bee -a '{\"shuffle\": false}'\n\n# Different random seed\nuv run vf-eval nyt-spelling-bee -a '{\"seed\": 123}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `max_examples` | int | `100` | Number of puzzles to include in dataset |\n| `use_think` | bool | `True` | Use thinking mode with `<think>` tags (`ThinkParser`) or direct reasoning (`Parser`) |\n| `shuffle` | bool | `True` | Randomize puzzle selection for variety |\n| `seed` | int | `42` | Random seed for reproducible puzzle shuffling |\n\n### Scoring System\n\n**Simplified Penalty-Based Scoring (Range: 0.0 - 1.0)**\n\nThe reward function uses a penalty system with the total number of valid words as the denominator:\n\n- **+2 points**: For each valid pangram (word using all 7 letters)\n- **+1 point**: For each other valid word found\n- **-1 point**: For each word with rule violations (only one penalty per word):\n  - Uses letters not in the available set\n  - Doesn't contain the center letter\n  - Is fewer than 4 letters long\n- **0 points**: For words that follow rules but aren't in the valid word list\n\n**Formula**: `reward = max(0, min(numerator, total_valid_words)) / total_valid_words`, where `numerator` is the sum of the points above\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `spelling_bee_reward` | Penalty-based scoring with rule violations (0.0-1.0) |\n\n### Example Interaction\n\n**Input Puzzle:**\n```\nLetters: E-F-I-M-O-S (center: D)\nFind words using these letters. 
The center letter must be used in every word.\n```\n\n**Expected Response Format:**\n```\n<think>\nI need to find words using D, E, F, I, M, O, S where D must be in every word.\nLet me think of 4+ letter words:\n- deed (uses D, E)\n- demo (uses D, E, M, O)  \n- dime (uses D, I, M, E)\n- dose (uses D, O, S, E)\n...\n</think>\n\n<words>deed, deeds, deem, deemed, demo, demos, died, dies, dime, dimes, dose, dosed</words>\n```\n\n**Scoring Example:**\n```\nLLM Response: deed, demo, modifies, xyz, feds\nValid Words: 50 total in puzzle, including pangram \"modifies\"\n\nAnalysis:\n- deed: +1 (valid word)\n- demo: +1 (valid word)  \n- modifies: +2 (valid pangram - uses all 7 letters)\n- xyz: -1 (uses invalid letters x,y,z)\n- feds: 0 (follows rules but not in valid word list)\n\nNumerator: 1 + 1 + 2 - 1 + 0 = 3\nFinal Score: 3/50 = 0.060\n```\n\n","encoding":"utf-8","truncated":false,"total_bytes":3890},"status":null}