{"data":{"kind":"file","path":"README.md","version_id":"yhuaerloh8yhzokatefj8b6i","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4772,"modified_at":"2026-03-09T18:40:47.487000","content_hash":"45d8f1420caaf3a179f0da74d7bbc6cfa520c477cf228d39f9522b1f7363e733"},"entries":[],"content":"# connections\n\n### Overview\n- **Environment ID**: `lswamina/connections`\n- **Short description**: NYT Connections word-grouping puzzle game — find 4 groups of 4 related words from 16 shuffled words.\n- **Tags**: multi-turn, game, word-puzzle, reasoning, train, eval\n\n### Datasets\n- **Primary dataset**: 915 NYT Connections puzzles (June 2023 – December 2025), scraped from public puzzle archives.\n- **Synthetic dataset**: 107 additional puzzles generated via a two-stage LLM pipeline (see below).\n- **Split sizes**: ~735 train puzzles (pre-2025, 4 shuffles each → 2940 examples) / 365 eval puzzles (2025 puzzles, 1 shuffle each)\n- **Eval split**: held-out 2025 puzzles only — never seen during training.\n\n### Synthetic Puzzle Generation\n\nTo augment the training set beyond the ~735 real pre-2025 puzzles, a generate → verify → fix pipeline was built:\n\n1. **Generate** (`scripts/generate_puzzles.py`): Claude Haiku generates candidate puzzles few-shot from real examples drawn from the existing dataset (RAG-style context). Each candidate has 4 groups of 4 words across Yellow/Green/Blue/Purple difficulty levels, with a descriptive category name and an explanation of what connects each group.\n\n2. **Verify** (`scripts/generate_puzzles.py` + `scripts/verify_puzzles.py`): Claude Sonnet validates each generated puzzle, checking each group for:\n   - **Clarity** — the category name unambiguously describes the connection\n   - **Exclusivity** — no word plausibly belongs to multiple groups\n   - **Accuracy** — all words genuinely fit the stated category (including wordplay/prefix/suffix categories)\n   - Structural checks: exact 4 words per group, no duplicates, no overlap with existing puzzles\n\n3. **Fix** (`scripts/fix_puzzles.py`): Puzzles that fail verification are repaired by Claude Sonnet (bad groups swapped out, words corrected) rather than discarded, improving yield.\n\nPassing puzzles are appended to `synthetic_puzzles.csv` and automatically included in training builds.\n\n### Task\n- **Type**: multi-turn\n- **Output format**: One guess per turn inside XML tags: `<guess>WORD1, WORD2, WORD3, WORD4</guess>`, preceded by a brief reasoning in `<reason>` tags.\n- **Max turns**: 16 (allows up to 8 full guess rounds)\n- **Rubric overview**:\n  - `difficulty_weighted_reward` (primary): sum of `(level + 1) / 10` for each found group, minus 0.1 per mistake. Max = 1.0 (all 4 groups, 0 mistakes). Yellow=0.1, Green=0.2, Blue=0.3, Purple=0.4.\n  - `mistakes_used_metric`: number of mistakes made (0–4)\n  - `groups_found_metric`: integer count of groups found (0–4)\n  - `avg_difficulty_solved_metric`: average difficulty level of solved groups\n  - `filter/gibberish`: fraction of responses flagged as gibberish\n  - `filter/repetition`: fraction of responses flagged as repetitive\n\n### Game Rules\n- 16 words form exactly 4 groups of 4 related words\n- Difficulty levels: Yellow (0, easiest) → Green (1) → Blue (2) → Purple (3, trickiest)\n- 4 mistakes allowed before game over\n- Environment responds with one of:\n  - `\"Correct! [Category]\\nRemaining words (N): ...\"` on a correct guess\n  - `\"Incorrect. X mistakes remaining.\\nCurrent words (N): ...\"` on a wrong guess\n  - `\"Incorrect. One away! X mistakes remaining.\"` when 3 of 4 guessed words match a real group\n  - `\"Congratulations! You found all 4 groups in M mistakes. Puzzle solved!\"` on win\n  - `\"Game over! You found K/4 groups.\"` on loss\n\n### Quickstart\n\n```bash\nprime eval run lswamina/connections -m gpt-4.1-mini -n 20 -r 3\n```\n\nEval split only:\n```bash\nprime eval run lswamina/connections -m gpt-4.1-mini -n 20 -a '{\"split\": \"eval\"}'\n```\n\n### Environment Arguments\n\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `split` | str | `\"train\"` | `\"train\"` (pre-2025 puzzles + synthetic) or `\"eval\"` (2025 puzzles only) |\n| `num_examples` | int | `-1` | Limit dataset size (-1 = all) |\n| `seed` | int | `42` | Random seed for word shuffling |\n\n### Metrics\n\n| Metric | Meaning |\n| ------ | ------- |\n| `difficulty_weighted_reward` | Primary reward: difficulty-weighted group score minus mistake penalty, max 1.0 |\n| `mistakes_used_metric` | Mistakes made this game (0–4) |\n| `groups_found_metric` | Number of groups correctly identified (0–4) |\n| `avg_difficulty_solved_metric` | Avg difficulty level of solved groups (0=Yellow … 3=Purple) |\n| `num_turns` | Total turns taken |\n| `filter/gibberish` | Fraction of responses that are gibberish |\n| `filter/repetition` | Fraction of responses with excessive repetition |\n\n### Baseline Results\n\nEvaluated on 100 eval-split puzzles with `max_tokens=16384`:\n\n| Model | Reward | Win Rate | Groups Found |\n| ----- | ------ | -------- | ------------ |\n| Qwen3-30B-A3B-Thinking-2507 (base) | 0.57 | 61% | 2.9 / 4 |\n","encoding":"utf-8","truncated":false,"total_bytes":4772},"status":null}