{"data":{"kind":"file","path":"README.md","version_id":"wy6x8beu3rvbwoqw0wm9j6nc","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4543,"modified_at":"2026-05-11T02:48:10.326000","content_hash":"3d48392153d7ba4ef200c244d5da1311ba00ca3915d61e9f867e06b52b33d2a4"},"entries":[],"content":"# connections\n\n### Overview\n\n- **Environment ID**: `connections`\n- **Short description**: Multi-turn word puzzle game where players find groups of related items.\n- **Tags**: puzzles, word-games, multi-turn, reasoning\n\n### Dataset\n\n- **Primary dataset**: [`ericbotti/connections-puzzles`](https://huggingface.co/datasets/ericbotti/connections-puzzles)\n- **Split sizes**: Train (RL): 7,554 puzzles, Test: 981 puzzles\n\nThe dataset includes puzzles scraped from PuzzGrid covering a variety of topics, grid sizes, and difficulty levels. \n\n### Task\n\n- **Type**: multi-turn, tool-calling\n- **Tool**: `guess(items: list[str])` — submit one guess per turn\n- **Stop conditions**: `max_mistakes_reached` (4 mistakes), `all_categories_found`, plus base-harness `max_turns_reached` / `prompt_too_long`\n- **Ruleset**: NYT only — 4 max mistakes counting from the start, one-away hints enabled, themes revealed immediately on correct guess. \n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval connections\n```\n\nConfigure model, sampling, and example count:\n\n```bash\nuv run vf-eval connections --model gpt-4.1-mini --num-examples 20 --rollouts-per-example 3 --max-tokens 1024 --temperature 0.7\n```\n\n### Environment Arguments\n\nPass via `--extra-env-kwargs '{...}'`:\n\n| Arg                            | Type | Default | Description                                                              |\n| ------------------------------ | ---- | ------- | ------------------------------------------------------------------------ |\n| `max_turns`                    | int  | `10`    | Maximum model turns per game                                             |\n| `is_dataset_raw_puzzles`       | bool | `true`  | If true, run `prep_dataset` on the training set                          |\n| `is_eval_dataset_raw_puzzles`  | bool | `true`  | If true, run `prep_dataset` on the eval set                              |\n| `system_prompt`                | str  | `None`  | Override the generated system prompt                                     |\n\n`dataset` and `eval_dataset` can be passed programmatically when constructing the env in Python; they default to the train_rl and test splits of `ericbotti/connections-puzzles`.\n\n### Rewards\n\n| Reward                              | Weight | Description                                                |\n| ----------------------------------- | ------ | ---------------------------------------------------------- |\n| `valid_guesses`                     | 0.5    | Proportion of guesses that passed validation               |\n| `almost_found_categories`           | 0.5    | Count of \"one away\" guesses for categories never found     |\n| `found_categories`                  | 4.0    | Proportion of categories found (0.0–1.0)                   |\n| `efficiency_bonus`                  | 1.0    | Rewards fewer manual guesses to find all categories        |\n\n### Key behaviors\n\n- **One guess per turn (enforced).** The taskset includes a `@vf.setup` handler that defaults `sampling_args.parallel_tool_calls = False` if neither the harness nor the runner specifies it. vLLM and OpenAI both honor this — vLLM by post-hoc truncation, OpenAI by suppressing parallel emissions at sampling time. Keeps the RL credit-assignment clean (one action → one reward) and prevents games from ending in a single turn when a model emits N parallel mistakes.\n- **Auto-completion.** When all but one category has been found, the last is auto-completed and recorded in `guess_history` with `status=\"auto\"`.\n- **Resume / doctoring.** v1-native: callers construct a `State` with populated `mistakes` / `found_categories` / `remaining_items` / `guess_history` and pass it to `harness.run(task, state)`, bypassing the `init_game_state` setup.\n- **BYO harness.** The `parallel_tool_calls=False` default lives on the taskset (via `@vf.setup`), so any harness composed with `load_taskset()` inherits it: `vf.Env(taskset=load_taskset(), harness=your_harness)`.\n\n### Architecture\n\n```\nconnections/\n├── environment.py    # source generators, guess tool, @vf.setup, @vf.stop, load_environment factory\n├── rubric.py         # @vf.reward-decorated reward functions, REWARDS list\n├── dataset.py        # prep_dataset — converts raw HF rows to env-shaped tasks\n├── prompts/          # system prompt + game-start prompt template\n├── rulesets.py       # legacy ruleset config (NYT used; PuzzGrid kept for future revival)\n└── utils.py          # GuessRecord dataclass, item formatting helpers\n```\n","encoding":"utf-8","truncated":false,"total_bytes":4543},"status":null}