{"data":{"kind":"file","path":"README.md","version_id":"h756psc16cdsk3y9ag39hbld","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":3306,"modified_at":"2025-10-02T16:25:29.670000","content_hash":"92b2af4ca4abe470673cb11884ea3f9a7831f35ad0145792262feb1499e2aac8"},"entries":[],"content":"# vf-thick-fog\n\n### Overview\n\n* **Environment ID**: `vf-thick-fog`\n* **Short description**: Multi-turn 2D grid search game with noisy L1-distance probes under \"thick fog\".\n* **Tags**: reasoning, calibration, tool-use, synthetic, search\n\n### Datasets\n\n* **Primary dataset(s)**: Synthetic dataset of hidden target coordinates `(x*, y*)` in a bounded 2D integer grid (default in `[-50, 50]^2`).\n* **Noise parameter `d`**: sampled *per episode* uniformly from an integer interval (default `[4, 9]`).\n* **Source links**: *synthetic, generated on-the-fly*\n* **Split sizes**: default `num_examples=1000` synthetic samples (can configure via `--env-args`).\n\n### Task\n\n* **Type**: multi-turn with tool-use (the model may call the tool `ping` multiple times before producing a final answer).\n* **Parser**: custom parser that extracts the last two integers `\"x y\"` from the model’s final response.\n* **Rubric overview**:\n\n  * `correct_answer`: +600 if guessed `(x,y)` matches the hidden target.\n  * `pings_number`: −1 per individual ping sample used (all calls count).\n  * `token_cost`: diagnostic only, tracks total tokens consumed (weight 0).\n\n### Quickstart\n\nRun an evaluation with default settings:\n\n```bash\nuv run vf-eval vf-thick-fog\n```\n\nConfigure model and sampling:\n\n```bash\nuv run vf-eval vf-thick-fog \\\n  -m gpt-5-nano \\\n  -n 20 -r 3 -t 1024 -T 0.7 \\\n  -a '{\"target_range\": 50, \"d_min\": 4, \"d_max\": 9}'\n```\n\nNotes:\n\n* Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\n* If you want a fixed noise radius instead of a range, pass `{\"d\": 5}`.\n\n### Environment Arguments\n\n| Arg            | Type | Default | Description                                                                                  |\n| -------------- | ---- | ------- | -------------------------------------------------------------------------------------------- |\n| `num_examples` | int  | `1000`  | Number of synthetic samples generated                                                        |\n| `target_range` | int  | `50`    | Sampled targets are uniform in `[-target_range, target_range]^2`                             |\n| `d`            | int  | `None`  | If set, fixes the noise radius `d` for all episodes                                          |\n| `d_min`        | int  | `4`     | Minimum noise radius if `d` is not specified                                                 |\n| `d_max`        | int  | `9`     | Maximum noise radius if `d` is not specified                                                 |\n| `seed`         | int  | `42`    | Random seed for reproducible dataset                                                         |\n| `max_pings`    | int  | `100`   | Total ping budget `B`, the sum of all `k` across tool calls (episode terminates if exceeded) |\n\n### Metrics\n\n| Metric           | Meaning                                                      |\n| ---------------- | ------------------------------------------------------------ |\n| `reward`         | Main scalar reward: `100*is_correct − pings_number`          |\n| `correct_answer` | 1.0 if final guess matches hidden target, else 0.0           |\n| `pings_number`   | Total number of ping samples used (all calls count)          |\n| `token_cost`     | Total tokens consumed (prompt + completion); diagnostic only |\n","encoding":"utf-8","truncated":false,"total_bytes":3306},"status":null}