{"data":{"kind":"file","path":"README.md","version_id":"uz6mhy36tjxq85canit65h7p","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6848,"modified_at":"2026-05-29T03:31:11.038000","content_hash":"321660789a8d42afa4b9211b1ba82a200975c7ed774ada8ae87b16d853f9eb61"},"entries":[],"content":"# click-calibrate\n\n### Overview\n- **Environment ID**: `click-calibrate`\n- **Short description**: Multi-turn visual click calibration for models that emit raw pixel or normalized coordinates through either a direct `click` tool or a `computer` tool.\n- **Tags**: visual, click, calibration, eval\n\n### Datasets\n- **Primary dataset(s)**: Synthetic PNG screenshots generated at `load_environment()` time.\n- **Task families**:\n  - `aim`: a blank screen with one colored circle; click the center of the circle.\n  - `text`: a simple rendered page/document; click a named word.\n- **Default size**: 80 generated examples.\n\n### Task\n- **Type**: multi-turn visual tool use.\n- **Expected action**: call the configured tool once per turn until the target is hit or the turn limit is reached.\n- **Stop condition**: rollout ends on a target hit or after `max_turns` model turns.\n- **Tool formats**:\n  - `click`: direct click tool, using `click(x_px, y_px)` for pixels or `click(x, y)` for normalized coordinates.\n  - `computer`: computer-use style tool, using `computer(actions=[{\"action\": \"left_click\", \"coordinate\": [x, y]}])`. The scorer also accepts the native top-level form `computer(action=\"left_click\", coordinate=[x, y])`.\n- **Coordinate modes**:\n  - `pixel`: raw image pixels.\n  - `norm1000`: normalized integer coordinates `[0, 1000]`.\n  - `norm999`: normalized integer coordinates `[0, 999]`.\n- **Scoring**: the emitted click is converted to pixels and checked against the target circle or text bounding box.\n\n### Quickstart\nInstall the environment:\n\n```bash\nprime env install click-calibrate\n```\n\nRun a small pixel-coordinate eval:\n\n```bash\nprime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{\"coordinate_mode\": \"pixel\"}'\n```\n\nRun normalized-coordinate variants:\n\n```bash\nprime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{\"coordinate_mode\": \"norm1000\"}'\nprime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{\"coordinate_mode\": \"norm999\"}'\n```\n\nRun the same coordinate modes through the `computer` tool:\n\n```bash\nprime eval run click-calibrate -m anthropic/claude-haiku-4.5 -n 5 -r 1 -a '{\"tool_format\": \"computer\", \"coordinate_mode\": \"pixel\"}'\nprime eval run click-calibrate -m anthropic/claude-haiku-4.5 -n 5 -r 1 -a '{\"tool_format\": \"computer\", \"coordinate_mode\": \"norm1000\"}'\n```\n\n### Visualize Saved Runs\nGenerate a static HTML index for all saved runs under `outputs/evals`:\n\n```bash\nuv run python environments/click_calibrate/view_eval.py\n```\n\nGenerate a viewer for one saved eval run directory or `results.jsonl` file:\n\n```bash\nuv run python environments/click_calibrate/view_eval.py \\\n  environments/click_calibrate/outputs/evals/<run>/results.jsonl\n```\n\nThe viewer overlays the ground-truth target in green and the model's attempted click in red. Use the on-page arrows or keyboard left/right arrows to move between samples.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `coordinate_mode` | str | `\"pixel\"` | One of `pixel`, `norm1000`, or `norm999`. |\n| `tool_format` | str | `\"click\"` | One of `click` or `computer`. |\n| `dataset_size` | int | `80` | Number of synthetic examples to generate. |\n| `task_mix` | str | `\"aim,text\"` | Comma-separated task families. |\n| `seed` | int | `0` | Random seed for deterministic generation. |\n| `width` | int or null | `null` | Fixed image width. Must be paired with `height`. |\n| `height` | int or null | `null` | Fixed image height. Must be paired with `width`. |\n| `tool_name` | str or null | `null` | Optional override for the exposed tool name. Defaults to `click` for `tool_format=\"click\"` and `computer` for `tool_format=\"computer\"`. |\n| `text_fallback` | bool | `false` | Include JSON fallback instructions for non-tool-call endpoints. Leave disabled for click-tool calibration. |\n| `sweep_tag` | str or null | `null` | Optional label saved in run metadata and sample info for distinguishing eval sweeps. |\n| `max_turns` | int | `10` | Maximum attempts per rollout. |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `click_hit` | Main reward: 1 if any attempted click lands on the target, else 0. |\n| `first_turn_accuracy` | 1 if the first model turn hits the target. |\n| `last_turn_accuracy` | 1 if the final model turn hits the target. |\n| `click_extracted` | 1 if any click could be parsed from tool calls or text fallback. |\n| `coordinate_valid` | 1 if the final emitted click has coordinates in the declared range. |\n| `tool_call_used` | 1 if the model used the configured tool instead of text fallback. |\n| `distance_px` | Zero-weight metric: pixel distance from the final attempted click to the target center. Lower is better. |\n| `target_click_distance_px` | Zero-weight alias for `distance_px` with a more explicit name. |\n\n### v0.1.0 Results\nCorrected sweep over 5 models, 2 coordinate modes, and 5 image sizes with `-n 20 -r 5 -s`.\n\nAggregate results:\n\n| Model | Mode | Reward | Distance px | Click extracted | Tool call used | Truncation | Error |\n| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |\n| `openai/gpt-5.5` | `pixel` | 0.910 | 10.8 | 1.000 | 1.000 | 0.000 | 0.000 |\n| `openai/gpt-5.5` | `norm1000` | 0.908 | 10.9 | 1.000 | 1.000 | 0.000 | 0.000 |\n| `anthropic/claude-opus-4.7` | `pixel` | 0.900 | 11.7 | 1.000 | 1.000 | 0.000 | 0.000 |\n| `anthropic/claude-opus-4.7` | `norm1000` | 0.898 | 13.7 | 1.000 | 1.000 | 0.000 | 0.000 |\n| `anthropic/claude-haiku-4.5` | `pixel` | 0.826 | 21.1 | 1.000 | 1.000 | 0.000 | 0.000 |\n| `anthropic/claude-haiku-4.5` | `norm1000` | 0.712 | 27.8 | 1.000 | 1.000 | 0.000 | 0.000 |\n| `qwen/qwen3.6-35b-a3b` | `pixel` | 0.174 | 234.2 | 0.966 | 0.972 | 0.008 | 0.004 |\n| `qwen/qwen3.6-35b-a3b` | `norm1000` | 0.588 | 186.4 | 0.940 | 0.940 | 0.008 | 0.024 |\n| `moonshotai/kimi-k2.6` | `pixel` | 0.790 | 39.2 | 0.988 | 0.834 | 0.000 | 0.000 |\n| `moonshotai/kimi-k2.6` | `norm1000` | 0.746 | 57.2 | 0.994 | 0.842 | 0.000 | 0.000 |\n\nReward by image size:\n\n| Model | Mode | 1024x768 | 1440x900 | 1280x720 | 1000x1000 | 800x600 |\n| --- | --- | ---: | ---: | ---: | ---: | ---: |\n| `openai/gpt-5.5` | `pixel` | 0.85 | 0.85 | 1.00 | 0.85 | 1.00 |\n| `openai/gpt-5.5` | `norm1000` | 0.84 | 0.85 | 1.00 | 0.85 | 1.00 |\n| `anthropic/claude-opus-4.7` | `pixel` | 0.85 | 0.85 | 0.95 | 0.85 | 1.00 |\n| `anthropic/claude-opus-4.7` | `norm1000` | 0.85 | 0.85 | 0.94 | 0.85 | 1.00 |\n| `anthropic/claude-haiku-4.5` | `pixel` | 0.85 | 0.45 | 0.99 | 0.84 | 1.00 |\n| `anthropic/claude-haiku-4.5` | `norm1000` | 0.72 | 0.22 | 0.83 | 0.85 | 0.94 |\n| `qwen/qwen3.6-35b-a3b` | `pixel` | 0.09 | 0.01 | 0.04 | 0.71 | 0.02 |\n| `qwen/qwen3.6-35b-a3b` | `norm1000` | 0.60 | 0.48 | 0.51 | 0.69 | 0.66 |\n| `moonshotai/kimi-k2.6` | `pixel` | 0.77 | 0.56 | 0.94 | 0.77 | 0.91 |\n| `moonshotai/kimi-k2.6` | `norm1000` | 0.65 | 0.68 | 0.88 | 0.78 | 0.74 |\n","encoding":"utf-8","truncated":false,"total_bytes":6848},"status":null}