{"data":{"kind":"file","path":"README.md","version_id":"g4ax101s5fxaovqeg8gsyk4z","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":4220,"modified_at":"2025-10-02T05:47:09.105000","content_hash":"b7da095eb455165e63d18ccf794d0d3ed6352d0bb3d28980d504ce4319d013ff"},"entries":[],"content":"# README.md\n\n# vf-binary-liar\n\n### Overview\n- **Environment ID**: `vf-binary-liar`\n- **Short description**: Single-turn number-guessing environment with tool calling and per-episode noisy hints. The agent may call `probe(guess, p_guess)` multiple times, then must output a single final integer.\n- **Tags**: single-turn, tool-use, calibration, robustness, synthetic, reproducibility\n\n### Datasets\n- **Primary dataset**: Synthetic, generated on the fly via `_build_dataset` with per-episode `lie_prob ~ Uniform(l, u)`.\n- **Source links**: N/A (fully synthetic).\n- **Split sizes**: Controlled by `num_examples` (default 1000). No external train/eval split; each run samples a fresh set.\n\n### Task\n- **Type**: Single-turn dialogue with tool use (function calling). 
The model may issue multiple tool calls during one logical episode; only the final message is scored as the answer.\n- **Tool**: `probe(guess: int, p_guess: float)` → returns `{correct: bool, hint: 'higher'|'lower'|'none'}`; hints may flip according to a hidden per-episode lie probability.\n- **Parser**: `ThinkParser` with a lightweight numeric extractor (reads the last integer from the final message).\n- **Rubric overview**:\n  - **Correctness**: +`w_correct` if the parsed final integer equals the ground truth.\n  - **Calibration**: −`c_cal` × (sum of Brier losses across all `probe` calls), where each call contributes `(p_guess − 1{guess==answer})^2`.\n  - **Token usage**: tracked via `token_cost(state)`; weight is 0 by default (can be extended if desired).\n  - **Tool calls**: `ToolRubric` applies −`c_probe` per `probe` call.\n\n### Quickstart\nRun with defaults (uses a synthetic dataset):\n```bash\nuv run vf-eval vf-binary-liar\n```\n\nSpecify a model and sampling controls:\n```bash\nuv run vf-eval vf-binary-liar \\\n  -m gpt-5-nano \\\n  -n 20 -r 3 \\\n  --max-concurrent-requests 32\n```\n\nCustomize the environment via JSON args:\n```bash\nuv run vf-eval vf-binary-liar \\\n  -m gpt-5-nano \\\n  -n 50 -r 2 \\\n  -a '{\n    \"num_examples\": 50,\n    \"low\": 1,\n    \"high\": 100,\n    \"lie_prob_range\": [0.0, 0.4],\n    \"seed\": 42,\n    \"max_turns\": 12,\n    \"w_correct\": 100.0,\n    \"c_cal\": 10.0,\n    \"c_probe\": 1.0\n  }'\n```\n\nNotes:\n- `lie_prob_range` sets the per-episode flip probability for hints; the actual value is hidden from the model.\n- `max_turns` caps the number of tool/assistant turns allowed in one episode. 
If ≤ 0, it defaults to the range size (`high-low+1`).\n- Reproducibility is ensured by deterministic, per-episode noise (hidden seed) and fixed prompts.\n\n### Environment Arguments\n| Arg | Type | Default | Description |\n| --- | ---- | ------- | ----------- |\n| `num_examples` | int | `1000` | Number of synthetic episodes to generate. |\n| `low` | int | `1` | Lower bound (inclusive) for the secret number. |\n| `high` | int | `100` | Upper bound (inclusive) for the secret number. |\n| `lie_prob_range` | tuple(float,float) | `(0.0, 0.4)` | Range for per-episode lie probability `U[l,u]` used to (possibly) flip hints. |\n| `seed` | int | `42` | RNG seed for dataset generation. |\n| `max_turns` | int | `-1` | Max dialogue/tool turns per episode; if ≤ 0, becomes `high-low+1`. |\n| `w_correct` | float | `100.0` | Reward for a correct final answer. |\n| `c_cal` | float | `10.0` | Multiplier (penalty) for summed Brier loss over probes. |\n| `c_probe` | float | `1.0` | Penalty per `probe` tool call. |\n\n### Metrics\n| Metric | Meaning |\n| ------ | ------- |\n| `reward` | Weighted sum from the rubric group (correctness − calibration penalty − tool-call penalty; token cost tracked separately by default). |\n| `accuracy` | Fraction of episodes with exact-match final integer. |\n| `brier_sum` | Sum over probes of `(p_guess − 1{guess==answer})^2` (lower is better). |\n| `tool_calls` | Count of `probe` calls per episode (lower is better, all else equal). |\n| `tokens` | Total tokens counted from API responses; tracked for analysis. |\n\n### Contract the agent must follow\n- May call `probe(guess, p_guess)` multiple times while reasoning.\n- Must end with a single integer in the final assistant message (no extra text). The parser extracts the last integer as the answer.\n- Calibrate `p_guess` honestly; over- and under-confidence are penalized via the Brier loss.","encoding":"utf-8","truncated":false,"total_bytes":4220},"status":null}