{"data":{"kind":"file","path":"README.md","version_id":"qd60rfrk5hyii3bft1xetbpb","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":6060,"modified_at":"2025-09-06T23:55:03.061000","content_hash":"0fbb78d6f930687eeead31b4446ab3ab58a61233981eae19655df63e1b7d321e"},"entries":[],"content":"# vf-context-span-toolenv\n\nToolEnv that trains/evals a model to extract a verbatim, self-contained excerpt around a target span for high-quality embeddings, with an exact token budget via tiktoken and peek tools (sentence windows) under efficiency penalties.\n\n## Overview\n\n- **Environment ID (module):** `vf_context_span_toolenv`\n- **Hub slug (package name):** `vf-context-span-toolenv`\n- **Short description:** Extract a contiguous, verbatim excerpt that includes a given target string, fits a strict token budget (tiktoken), resolves references, and ends on clean boundaries using limited peek tools.\n- **Keywords:** verifiers, ToolEnv, embeddings, chunking, RL, rubric\n\n## Datasets\n\n- **Primary dataset:** Synthetic documents generated at load time across five types (technical, business, legal, medical, academic) with short, realistic sections (API errors, financial highlights, license terms, protocol dosing, research results). Targets are phrases present in the doc (e.g., `ES-401: Authentication failure.`).\n- **Source links:** (synthetic; generated by the environment—no external sources)\n- **Split sizes:** Generated on demand (default `num_examples=800`). There are no fixed train/test splits; set seeds for repeatability or sample smaller subsets with CLI flags.\n\n## Task\n\n- **Type:** Tool use (Verifiers ToolEnv with hidden document + tools)\n- **I/O format:** Chat style (user prompt lists doc_id, title, type, and the TARGET string; model must call tools to peek context and return raw excerpt text only)\n- **Tools:**\n  - `get_meta(doc_id)` → `{title, doc_type, sentence_count}`\n  - `count_occurrences(doc_id, target)` → `{count}`\n  - `peek_window(doc_id, target, occurrence=1, left=1, right=1, unit=\"sentence\", hard_limit=600)` → returns contiguous window around the chosen occurrence\n- **Parser:** None (raw text output; no quotes, tags, or prefixes)\n- **Rubric overview (multi-criteria reward):**\n  - `contains_target` — excerpt must include the exact target\n  - `verbatim_substring` — excerpt must be an exact substring of the doc (no paraphrase)\n  - `token_budget` — exact token count via tiktoken against max_context_tokens\n  - `boundary_reward` — starts/ends on natural boundaries (sentence-like)\n  - `self_contained` — resolves pronouns/ambiguity via cues (names, numerics)\n  - `disambiguation` — if target appears multiple times, choose a uniquely identifying window\n  - `format_compliance` — no boilerplate (e.g., \"Context:\"), no code fences/quotes\n  - `tool_efficiency` — penalizes excessive/large peeks vs. final excerpt size; mild penalty >5 tool calls\n\n*(Rubrics & ToolEnv are first-class concepts in Verifiers; see docs.)*\n\n## Tokenization\n\nThis environment uses tiktoken for exact token budgets. You can either pass an explicit encoding name (default `cl100k_base`) or an `encoding_model` (e.g., an embedding model) and let `tiktoken.encoding_for_model()` choose the right tokenizer.\n\nFor general OpenAI embedding usage guidance, see the [official docs](https://platform.openai.com/docs/guides/embeddings).\n\n## Quickstart\n\nEvaluate with defaults (uses your Verifiers-compatible model endpoint):\n\n```bash\nuv run vf-eval vf_context_span_toolenv\n```\n\nSpecify model and sampling:\n\n```bash\nuv run vf-eval vf_context_span_toolenv \\\n  -m gpt-4o-mini \\\n  -n 20 -r 3 \\\n  --env-args '{\"max_context_tokens\":128,\"encoding_name\":\"cl100k_base\"}'\n```\n\n**Tip:** use `vf-eval --help` for available flags in your installed Verifiers version (CLI surface may evolve). See [Verifiers docs](https://github.com/anthropics/verifiers).\n\n## Environment Arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `num_examples` | `int` | `800` | Number of generated items per load. |\n| `max_context_tokens` | `int` | `128` | Exact token budget enforced by tiktoken. |\n| `seed` | `int` | `42` | RNG seed for reproducible generation. |\n| `encoding_name` | `str\\|None` | `\"cl100k_base\"` | Explicit tokenizer name (e.g., `cl100k_base`, `o200k_base`). |\n| `encoding_model` | `str\\|None` | `None` | If provided (e.g., `text-embedding-3-large`), uses `encoding_for_model()` to select the tokenizer. |\n\n## Metrics\n\n| Metric | Meaning |\n|--------|---------|\n| `reward` | Weighted sum across rubric criteria (main score). |\n| `contains_target` | 1.0 if excerpt contains the exact target string. |\n| `verbatim_substring` | 1.0 if output is an exact substring of the document. |\n| `token_budget` | 1.0 (scaled) if tiktoken count ≤ budget; penalties if over. |\n| `boundary_reward` | Boundary cleanliness (starts/ends like a sentence). |\n| `self_contained` | Rewards cues (proper names, numerics) when pronouns are present. |\n| `disambiguation` | Rewards choosing uniquely identifying context when duplicates exist. |\n| `format_compliance` | Penalizes quotes, code fences, and boilerplate prefixes. |\n| `tool_efficiency` | Penalizes excessive/large peeks relative to final excerpt; mild penalty for >5 tool calls. |\n| `ToolRubric.*` | (From Verifiers' built-in ToolRubric) tool call counters/auxiliary stats. |\n\n## Installation & Packaging\n\nThis environment is a single-module wheel with `pyproject.toml` using PEP 621 metadata and Hatchling as the build backend.\n\n```bash\npip install verifiers datasets tiktoken\nvf-install vf_context_span_toolenv\n```\n\n- PEP 621 defines standardized `[project]` metadata in `pyproject.toml`. [Reference](https://peps.python.org/pep-0621/)\n- Hatch/Hatchling config for builds is documented [here](https://hatch.pypa.io/latest/config/build/).\n\n## Model Tips\n\n- If you embed with an OpenAI model, pass `encoding_model=\"text-embedding-3-large\"` (or your target model) so the environment counts tokens with the correct encoding via `tiktoken.encoding_for_model()`.\n- See [OpenAI's embeddings guide](https://platform.openai.com/docs/guides/embeddings) for general best practices when building retrieval pipelines.\n\n## Changelog\n\n- **0.1.0** — Initial release (ToolEnv with `peek_window`, token budgets via tiktoken, efficiency penalties, synthetic corpus across 5 doc types).\n\n---\n","encoding":"utf-8","truncated":false,"total_bytes":6060},"status":null}