{"data":{"kind":"file","path":"README.md","version_id":"zo3wd8fckvp7raaacrh5mf07","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":9033,"modified_at":"2026-05-25T17:36:40.991000","content_hash":"e8c55bab74cbd2b42c836bc78624f680630abb5570f6f9f606be839a157b6bab"},"entries":[],"content":"# context-tools\n\nSandboxed Python-REPL harness for training models to **manage their own context** across turns. The current default data mix combines adaptive-cursor ledger tasks with realistic corpus-trail research synthesis, so raw appending fails under `context_rewrite=True` and compact state management is the reliable path.\n\n## How it works\n\nEach rollout gets a per-rollout Prime Sandbox running a long-lived Python worker subprocess (re-used from `RLMEnv`). The worker keeps a persistent namespace `dict` across calls; it's plain Python — no Jupyter kernel, no `ipykernel`. The worker uses `ast.parse → exec/eval` so a trailing expression's `repr` lands in the result dict (similar to IPython's `Out[N]`).\n\nThe model has one tool — `call_python_repl(code: str)` — and a per-rollout namespace pre-seeded with:\n\n- The world's tool functions (e.g. `observe` / `get_entity` / `look` / `read_event` / etc.) bound to that rollout's hidden state.\n- `submit_answer(value)` — terminates the rollout with a final answer.\n- (`context_rewrite=True` only) `context_window: list` — the model-owned persistent memory across turns. The task text is shown separately; every `context_window` slot is hard-truncated under the same cap.\n\nThe toggle `context_rewrite` selects the prompting flavor:\n\n- **`context_rewrite=True` (default).** Fresh `[system, user]` every turn; the user message renders the static task text plus the hard-truncated current `context_window` (plus the previous turn's code if it errored). 
The trajectory is never visible to the model.\n- **`context_rewrite=False`.** Standard tool-calling flow — the model sees the full conversation history, each `call_python_repl` returns the truncated execution output as a normal tool message.\n\n## Files\n\n```\ncontext_tools.py    ContextToolsEnv (subclasses RLMEnv) + load_environment\ntaskset.py          ContextToolsTaskSet (data wrapper)\ngenerators/         deterministic, solver-verified data pipeline\nscripts/            data-generation CLIs, including build_context_mix.py\nmy_data/            default mixed train/eval JSONL files plus per-family splits\n```\n\n## Task families\n\nAll have a single submitted answer per rollout, programmatically verifiable.\n\n| Family | Tools | Scratchpad mechanic | Question shape |\n|---|---|---|---|\n| `rule_hunt` | `get_entity`, `test` | edit a hypothesis as evidence narrows | submit a parse-tree rule |\n| `corpus_dive` | `list_keys`, `read_node` | prune mostly-noise observations | count/sum under a subtree path |\n| `timeline_track` | `read_event`, `read_events` | overwrite mutating fixed-schema state | owner/count at time T |\n| `detective` | `get_entity`, `query_attribute` | shrink a candidate set via elimination | unique entity satisfying all conjunctive constraints |\n| `maze_walk` | `look`, `move` | push/pop discipline (path advance + backtrack) | navigate to goal, submit goal's secret |\n| `adaptive_cursor` | `observe` | choose what returned cursor-page content to preserve | checkpoint ledger audit rows |\n| `corpus_trail` | `search_docs`, `read_doc` | retain durable source-tagged facts across a noisy research DAG | structured project risk brief with evidence ids |\n\nThe default train/eval mix is 60% `adaptive_cursor` and 40% `corpus_trail`, calibrated for from-scratch training with zero-gradient filtering. `adaptive_cursor` uses d0/d1/d2/d3/d4 at `12/23/35/22/8` within the family so early training has a broad easy on-ramp before harder ledger-update tasks enter. 
`corpus_trail` uses d0/d1/d2/d3/d4 at `20/35/30/12/3` within the family; d0/d1 are the search/read bridge, while d2+ provide the main source-synthesis frontier. There is no small manufactured per-turn tool-call limit. In `context_rewrite=True`, `observe(handle)`, `search_docs(...)`, and `read_doc(...)` are ordinary Python functions; the next prompt is only the hard-truncated render of whatever the model itself placed in `context_window`.\n\n`corpus_trail` is an answer-first research family. Each example samples a final JSON brief, constructs a hidden evidence DAG with reusable facts such as aliases and policy rules, renders that DAG into verbose source documents plus distractors, and seeds the REPL with a long `briefing_note`. `search_docs(...)` returns locator-only ids plus non-evidentiary snippets; it intentionally omits titles, dates, source kinds, and answer-bearing text. Retrieval is controlled by hidden search terms rather than rendered titles/body text, so d0/d1 can expose a readable \"read source, keep key, search next key\" chain without reintroducing search-result leakage. The model must call `read_doc(...)` on source ids it relies on and keep compact notes because raw gold documents are several times larger than the per-example context cap. Unlike adaptive-cursor, corpus-trail uses final exact JSON correctness only; there is no partial process reward for this family.\n\nLegacy families are solver-verified at generation time. 
`corpus_trail` is answer-first instead: the generator builds the answer and hidden evidence DAG first, then renders only the public documents; builder smoke checks verify that every gold evidence document is reachable through the exposed search terms and that compact gold notes fit while raw evidence overflows the per-example cap.\n\n## Quickstart\n\n```bash\nprime env install context-tools\n\n# Re-generate the default mixed train/eval sets\npython scripts/build_context_mix.py\n\n# Smoke-test eval\nprime eval run context-tools -m gpt-4.1-mini -n 5 -r 1\n```\n\n`PRIME_API_KEY` (for sandbox provisioning) is read automatically from `~/.prime/config.json` if it is not set in the environment. You need to export the provider key (e.g. `OPENAI_API_KEY`) yourself.\n\n## Environment arguments\n\n| Arg | Type | Default | Description |\n|-----|------|---------|-------------|\n| `dataset_path` | str | `my_data/train_context_mix.jsonl` | Training JSONL (8,000 rows: 60% adaptive_cursor, 40% corpus_trail) |\n| `eval_path` | str | `my_data/eval_context_mix.jsonl` | Held-out eval JSONL (800 rows with the same default mix) |\n| `context_rewrite` | bool | `True` | True: model curates `context_window`. False: standard tool-calling flow. 
|\n| `max_turns` | int | 15 | Max rollout turns |\n| `max_context_chars` | int | 400 | Display cap on rendered model-curated `context_window` slots (cr=True) / per-tool-response cap (cr=False) |\n| `max_code_display_chars` | int | 4000 | Display cap on echoed previous code (cr=True only) |\n| `show_previous_code` | bool | `False` | (cr=True only) If True, echo prior code every turn; default echoes only on error |\n| `tool_call_budget_per_turn` | int | 1000000 | No small manufactured tool-call cap by default; adaptive-cursor progress is gated by semantic route choices in ordinary returned page strings |\n| `sandbox_docker_image` | str | `python:3.11-slim` | Sandbox image |\n| `code_execution_timeout` | int | 120 | Per-turn code timeout (seconds) |\n| `sandbox_cpu_cores` | int | 1 | CPU cores allocated to the sandbox |\n| `sandbox_memory_gb` | int | 2 | Memory allocated to the sandbox (GB) |\n| `sandbox_timeout_minutes` | int | 30 | Hard sandbox lifetime cap |\n| `retain_filesystem_after_rollout` | bool | `False` | Keep `/rlm_fs` for post-mortem |\n\n## Rewards\n\n| Reward | Weight | Definition |\n|---|---|---|\n| `task_reward` | 1.0 | Capped at 1.0. Exact submitted answer gets 1.0. For adaptive-cursor misses, partial credit is terminal-gated: before the correct terminal page is observed, reward is 0. After the terminal page is observed, partial credit is `0.05 * complete_valid_submit + 0.10 * checkpoint_ids_in_order + 0.85 * submitted_checkpoint_row_fraction`. |\n| `correctness_reward` | 0.0 | Exact-answer metric only. |\n| `checkpoint_row_submit_fraction` | 0.0 | Metric: exact gold checkpoint rows present in the submitted answer. |\n| `checkpoint_row_context_fraction` | 0.0 | Metric: exact gold checkpoint rows visible in the final hard-truncated `context_window`. |\n| `valid_checkpoint_submit` | 0.0 | Metric: submitted answer is a non-empty list of 4-field rows. |\n| `complete_checkpoint_submit` | 0.0 | Metric: submitted answer has one valid row per expected checkpoint. 
|\n| `checkpoint_ids_in_order` | 0.0 | Metric: submitted rows use the expected checkpoint ids in order. |\n| `adaptive_terminal_reached` | 0.0 | Metric: the correct adaptive-cursor terminal page was observed. |\n\nAdditional diagnostic metrics include append/edit counts, dynamic overwrite/remove counts, final manifest character count, truncation count, and turn efficiency.\n\n## Data generation invariants\n\n- **100% synthetic, 100% verifiable**: ground truth is a deterministic function of the generated state.\n- **100% solvable from observations**: legacy ground truth is recomputed independently before each example is emitted; answer-first corpus tasks are emitted only when their gold evidence docs are reachable through the public search/read surface.\n- **Single answer per task**: every example terminates with one `submit_answer(...)` call.\n- **Difficulty stratified**: default training set is stratified across difficulties 0-4 for the adaptive-cursor template, with the d1/d2-lite bridge emphasized.\n","encoding":"utf-8","truncated":false,"total_bytes":9033},"status":null}