{"data":{"kind":"file","path":"README.md","version_id":"a8uoihancyy45lxcp6voy3tf","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":11011,"modified_at":"2026-06-02T13:09:54.508000","content_hash":"b4d2a9c6745e6ed0ebbe8646be8c8280a9e050204ad1d1a7f07cbdbc6e00ed3e"},"entries":[],"content":"# forth_lang\n\nMulti-turn tool-using-agent environment on Forth code execution. The model defines a Forth word that satisfies a natural-language specification, using three sandboxed tools (`submit_code`, `run_code`, `lookup_doc`). Reward is binary: 1.0 if the latest submission passes all hidden test cases, else 0.0.\n\n## Env config (verifiers v1)\n\n`[env.taskset]` / `[env.harness]` sections validate directly against `ForthLangTasksetConfig` / `ForthLangHarnessConfig`.\n\n### Taskset fields (`[env.taskset]`)\n\n| Field | Default | Notes |\n|---|---|---|\n| `tiers` | `None` (all) | List of tier ids (0-5) to include. Caller chooses train/eval splits by instantiating one env per split. |\n| `categories` | `None` (all) | List of category names to include. See `forth_lang.tasks.CATEGORIES`. |\n| `word_to_call` | `None` (all) | List of `word_to_call` ids to include (unique, stable task ids — each task defines a unique Forth word). Unknown ids raise `ValueError`. AND-composed with `tiers` / `categories`. |\n| `exclude_word_to_call` | `None` (none) | List of `word_to_call` ids to drop. Use the same list as `word_to_call` for eval + `exclude_word_to_call` for train to define a held-out test set once. |\n| `holdout_fraction` | `None` (off) | Hash-based deterministic split on `word_to_call`. Set the same value (e.g. `0.2`) on both the train and eval env declarations with matching `holdout_seed`; the train env gets the ~(1-fraction) complement, the eval env gets the fraction holdout (auto-derived from `get_dataset` vs `get_eval_dataset`). |\n| `holdout_seed` | `0` | Salt for the holdout hash. Keep fixed across train+eval so the two sides are complementary. |\n| `dataset_repo` | `PrimeIntellect/forth-lang-tasks` | Public HF repo id or local path that `datasets.load_dataset` can read. |\n| `sandbox.image` | `team-clyvldofb0000gg1kx39rgzjq/forth-lang:v3` | Sandbox image — a baked image with gforth + python3 + bm25s + the docs bundle. |\n| `sandbox.cpu_cores`, `sandbox.memory_gb`, `sandbox.disk_size_gb`, `sandbox.timeout_minutes`, `sandbox.command_timeout` | `1.0`, `1.0`, `2.0`, `30`, `15` | Standard `vf.SandboxConfig` fields. |\n| `sandbox_labels` | `[]` | Labels appended to `sandbox.labels` at config-load time, so per-cell configs can tag sandboxes (cleanup, quota tracking, dashboard grouping) without overriding the full `sandbox` block. |\n| `system_prompt` | (built-in Forth-tutor) | Override the default prompt. |\n\n### Harness fields (`[env.harness]`)\n\n| Field | Default | Notes |\n|---|---|---|\n| `max_turns` | 30 | Hard cap on assistant turns per rollout. |\n\n## Changelog\n\n### v0.3.0\n\n- **Removed the task-generation pipeline** (`task_generation/` — `aggregate.py`, `reverify.py`, `run_filter.py`, `sanity_check_trivial.py`, `verify.py`, `verify_task.py`, `upload_to_hf.py`, ~1,300 lines). It produced the HF dataset once and was never used at runtime; recover from git history if a rebuild is needed (last present in commit `693f8c430`).\n- **Migrated to verifiers v1 (Taskset / Harness).** Replaces the `SandboxMixin + StatefulToolEnv` subclass and the separate `ForthLangRubric` with a slim `ForthLangTaskset` + bare `vf.Harness`. Net delete: ~340 lines of glue (per-tool `add_tool` + `args_to_skip` + `update_tool_args`, custom `setup_state`, two `@vf.cleanup` methods, the Rubric subclass, the whole `sandbox_helpers.py`).\n- **Sandbox lifecycle is framework-owned.** Toolset declares the sandbox config; v1 provisions the per-rollout lease and releases it in `cleanup_rollout`. No more `init_sandbox_client` / `create_sandbox` / `delete_sandbox` calls in env code.\n- **`run_code.word_to_call` is now hidden from the model.** Bound from `task.word_to_call`, the model never has to (and can't fail to) pass it. Removes a real per-turn failure mode.\n- **Hidden-test verifier is now a `@vf.reward(priority=10)`.** Drives the same in-rollout `run_code` callable the model uses via a `passed.run_code = \"tools.run_code\"` binding. The four diagnostic signals (`pass_rate`, `has_error`, `banned_violation`, `submission_error_rate`) are priority-0 `@vf.metric` functions that read state.\n- **TOML config promoted to first-class.** `[env.taskset]` / `[env.harness]` sections validate against `ForthLangTasksetConfig` / `ForthLangHarnessConfig` directly. CLI overrides go through the typed config too — e.g. `vf-eval forth-lang -a '{\"config\": {\"taskset\": {\"tiers\": [5]}}}'`.\n- **Per-row knobs available for free.** Task rows can set `max_turns`, `sandbox`, and `tools` show/hide for per-task sizing and action-space control — no code changes needed.\n- **API drops:** `ForthLangEnv` and `ForthLangRubric` classes are gone; `load_environment` now takes a single `vf.EnvConfig`. Public surface from `forth_lang`: `load_environment`, `ForthLangTaskset`, `ForthLangTasksetConfig`, `ForthLangHarnessConfig`.\n\n### v0.2.0\n\n- **New tier scheme T0-T5.** Tiers are now derived empirically from glm-5.1 pass-rate at **10 rollouts/task** (default budget, max_turns=30). `T0 = pr==1.0` (perfectly solved across all 10 rollouts × full test pass); T1..T5 are equal-rank quantile bins over the remaining `pr<1.0` cohort. Replaces the prior T0-T8 *solver-filter* scheme. The 4-pass solver filter pipeline still lives in `scripts/run_filter.py` and is used for task generation, but its tier output is no longer the canonical task label.\n- **+31 Phase 3 tasks** generated for previously-zero-coverage categories (arithmetic, comparison-and-logic, conditionals, indefinite-loops, recursion, variables-and-memory). After the N=10 re-tier, 6 survived as non-T0 (`poly-eval`, `walk-sum` → T1; `ilog-budget`, `pow-mod-find` → T2; `eval-node` → T3; `ll-run` → T4); the 25 that landed at T0 (redundant with the existing perfect-solve cohort) were dropped.\n- **Dataset now at 419 rows on HF** (down from 444 after dropping the 25 T0 leaks; up from 413 at v0.1.2 with the 6 non-T0 Phase 3 additions). Tier distribution: T0×248, T1×35, T2-T5×34 each.\n- **N=10 reveals N=3 flakiness:** 91 of the 339 N=3-trial \"T0\" tasks migrate to T1+ at N=10 — 21% over-counting of the perfect cohort at N=3. The N=3 T3 bin was also degenerate (all-ties at pr=0.667); N=10 produces a real 0.800-0.854 T3 range.\n- **New filter args** on `load_environment` / `load_tasks`: `word_to_call` (positive: include only these task ids) and `exclude_word_to_call` (negative: drop these ids). Enables the \"test set = these N tasks, train set = everything else\" pattern with a single shared list. Both raise `ValueError` on unknown ids.\n- **README updates:** new tier definition with N=10 ranges (count, range, mean, median), N=3 vs N=10 comparison table, full category × tier matrix sorted by % non-T0, Phase 3 survivors table. Removed outdated T6/T7/T8 prose.\n- **Coverage gaps documented:** `conditionals` is now 18/18 at T0 (zero non-T0 coverage); `arithmetic` nearly as flat (1/20 non-T0). 10 of 15 categories have no T5 representation — the hardest cohort concentrates in `data-structures`, `metaprogramming`, `strings`, `stack-manipulation`, `forth-idioms`.\n\n### v0.1.2\n\n- **+39 new tasks** added (6 buckets explicitly aimed at T8: stack-only-memory, compile-time-meta, EXECUTE-only control flow, introspection / self-modifying, coroutines-via-rstack, bit-fiddle-without-arithmetic). After the 4-pass filter the empirical distribution was: T0×1, T1×1, T2×2, T3×8, T4×10, T6×2, T7×15. **No T8 survivors**, despite the T8-targeted generation — glm-5.1 cracked every fourth-pass task at 100 turns (12 @ 3/3, 3 @ 2/3). Honest data point: hardening the env beyond T7 requires deeper esoterica than the 6 OOD angles we tried; the previous T8 (`srev`, `roll`-from-rstack stack reversal) remains the only T8 task.\n- Dataset now at **413 rows** on HF.\n- `aggregate.py` validator widened to accept `tier ∈ [0, 8]` (added in v0.1.1, codified for v0.1.2 spec-vs-implementation hygiene).\n- Spec-vs-test alignment audit pass added to the generation pipeline: a separate sub-agent reads each candidate's prose without the reference solution and flags cases where a fresh reader couldn't predict the expected outputs. This batch had 1 alignment fix (`select-rstack-no-if` test inputs converted from JSON booleans to Forth-flag `-1`/`0` literals).\n\n### v0.1.1\n\n- **Taskset moved to private HF dataset** [`PrimeIntellect/forth-lang-tasks`](https://huggingface.co/datasets/PrimeIntellect/forth-lang-tasks). 7-column schema; loaded via `datasets.load_dataset` with `HF_TOKEN`. Override the source with `FORTH_LANG_TASKS_REPO` (accepts an alternate HF repo or a local path) for staging.\n- **All three tools now execute inside the sandbox.** `lookup_doc` migrated from a host-side BM25 retriever to `python3 /opt/forth-lang/lookup_docs.py <query>` in the sandbox; `bm25s` dropped from host deps.\n- **Team-registry Docker image** `team-clyvldofb0000gg1kx39rgzjq/forth-lang:v2` is now the only supported runtime image. Docs index is fetched fresh from the gforth manual at image-build time and baked into `/opt/forth-lang/`; no in-sandbox install or upload at rollout setup.\n- **+33 new tasks** added (4 buckets, filtered through a 4-pass solver pipeline with strict 0/3 cascade). Per-tier distribution of additions: T1×1, T2×2, T3×2, T4×6, T6×1, T7×20, **T8×1**.\n- **New T8 ceiling-probe tier.** Tasks where all four solver passes scored 0/3 but the reference solution still passes its own tests are now kept and labeled T8 (rather than dropped). `stack-reverse-no-memory` is the first T8 entry — solvable via a `roll`-based return-stack idiom, but neither gpt-4.1-mini, gpt-5-mini, gpt-5.4, nor glm-5.1 produced a single passing rollout at up to 100 turns.\n- **Solver-filter pipeline extended** to a 4th stage (`run_filter.py fourth`) and `_empirical_tier` now produces T7/T8 empirically (alongside any hand-curated `t7_curated` tasks). All cascade steps moved to strict 0/3 (a task only escalates when the previous solver scored 0 across all 3 rollouts).\n- **Sandbox controls:** `labels=` arg now reachable via `load_environment` so `prime sandbox delete -l <label>` can scope cleanups; `sandbox_name` is set internally for admin-panel display.\n- **Scripts hygiene:** `aggregate_v2.py → aggregate.py`, `reverify_v2_raw.py → reverify.py`; `sys.path.insert` hacks removed; `DEFAULT_ENDPOINTS` hardcode dropped; shared `verify_task`, `read_jsonl`, `GROUP_TO_CATEGORY` extracted into the env package.\n- **Gforth helpers** consolidated into `forth_lang/gforth.py` (`format_literal`, `build_forth_line`, `parse_stack`), shared by the sandbox runner and the offline subprocess runner.\n- **Loader hygiene:** taskset loading logic moved from `tasks/__init__.py` into `tasks/loader.py`; `__init__.py` is a pure re-export.\n- Misc bug fixes caught by Cursor's bugbot: `NameError` in `upload_to_hf.py` empty-rows guard; `KeyError: 'scenario'` in `run_filter.py` join logic (now joins on `word_to_call`); README-implementation drift on tier range and `max_submissions`; dead code in `build_docs_cache.py`.\n\n### v0.1.0\n\n- Created environment.\n","encoding":"utf-8","truncated":false,"total_bytes":11011},"status":null}