{"data":{"kind":"file","path":"README.md","version_id":"fwcdfxfk1x9cs2ym36ycaf6s","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2566,"modified_at":"2026-02-04T05:11:23.836000","content_hash":"ebed4bae656b24dbd95934d944f08f86556c07e0587825c4fd8e2482f8b98cb4"},"entries":[],"content":"# shadow-state\r\n\r\n### Overview\r\n- **Environment ID**: `shadow-state`\r\n- **Short description**: A deterministic \"Entropy Airlock\" environment that tests Active Context Management by forcing agents to maintain a limited-capacity state (notebook) while processing a high-entropy, long-horizon log stream.\r\n- **Tags**: `reasoning`, `memory`, `active-context`, `synthetic`, `long-horizon`\r\n\r\n### Datasets\r\n- **Primary dataset(s)**: **Procedurally Generated Log Stream**. A deterministic sequence of system logs mixing signal (variable updates, logic definitions, expiry events) with noise, generated at runtime.\r\n- **Source links**: N/A (Generated via `shadow_state.LogGenerator`)\r\n- **Split sizes**: Infinite/Configurable (Default episode length: 1000 steps)\r\n\r\n### Task\r\n- **Type**: `multi-turn`\r\n- **Parser**: `Custom JSON Parser` (Extracts structured actions: `WRITE`, `UPDATE`, `FORGET`, `NO_OP`)\r\n- **Rubric overview**: Evaluation is based on the **Hamming Distance** between the agent's notebook and the ground-truth Oracle state. 
Rewards penalize missing keys, hallucinated keys, and stale data (garbage collection failures).\r\n\r\n### Quickstart\r\nRun an evaluation with default settings:\r\n\r\n```bash\r\nprime eval run shadow-state\r\n```\r\n\r\nConfigure model and sampling:\r\n\r\n```bash\r\nprime eval run shadow-state \\\r\n  -m gpt-4o-mini \\\r\n  -n 20 -r 3 -t 1024 -T 0.7 \\\r\n  -a '{\"max_chars\": 500, \"noise_ratio\": 0.95, \"seed\": 42}'\r\n```\r\n\r\nNotes:\r\n\r\n* Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.\r\n\r\n### Environment Arguments\r\n\r\nThe environment accepts the following arguments, passed via `-a` / `--env-args`:\r\n\r\n| Arg | Type | Default | Description |\r\n| --- | --- | --- | --- |\r\n| `max_chars` | int | `1536` | The hard limit on characters in the notebook (visual text length). |\r\n| `noise_ratio` | float | `0.75` | Probability (0.0-1.0) that a log entry is irrelevant noise. |\r\n| `horizon` | int | `1000` | Total number of log steps in one episode. |\r\n| `seed` | int | `None` | Random seed for log generation. Set this for deterministic reproducibility. |\r\n\r\n### Metrics\r\n\r\nThe rubric emits the following metrics:\r\n\r\n| Metric | Meaning |\r\n| --- | --- |\r\n| `reward` | Normalized score (0.0 to 1.0) based on state accuracy per step. |\r\n| `hamming_distance` | Raw count of state discrepancies (Missing Keys + Hallucinated Keys). |\r\n| `value_fidelity` | Percentage of matching values for keys that are correctly present. |\r\n| `staleness_ratio` | Percentage of the notebook occupied by expired/dead data (Garbage Collection failure). |\r\n","encoding":"utf-8","truncated":false,"total_bytes":2566},"status":null}