{"data":{"kind":"file","path":"README.md","version_id":"fyteorolqjm6ip6gff6cf09k","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2692,"modified_at":"2026-02-06T18:58:10.495000","content_hash":"659dd4b1f9a71b08e8333e12c40770b16be566f3412961e1493155c2d39527a1"},"entries":[],"content":"# distributed-inventory\r\n\r\n### Overview\r\n- **Environment ID**: `distributed-inventory`\r\n- **Short description**: A multi-turn \"Active Context\" challenge where an agent must maintain a dynamic inventory state (GET/DROP) across a long, noisy stream with total context amnesia between steps.\r\n- **Tags**: memory, context-management, active-reasoning, synthetic, multi-turn\r\n\r\n### Datasets\r\n- **Primary dataset(s)**: Procedurally generated synthetic data.\r\n- **Source links**: N/A (Generated on-the-fly via `NoiseGenerator`)\r\n- **Split sizes**: Infinite/Procedural (Default: 100 episodes)\r\n\r\n### Task\r\n- **Type**: Multi-turn\r\n- **Parser**: `XMLParser` (fields=[\"answer\"])\r\n- **Rubric overview**:\r\n    - **Reward**: Binary (1.0 or -1.0).\r\n    - **Criteria**: The final reported inventory must match the ground truth **set** exactly. Order and capitalization are normalized, but missing or extra items result in failure.\r\n    - **Key Metrics**: `episode_length`, `final_item_count`.\r\n\r\n### Research Context\r\nThis environment is designed to isolate **Active Context Management** capabilities. Unlike standard retrieval tasks:\r\n1.  **Total Amnesia**: The model's context window is effectively cleared after every turn.\r\n2.  **Entropy**: The model must actively rewrite its internal state (the inventory) into its output (\"scratchpad\") every turn to persist it.\r\n3.  
**Dynamic State**: Unlike append-only tasks, the episode includes non-monotonic updates (e.g., `[DROP: key]`) that the model must apply correctly, which checks that it is tracking state rather than blindly copying text.\r\n\r\n### Quickstart\r\nRun an evaluation with default settings:\r\n\r\n```bash\r\nprime eval run distributed-inventory\r\n```\r\n\r\nConfigure difficulty (e.g., longer episodes, more noise, more operations):\r\n\r\n```bash\r\nprime eval run distributed-inventory \\\r\n  -m gpt-4o \\\r\n  -n 20 \\\r\n  -a '{\"n_chunks\": 20, \"n_ops\": 10, \"chunk_size\": 2000}'\r\n```\r\n\r\n### Environment Arguments\r\n\r\n| Arg | Type | Default | Description |\r\n| --- | --- | --- | --- |\r\n| `chunk_size` | int | `1000` | The number of noise words generated per turn. |\r\n| `n_chunks` | int | `10` | The number of turns (memory wipes) in the episode. |\r\n| `n_ops` | int | `5` | The total number of `[GET]` or `[DROP]` operations distributed across the episode. |\r\n| `seed` | int | `42` | Random seed for noise and operation generation. |\r\n| `max_episodes` | int | `100` | The number of unique episodes to generate. |\r\n\r\n### Metrics\r\n\r\n| Metric | Meaning |\r\n| --- | --- |\r\n| `correct_answer` | 1.0 if the final set of items exactly matches ground truth; -1.0 otherwise. |\r\n| `episode_length` | Total number of chunks/turns survived. |\r\n| `final_item_count` | The size of the inventory at the end of the episode (proxy for complexity). |\r\n","encoding":"utf-8","truncated":false,"total_bytes":2692},"status":null}