{"data":{"kind":"file","path":"README.md","version_id":"p0l8u1x77i6lr4s2p0p1tk64","entry":{"name":"README.md","path":"README.md","is_directory":false,"size":2390,"modified_at":"2026-06-20T20:20:57.802000","content_hash":"6d72396ed139907a630ccc7b8682c3e183f17eb88274bd2da304b7ee86ec63b6"},"entries":[],"content":"# supersede\n\n**Train and evaluate agents to use the *current* fact, not the *stale* one.**\n\nA bounded-memory environment over multi-session interactions: the agent sees one\nsession at a time and maintains a capped notes memory (it never re-sees raw\nsessions), then must answer a question using the current value of a fact that\nwas updated along the way.\n\n## The failure it targets\n\nOn LongMemEval's `knowledge-update` questions, giving an agent bounded memory\ninstead of full context drops supersession accuracy sharply — and the gap\nsurvives on the frontier model:\n\n| Model | Full-context | Bounded memory |\n| --- | --- | --- |\n| gpt-4.1-mini | 82% | 63% |\n| gpt-4.1 | 91% | 64% |\n| gpt-5.4 | 92% | **77%** |\n\nEven gpt-5.4 loses 15 points (paired McNemar p=0.0033) and fails ~23% of\nsupersession questions under bounded memory, while full-context saturates near\n92%. The bottleneck is memory maintenance, not comprehension. (Details:\n`docs/findings/` in the repo.)\n\n## Usage\n\n```bash\nprime env install supersede\n# bounded memory (the failure regime)\nprime eval run supersede -m openai/gpt-4.1-mini -a '{\"max_examples\": 78}'\n# full-context upper bound (for the gap)\nprime eval run supersede -m openai/gpt-4.1-mini -a '{\"full_context\": true}'\n```\n\nThe environment auto-downloads the LongMemEval knowledge-update data\n(MIT license) on first run. Arguments to `load_environment`:\n\n| arg | default | meaning |\n| --- | --- | --- |\n| `question_type` | `knowledge-update` | LongMemEval subset |\n| `max_examples` | `None` | cap on tasks |\n| `budget` | `300` | character cap on the agent's notes memory (bounded mode) |\n| `full_context` | `False` | upper-bound mode: all sessions in context, single turn |\n\n## Reward\n\n- `answered_current` (+1): the final answer conveys the current/gold value\n  (programmatic, ungameable matcher; no API needed).\n- `stale_penalty` (-1): the answer asserts a known superseded value — active\n  only when the task ships `stale_values` (synthetic timelines; LongMemEval is\n  gold-only).\n\n## Status\n\nValidated end-to-end under `verifiers` 0.1.14 against OpenAI: all 78\nknowledge-update rollouts terminate cleanly and the environment reports\n**57.7%** accuracy for gpt-4.1-mini (programmatic matcher), consistent with the\noffline harness's 63% (LLM judge). The remaining step is the Hub push\n(`prime env push`, which authenticates under your Prime Intellect account).\n","encoding":"utf-8","truncated":false,"total_bytes":2390},"status":null}