KG-2.20 — Mementified Context Management¶
Assimilated from Memento: Teaching LLMs to Manage Their Own Context (Kontonis et al., Microsoft Research AI Frontiers, 2026). Extends KG-2.1 (Tiered Memory & Context).
What it is¶
A memento is not a human summary — it is a lemma: a terse, information-dense compression of a completed reasoning/conversation block that preserves exact formulas, key intermediate values, commands and their outcomes, and the current execution state, so the model can reason forward from the memento alone and the raw block can be evicted from the live context. Running this on a multi-turn agent produces the paper's sawtooth context profile: tokens climb while a block is in progress, then drop sharply when the block is compressed to a memento and evicted.
The paper teaches this skill into a model via SFT + a vLLM KV-cache fork (≈2–2.5× peak KV reduction). agent-utilities runs hosted/API models, so we adopt the pattern at the orchestration layer — which the paper itself flags as the prime next application: "Terminal and CLI agents are naturally multi-turn, where each action-observation cycle is laid out as a natural block."
The five pieces (MEM-0…MEM-4)¶
| Piece | Where | |
|---|---|---|
| MEM-0 | Canonical memento_compressor.py (strangled out of the 1.8k-line agent_context.py); fixes the previously silently-broken from .memento_compressor import … in observer.py/memory_engine.py that left the memento write path dead |
knowledge_graph/memory/memento_compressor.py |
| MEM-1 | Live MementoCompaction capability — on before_model_request, when the running history exceeds budget, segment → compress completed blocks → evict raw blocks, keeping mementos + current block. Default ON in agent/factory.py (also covers the RLM multi-turn repl loop via its factory agent) |
capabilities/memento.py |
| MEM-2 | Judge-refine loop — compressor→judge→recompress on a six-dimension rubric (formulas-verbatim, values, methods, validation, no-hallucination, result-first), τ=8/10, ≤2 iters. The paper measured single-shot mementos at 28% rubric pass vs 92% after two judge passes |
memento_compressor.compress_to_memento |
| MEM-3 | Semantic-boundary segmentation — boundary_score (never cut mid-derivation; cut at turn / action↔observation boundaries) + segment_into_blocks (min-block floor, no tiny danglers). New memento_blocks ContextCompactor strategy is the LLM-free path |
memento_compressor.segment_into_blocks, agent_context.ContextCompactor |
| MEM-4 | Lossless recoverability — each evicted block is persisted as an EvictedBlock node linked Memento -[:SUMMARIZES]-> EvictedBlock; recover_evicted_block() re-fetches it on demand |
memento_compressor._persist_memento / recover_evicted_block |
Honest limitation (why this is not the paper, end-to-end)¶
The paper's headline result is a dual information stream: because masking happens in-place inside
one forward pass, the memento's KV-cache entries retain implicit information from the block they
replaced — removing that channel costs −15pp (their "restart mode"). We do not control the
inference engine's KV cache, so an orchestration-level memento is restart mode and cannot
reproduce that implicit channel. MEM-4's lossless expand/recover is the substitute (the evicted
block is re-fetchable), not an equivalent. We also do not train models (the SFT curriculum and the
OpenMementos data-gen pipeline are out of scope; noted for any future RLM-role fine-tuning).
Wiring (Wire-First)¶
Entry point → agent/factory.py registers MementoCompaction in agent_capabilities
(memento_compaction=True by default) → pydantic-ai before_model_request hook receives
ModelRequestContext.messages (the list actually sent to the model) → eviction transform. The memento
write path is also reachable from mcp/kg_server.py via observer.observe_transcript. Verified
by check_wiring.py (passed, 0 violations) and a *_live_path test that exercises the capability on
real ModelMessage objects.
Success metrics¶
- Peak context tokens/session −≥40% vs no-compaction on a multi-turn run at ≥95% task-success parity (the live-path test shows −77% on a synthetic 8-cycle trajectory).
- Memento acceptance (rubric ≥8/10) ≥90% within ≤2 judge iterations (paper: 92%).
- 100% of evicted blocks recoverable (lossless pointer present).
Tests¶
tests/knowledge_graph/memory/test_kg_2_20_memento.py (judge-refine, segmentation, lossless recall,
capability live-path, factory-default-ON) + test_memento_compressor.py.