Part II · The Engine · 5 of 12MEAG

The Memory Stack

Agent memory is not one problem but a stack — working, episodic, semantic, procedural. Most production failures are layer-confusion: solving one layer's problem with another layer's tool.

Figure 1: The stack and its signature failure. Four memory layers — working (top), episodic, semantic, procedural (bottom) — each with the right tool for its own need. The orange fault line is the bug this article is about: a need that belongs to the top layer wired straight down into the bottom one. Every production memory incident is a version of this picture.
Anchor papers
Jia et al. (2026) · AI HippocampusPark et al. (2023) · Generative AgentsPacker et al. (2023) · MemGPTBehrouz et al. (2025) · Titans
26 min read5,715 words↳ Reading order: ← 4 · 6 →

[← 4] Agents That Learn on the Job took apart what writes into an agent's memory — the experience stream, online updates, and the skill/rule store Σ. This article takes apart the other half: where those memories live, how they are indexed and retrieved, and what the storage hierarchy actually looks like, from the context window out to external stores and back into the weights.

§1 · Four memories, one confused roadmap

Four incidents, all real-shaped, all the same bug.

A support agent re-read a 60-page policy PDF into its context on every single turn. It was never wrong about policy — it was slow, it was expensive, and once the window filled, it started missing the one freshly-retrieved paragraph that actually answered the question. The team had solved a working-memory problem — what deserves to be in front of the model right now — with the bluntest possible tool: keep everything.

A sales assistant "remembered" each customer by writing facts to a vector store. Ask it "what did we agree last Tuesday, and why did the customer push back?" and it returned three disconnected facts with no thread between them. It had an episodic need — events, in order, with the relations between them — and answered it with a semantic tool: a flat bag of facts that had thrown the timeline away.

A team fine-tuned a model nightly on each user's documents so it would "know" their preferences. It worked — until a preference changed, and there was no way to update one fact without retraining, and no way to point at why the model believed what it believed. They had a semantic need — stable facts that must stay updatable and attributable — and answered it with a procedural tool: weights.

An ops agent solved the same multi-step incident-triage workflow from scratch every session, because its "memory" was a chat transcript, not a procedure. Nothing it learned to do survived the session boundary. A procedural need — a reusable skill — answered with an episodic tool: a log you can read but not run.

These are not four flavors of one product. They are four different problems. Cognitive science has named them for decades — working memory (what you are holding in mind now), episodic memory (what happened, and when), semantic memory (what is true, in general), and procedural memory (how to do things). The 2026 survey closest to our problem, The AI Hippocampus (Jia et al., 2026), maps the agent-memory literature against human memory and sorts it into three paradigms: implicit memory baked into the weights, explicit memory held in retrievable stores, and agentic memory that acts on itself. The cognitive layers give us the labels; the survey gives us the human-memory mirror to check our designs against (see Figure 2).

It is worth naming the trap directly, because it is the most popular one in the field: "RAG or fine-tuning?" is not the question — it is the bug. It forces a stack of four distinct problems through a single binary, and the binary always loses, because it offers one tool where the job needs four. Almost every memory architecture worth studying is really an answer to one layer; the engineering skill is knowing which.

The architecture nearly every agent-memory system descends from makes the layering explicit. Generative Agents (Park et al., 2023) gave each character a memory stream — an append-only episodic log — plus a retrieval function scored by recency, importance, and relevance, and a reflection step that periodically reads the stream and writes higher-level inferences back as semantic facts. Episodic in, semantic out, retrieval in between. The rest of this article walks the stack one layer at a time, adds the process that moves memory between layers, follows it down into the weights, and ends where production actually lives: the constraint that decides which of these you are allowed to ship.

WORKING EPISODIC SEMANTIC PROCEDURAL what's in mind now what happened what's true how to do it context window event log indexed store skill store Σ ↔ short-term ↔ episodic ↔ semantic ↔ motor / habit volatile · recomputed each turn · cheap to change persistent · written once · costly to change
Figure 2: The memory-stack spectrum — the spine of this article. The four cognitive layers run left to right from volatile working memory (the context window) to persistent procedural memory (the skill store Σ), with the human-memory analogue under each (Jia et al., 2026; the memory-stream-to-reflection pipeline of Park et al., 2023). Reading a design means locating it on this bar; most failures come from reaching for a tool on the wrong segment.
Key Takeaway 1

Agent memory is a four-layer stack — working, episodic, semantic, procedural — and the layers are different problems, not different products. The most common production failure is layer-confusion: serving one layer's need with another layer's tool. "RAG vs. fine-tuning" is the framing that guarantees it.

§2 · Working memory is context engineering

The context window is RAM, not disk. Every token in it is recomputed and re-paid for on every turn, so the working-memory question is never "how much can I fit." It is "what earns its place this turn." Treat the window as cheap permanent storage and you get the support agent from §1: correct, slow, and eventually blind to the one new fact that mattered.

The first useful move is to stop measuring context management on one axis at a time. The Efficiency Frontier (Shen et al., 2026) points out that retrieval and compression methods are almost always reported on performance or on efficiency in isolation, which hides the only thing a practitioner cares about — the trade between them. Their framework folds both into a single cost–performance frontier: every context-management strategy is a point, the good ones sit on the frontier, and your budget picks the point, not your taste (see Figure 3). It converts a religious argument ("retrieve!" vs. "just use long context!") into an engineering one.

The second move is to notice that even an obviously good idea — drop stale stuff from the window — is regime-dependent. Masking Stale Observations (Zhang et al., 2026) studied long-horizon search agents that accumulate retrieved content across many tool calls, and swept masking across backbones from 4B to 284B parameters and three retrievers. The accuracy gain from masking traces an asymmetric inverted-U against the model's no-management accuracy: a plateau under weak retrievers (nothing worth keeping anyway), a peak when a strong retriever meets a mid-capacity model, and a sharp collapse once the model is good enough to filter for itself. Mechanistically, masking is a token-for-turn trade: it removes observations the model had largely stopped attending to and pages the agent rarely re-opens, and the freed tokens only help when they buy useful new turns. The lesson for the build: pruning is neither free nor universal, and "add a memory-compaction step" can quietly make a saturated model worse.

None of this is new in spirit. MemGPT (Packer et al., 2023) named the right mental model years earlier: virtual context management, borrowed straight from operating systems. The window is physical RAM; everything else is disk; an explicit controller pages information between an in-context working set and external stores so the model gets the illusion of unbounded memory while only ever paying for a bounded window. Working memory, done well, is a paging policy — and the next three layers are what sits out on disk.

Efficiency frontier (Shen et al., 2026) cost per turn → task performance → full context retrieval compression masking Masking gain (Zhang et al., 2026) base accuracy (no mgmt) → accuracy gain → plateau peak collapse
Figure 3: Working memory as an engineering trade. Left — every context-management strategy is a point on a cost–performance frontier; your budget selects the point (Shen et al., 2026). Right — the gain from masking stale observations is an inverted-U over a 4B–284B sweep: it peaks when a strong retriever meets a mid-capacity model and collapses once the model can filter for itself (Zhang et al., 2026). "Compact the context" is not a free win.
Key Takeaway 2

Working memory is a paging policy over the context window (MemGPT), and its decisions live on a cost–performance frontier (Shen et al.), not on a single axis. Even pruning stale context only helps in a regime — strong retriever, mid-capacity model — and backfires when the model is already saturated (Zhang et al.). Engineer the window; don't hoard it.

§3 · Episodic memory: events, not facts

Episodic memory is the layer the sales assistant got wrong. It answers what happened, when, in what order, and how the events relate — and a flat fact store throws exactly that structure away. The episodic design space is wide, and three points on it cover most of what you need.

When relations matter, store the graph. StructMem (Yao et al., 2026) starts from a real trade-off: flat memory is efficient but cannot model relations; graph memory models relations but is expensive and fragile to build. Its answer is a structure-enriched hierarchical memory that preserves event-level bindings, induces cross-event connections, and runs a periodic semantic consolidation pass. On the LoCoMo long-conversation benchmark it improves temporal-reasoning and multi-hop question answering while reducing token usage, API calls, and runtime — the relational structure pays for itself rather than costing extra (see Figure 4).

When you don't need a graph, a tiny online state will do. δ-mem (Lei et al., 2026) augments a frozen full-attention backbone with a compact associative state — a fixed-size matrix updated by a delta rule, read out as low-rank corrections to the backbone's attention. With an 8×8 online state it lifts the average score to 1.10× the frozen backbone and 1.15× the strongest competing memory baseline, reaching 1.31× on MemoryAgentBench and 1.20× on LoCoMo — with no fine-tuning, no backbone swap, and no context extension. Episodic recall does not have to be a database; it can be sixty-four numbers wired into attention.

And you rarely need to keep everything. The continual-learning literature settled this years ago. On Tiny Episodic Memories (Chaudhry et al., 2019) showed that a very small replay buffer, replayed while learning new tasks, goes a surprisingly long way toward retaining the old ones. Agent memory keeps re-discovering the same result: the marginal value of the thousandth stored episode is near zero, and a well-chosen handful captures most of the benefit. The simplest version that works in production is Reflexion (Shinn et al., 2023): after a failed attempt the agent writes a few sentences of verbal self-feedback into an episodic buffer and reads them back on the next try. "Last time this failed because the auth token was stale" is an episodic memory, and often the only one you need.

event graph relations + temporal order StructMem · LoCoMo, fewer tokens compact online state 8 × 8 fixed-size, coupled to attention δ-mem · 1.31× MemoryAgentBench tiny replay buffer a few episodes, replayed Tiny Episodic · Reflexion notes
Figure 4: Episodic memory is itself a design space. Left to right, increasing structure and decreasing footprint: an event graph that keeps relations and temporal order (StructMem, Yao et al., 2026); a fixed 8×8 online state wired into attention (δ-mem, Lei et al., 2026); and a tiny replay buffer of a few episodes — the budget-replay result from continual learning (Chaudhry et al., 2019) and Reflexion's verbal notes (Shinn et al., 2023). Pick by how much relational structure the task actually needs.
Key Takeaway 3

Episodic memory stores events with their order and relations — the thing a flat fact store destroys. Use a graph when relations carry the task (StructMem), a tiny online state when they don't (δ-mem's 8×8), and remember the continual-learning result: a small replayed buffer captures most of the value (Chaudhry et al.; Reflexion). Match structure to need, not to ambition.

§4 · Semantic and procedural memory: facts and skills

The two persistent layers are different in kind. Semantic memory holds stable facts and rules — "the customer is in the EU," "deploys freeze on Fridays." Procedural memory holds skills: the ability to do a thing without re-deriving it. Confuse them and you get the §1 incidents — facts baked into weights you can't update, or skills trapped in logs you can't run.

Facts are made, not just stored. The bridge from episodic to semantic is reflection: Generative Agents (Park et al., 2023) periodically reads its event stream and writes higher-level inferences — "I keep getting blocked on auth" — back as durable facts. That is how a log becomes knowledge. For retrieval over those facts, HippoRAG (Gutiérrez et al., 2024) is the semantic counterpart to this article's anchor: inspired by the hippocampal-indexing theory, it orchestrates an LLM, a knowledge graph, and the Personalized PageRank algorithm to integrate new knowledge the way the neocortex and hippocampus divide the labor in the brain, outperforming standard retrieval on multi-hop question answering. Semantic memory is not a flat vector store; it is an index over a graph of facts.

The frontier is memory that rewires itself. Most semantic stores are static: a fixed representation behind a fixed retrieval pipeline, which is brittle exactly when feedback and task variation keep changing what should be remembered. Rethinking Memory as Continuously Evolving Connectivity (Dong et al., 2026) proposes FluxMem, which models memory as a heterogeneous graph and refines its topology in three stages — initial connection formation, feedback-driven refinement, and long-term consolidation — repairing missing links and pruning interference as the agent runs (see Figure 5). This is the shift from memory you append to toward memory that restructures itself, and it is where the research edge of the explicit-memory layer currently sits.

Skills live in Σ. Reusing the series' notation, Σ is the skill/rule store — system-space procedural memory, the place an agent keeps things it has learned to do. Voyager (Wang et al., 2023) built the canonical version: an ever-growing library of executable skills that the agent writes, retrieves, and composes. What writes into Σ — capture, replay, distillation — is the subject of B4; here the point is structural. Procedural memory is code and rules you retrieve and run, not facts you recite, and trying to store a skill as a transcript is the ops-agent bug from §1.

[← 4] Agents That Learn on the Job dissects Σ as a learning substrate — how episodes become reusable skills via capture, replay, and rule distillation. This section only locates Σ in the stack; B4 is where it is built.

index PageRank formed pruned consolidated retrieve over a graph, not a flat list
Figure 5: Semantic memory as an evolving graph. Reflection turns episodes into facts (Park et al., 2023); retrieval runs as a hippocampal index — a PageRank walk over a knowledge graph (HippoRAG, Gutiérrez et al., 2024). At the frontier the graph's topology itself changes as the agent runs — links formed, pruned, consolidated (FluxMem, Dong et al., 2026): memory that restructures itself rather than memory you only append to.
Key Takeaway 4

Semantic memory is updatable, attributable facts — best held as an index over a graph (HippoRAG), grown from episodes by reflection (Generative Agents), and increasingly able to rewire its own topology (FluxMem). Procedural memory is the skill store Σ: things you retrieve and run (Voyager [← 4]). A fact baked into weights and a skill stuck in a log are the same mistake from opposite ends.

§5 · Consolidation: the process the stack forgets

The four layers are nouns. What moves a memory between them is a verb — and most agent stacks don't have it. A human turns a day of episodes into durable facts and smoother skills during sleep; an agent, by default, wakes up every morning exactly as it was, because nothing ever migrates from the episodic log into the semantic or procedural layers. That missing verb is consolidation, and 2026 is when it started getting engineered directly.

Language Models Need Sleep (McLeish et al., 2026) builds it as an explicit phase. The model periodically converts recent context into persistent fast weights before clearing its key–value cache: during "sleep" it runs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space-model blocks through a learned local rule. This deliberately shifts compute into the sleep phase so that wake-time latency stays flat — you pay for consolidation off the critical path. Increasing the sleep duration N improves performance, with the largest gains on examples that require deeper reasoning, and on tasks where a plain Transformer and SSM–attention hybrids fail outright (cellular automata, multi-hop graph retrieval, math), sleep is what closes the gap. N here is a knob you turn, not a constant (see Figure 6).

The idea is old and biological. Brain-Inspired Replay (van de Ven et al., 2020) implements consolidation as generative replay — the network replays internally-generated representations of past experience instead of stored data — and reaches state-of-the-art on class-incremental CIFAR-100 without storing any examples. The brain protects memories by reactivating them; an agent can protect and promote its memories the same way, by replaying them offline and writing the result one layer down. The practical reading: consolidation is a scheduled process, a wake/sleep loop, not another store you buy. If nothing in your system migrates episodic experience into durable facts and skills, your agent will re-learn the same things forever — the amnesia bill of [← 4].

WAKE accumulate context KV cache grows ↑ SLEEP N offline passes context → fast weights clear KV cache consolidate (off critical path) wake latency preserved larger N → better, esp. deeper reasoning
Figure 6: Consolidation as a wake/sleep loop. The agent accumulates context while awake, then during sleep runs N offline passes that fold that context into persistent fast weights and clear the cache — keeping wake-time latency flat (McLeish et al., 2026); larger N helps most on deeper reasoning. Generative replay is the biological precedent (van de Ven et al., 2020). Without this verb, nothing migrates between layers and the agent re-learns daily.
Key Takeaway 5

Consolidation is the missing verb: a scheduled wake/sleep process that moves memory from episodic into semantic and procedural off the critical path (McLeish et al.; biological precedent in van de Ven et al.). Latency is paid during sleep, not during the user's turn. Buy stores all you like — without a consolidation loop, the layers never talk and the agent stays amnesic.

§6 · The architectural pole: memory inside the model

Everything so far put memory outside the model — stores you retrieve from. The opposite pole puts memory inside, as a runtime primitive of the architecture itself. This is where the stack folds back onto the weights, and it is the most active research front of all.

Titans (Behrouz et al., 2025) adds a neural long-term memory module that learns to memorize historical context at test time. The framing is clean and worth keeping: attention is short-term memory — precise but bounded by the window — and the neural memory module is long-term — persistent and compressed, holding what won't fit. Memory stops being an external database call and becomes a component of the forward pass. TTT layers (Sun et al., 2024) push the same instinct further: make the hidden state itself a small machine-learning model, and make the update rule a step of self-supervised learning, so the state literally learns at test time. At 125M–1.3B parameters, TTT keeps reducing perplexity as context grows, where Mamba plateaus after 16k tokens — the expressivity of the memory is the expressivity of its hidden state (see Figure 7).

The hard part of in-model memory is not what to forget but how to edit a compressed state without scrambling what's already there. Gated DeltaNet-2 (Hatamizadeh et al., 2026) makes the crucial distinction: earlier delta-rule models tie erasing old content and writing new content to a single scalar gate, which forces one knob to do two jobs. Gated DeltaNet-2 decouples them into a channel-wise erase gate and a channel-wise write gate, so the model can clear an association and commit a new one independently. Erase ≠ write — and that is the same machinery, viewed from the memory side, that [← A6] examined from the plasticity side: A6 showed architecture can solve forgetting with a learned decay gate; this is how that decayable state decides what to overwrite.

At the far end of the pole, you can skip retrieval entirely by distilling a whole document into a parameter patch — a LoRA adapter generated in a single forward pass — so the model answers from weights without ever re-reading the context. The published Continual Intelligence series treats this inference-time weight-editing thread [← A6]; for our purposes it marks the boundary of the stack, where "explicit store" and "implicit weights" meet.

[→ C2] Whether long-term memory ultimately belongs in weights and runtime rather than in external stores is a paradigm bet, not a settled engineering choice. The Long Bet essay [→ C2] takes up that wager — Titans, test-time memory, and neural-memory architectures — on the 3/5/10-year horizon; here it is one pole of a design space, not a prediction.

in the context in a fixed state in the weights external · explicit recurrent · test-time implicit · distilled Titans · TTT layers (learn at test time) Doc→LoRA patch [← A6] §2–§4 stores Gated DeltaNet-2: erase ≠ write erase gate write gate two channel-wise knobs, not one scalar
Figure 7: The architectural pole. Memory can live in the context, in a fixed recurrent state learned at test time (Titans, Behrouz et al., 2025; TTT layers, Sun et al., 2024), or distilled into the weights [← A6]. Editing a compressed state cleanly requires separating the two jobs one scalar gate used to do — a channel-wise erase gate and a channel-wise write gate (Gated DeltaNet-2, Hatamizadeh et al., 2026). Erase ≠ write.
Key Takeaway 6

The other end of the stack puts memory inside the model: a long-term neural memory beside short-term attention (Titans), a hidden state that learns at test time (TTT), and compressed state you can edit only by separating erase from write (Gated DeltaNet-2 [← A6]). Powerful, but opaque — and that opacity is exactly what the next section will make disqualifying.

§7 · The audit constraint: memory that can testify

Now the constraint the research mostly ignores, and the one that actually decides production. In regulated domains — underwriting, claims adjudication, clinical review, tax examination — the most capable memory architecture on the benchmark is routinely the one nobody ships. The reason is not conservatism. It is that these deployments answer to someone who can ask "why."

Srinivasan's (2026) practitioner analysis of enterprise agents names the hidden requirement precisely. Regulated decisioning is load-bearing on four systems properties: deterministic replay (a denied applicant can be re-scored and the same decision justified), auditable rationale (a regulator can inspect the reasoning trail), multi-tenant isolation (one applicant's data cannot leak into another's decision), and statelessness for horizontal scale (thousands of concurrent decisions cannot bottleneck on one shared mutable memory). Sophisticated stateful memory violates at least one of these by construction, and the margin of violation compounds as the deployment matures. This is why regulated stacks run "weaker" retrieval pipelines on purpose: a vector store happens to satisfy all four as a side-effect of its simplicity (see Figure 8).

The constructive proposal is Deterministic Projection Memory (DPM): treat the trajectory as an append-only immutable event log, and at decision time compute a single task-conditioned projection — a structured extraction of facts, reasoning, and compliance notes within a budget — at temperature zero. The log is the single source of truth; the projection is pure, so replaying the same log under the same model version reproduces the same memory view (up to residual API nondeterminism). In a ten-case study across three memory budgets, DPM matches stateful summarization memory at generous budgets and pulls ahead only when the budget binds: at a 20× compression ratio, it improves factual precision by +0.52 (Cohen's h = 1.17) and reasoning coherence by +0.53 (h = 1.13). It is also 7–15× faster, because it makes one decision-time call instead of N consolidation calls across the trajectory — and that same asymmetry is the audit story in miniature: DPM exposes a single nondeterministic call, while summarization exposes N compounding ones.

This reframes the whole article's question. It is not "what is the most capable memory" but "what is the best memory that can testify" — that you can replay deterministically and explain to someone with the authority to overrule you. Read back through the stack, the audit column quietly disqualifies the §6 pole for these use cases: a memory living in the weights is, almost by definition, a memory that cannot point at why. The most powerful layer and the most deployable layer are often not the same layer.

[→ 11] Observability & Ops takes the replay-and-explain requirement all the way down to event-sourced architecture — how to build the append-only log, the deterministic projection, and the audit trail that lets an agent reconstruct and justify any decision after the fact.

Memory designDeterministic replayAuditable rationaleMulti-tenant isolationStateless at scaleCan it testify?
Retrieval pipeline
vector / graph store
✅ deterministic given index ✅ shows what was retrieved ✅ per-tenant index ✅ stateless lookups ✅ the reason enterprise picks it
Summarization memory
stateful, path-dependent
❌ N compounding calls ⚠️ summary hides what was dropped ⚠️ shared running state ❌ mutable state per session ❌ path-dependent
In-model memory
Titans / TTT-style (§6)
❌ opaque hidden state ❌ rationale is in the weights ⚠️ shared parameters ❌ stateful by design ❌ cannot point at "why"
Deterministic Projection Memory
append-only log + 1 projection
✅ pure projection ✅ structured + compliance notes ✅ per-log ✅ one decision-time call ✅ designed for it
Figure 8: The audit-constraint table — which memory designs can testify. The four load-bearing properties of regulated decisioning, and how each design satisfies or violates them (Srinivasan, 2026). Retrieval pipelines pass as a side-effect of simplicity; stateful and in-model memory fail on replay and explainability; Deterministic Projection Memory is engineered to pass while matching summarization quality (and beating it at a 20× budget: +0.52 precision, +0.53 coherence; 7–15× faster). Capability and deployability are different columns.
Key Takeaway 7

In regulated deployment, memory must testify: deterministic replay, auditable rationale, multi-tenant isolation, statelessness (Srinivasan, 2026). Stateful and in-model memory violate these by construction, which is why enterprises ship retrieval on purpose. Deterministic Projection Memory — an append-only log plus one pure projection — keeps the audit properties and matches stateful quality, beating it (+0.52 / +0.53) only when the budget binds. Ask "what can I replay and explain," not "what scores highest."

§8 · Choosing your stack

Every incident in §1 was the same mistake: reaching across the stack for a tool from the wrong row. The fix is not a better memory product; it is a habit of locating the need first and then picking the layer that owns it. Figure 9 is that habit as a table — each memory need mapped to its layer and a concrete tool, scored on the four things that decide a production design: persistence, audit, latency, and cost. It is meant to be useful read on its own.

The needLayerPersistenceAuditLatencyCostReach for
Hold what's relevant this turn Working turn full (it's visible) instant $$ recomputed context engineering — paging, masking, compression
Recall events across a session Episodic session high low $ event graph (relations) or compact online state (δ-mem)
Stable, updatable, citable user facts Semantic long high (citable) low $ indexed graph store (HippoRAG) + reflection
Reuse a multi-step skill Procedural (Σ) long medium low $ executable skill library [← 4] (Voyager)
Decide under audit (regulated) Audit-first append-only log must testify low $ Deterministic Projection Memory / retrieval (§7)
Fold long context into the model Architectural in-weights low — can't testify low after warm-up $$ upfront in-model memory (Titans / TTT) — only where audit is moot
Figure 9: The deployable artifact — a decision table over persistence × audit × latency × cost. Find your need in the left column, read off the layer that owns it, and reach for the tool on the right; the four middle columns tell you what you are trading. The §1 incidents are what happens when you read off the wrong row. Every entry traces to a paper in this article.

Three rules turn the table into a default policy:

  1. Start at working memory and descend only when the need outlives the boundary above it. Turn → session → user → deployment. Each step down costs persistence machinery; don't pay for it until the need actually crosses the boundary.
  2. If a human might ask "why," you are in the audit column. Choose the most replayable tool that meets the need, not the most capable one (§7). The benchmark winner is frequently disqualified before it reaches staging.
  3. Consolidation is a process you schedule, not a store you buy. If nothing migrates episodic experience into semantic facts and procedural skills (§5), the stack never compounds and the agent re-learns nightly — the amnesia bill of [← 4].
Key Takeaway 8 — the artifact

Keep Figure 9 next to the codebase. Memory failures are almost never "we need a better store"; they are "we solved a turn-scoped need with a deployment-scoped tool," or the reverse. Name the need, find its row, pay only for the persistence and audit that row demands — and schedule the consolidation that lets the rows feed each other.

Final Key Takeaways

  1. Memory is a stack, not a setting. Working, episodic, semantic, procedural are four problems; "RAG vs. fine-tuning" is the framing that hides them.
  2. The map, the economics, and the frontier now exist. The human-memory mapping organizes the layers (Jia et al.), the efficiency frontier prices working memory (Shen et al.), and self-restructuring connectivity is the research edge (Dong et al.).
  3. Consolidation is the missing verb. Without a wake/sleep loop, the layers never feed each other (McLeish et al.; van de Ven et al.).
  4. Auditability is the real design axis. The best memory you can ship is the best one that can replay and explain itself (Srinivasan, 2026).

What comes next

Memory decides what an agent knows. The next article is about what it can do. B6 — Tools, Skills, and the Action Interface measures the knowing–doing gap directly and quantifies the "MCP tax": the hidden per-turn overhead that stateless, eager schema injection imposes between an agent's intent and its action. The stack we just built is the input to that interface — semantic facts and procedural skills are only worth storing if the agent can act on them cleanly. And the deeper wager underneath §6 — whether durable memory ultimately belongs in the runtime and the weights rather than in external stores — is the subject of [→ C2], where it is treated not as an engineering choice but as a bet with a horizon.

[→ 6] Tools, Skills, and the Action Interface — from what the agent remembers to what it can do, and the cost of wiring the two together. · [→ C2] the runtime-as-memory paradigm bet (Titans, neural-memory architectures), on the 3/5/10-year horizon.

References

  1. Jia, Z., Li, J., Kang, Y., Wang, Y., Wu, T., Wang, Q., Qi, S., Liang, Y., He, D., Zheng, Z., & Zhu, S.-C. (2026). The AI Hippocampus: How Far Are We From Human Memory? Preprint (OpenReview). arXiv:2601.09113.
  2. Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th ACM Symposium on User Interface Software and Technology (UIST '23). arXiv:2304.03442.
  3. Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint. arXiv:2310.08560.
  4. Shen, B., Jin, L., Cai, H., Hu, L., & Xin, Y. (2026). The Efficiency Frontier: A Unified Framework for Cost–Performance Optimization in LLM Context Management. arXiv preprint. arXiv:2605.23071.
  5. Zhang, H., Xu, Q., Li, Z., Zhang, L., Jiang, P., Zhang, Y., & McAuley, J. (2026). Masking Stale Observations Helps Search Agents — Until It Doesn't: A Regime Map and Its Mechanism. arXiv preprint. arXiv:2606.00408.
  6. Yao, Y., Zhu, Y., Du, L., & Deng, S. (2026). StructMem: Structured Memory for Long-Horizon Behavior in LLMs. arXiv preprint. arXiv:2604.21748.
  7. Lei, J., Zhang, D., Li, J., Wang, W., Fan, K., Liu, X., Ma, X., Chen, B., & Poria, S. (2026). δ-mem: Efficient Online Memory for Large Language Models. arXiv preprint. arXiv:2605.12357.
  8. Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H. S., & Ranzato, M. (2019). On Tiny Episodic Memories in Continual Learning. arXiv preprint. arXiv:1902.10486.
  9. Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2303.11366.
  10. Dong, B., Zhu, H., Huang, R., Yu, G., Wei, Y., Zheng, G., Xiong, F., Wang, H., Chen, H., & Zhang, N. (2026). Rethinking Memory as Continuously Evolving Connectivity. arXiv preprint. arXiv:2605.28773.
  11. Gutiérrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. Advances in Neural Information Processing Systems 37 (NeurIPS 2024). arXiv:2405.14831.
  12. Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint. arXiv:2305.16291.
  13. McLeish, S., Goldstein, T., & Fanti, G. (2026). Language Models Need Sleep. arXiv preprint. arXiv:2605.26099.
  14. van de Ven, G. M., Siegelmann, H. T., & Tolias, A. S. (2020). Brain-Inspired Replay for Continual Learning with Artificial Neural Networks. Nature Communications, 11, 4069.
  15. Behrouz, A., Zhong, P., & Mirrokni, V. (2025). Titans: Learning to Memorize at Test Time. arXiv preprint. arXiv:2501.00663.
  16. Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., & Guestrin, C. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv preprint. arXiv:2407.04620.
  17. Hatamizadeh, A., Choi, Y., & Kautz, J. (2026). Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention. arXiv preprint. arXiv:2605.22791.
  18. Srinivasan, V. (2026). Stateless Decision Memory for Enterprise AI Agents. Practitioner analysis (arXiv preprint). arXiv:2604.20158.