I — Foundations · 7 min

What Is AI Memory?

AI memory is an external, editable store a model retrieves from across sessions, and it is not one thing: working, episodic, semantic, procedural, and document memory each store, surface, and fail differently.

Close the tab and the model forgets you. Open it again tomorrow and it reintroduces itself, re-asks what you already told it, and offers advice you corrected last week. The fix gets called "memory," which makes it sound like a single feature you switch on. It is not.

A model's own knowledge lives in its weights, frozen at training time. You cannot edit it, and it knows nothing about you. AI memory is the layer you add on top: an external, editable store the model can retrieve from across sessions, plus the machinery that decides what to write down and what to surface at the right moment. The 2020 RAG paper (Lewis et al.) drew this line precisely. Parametric memory is baked into the weights and cannot change without retraining. Non-parametric memory is an external index you can update, delete from, and reorder at any time. Every memory system since is a descendant of that second idea: keep the facts outside the model, fetch them when they help.
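To make the non-parametric idea concrete, here is a minimal sketch in which a plain dict stands in for the external index. The names `ExternalMemory` and `build_prompt` and the prompt layout are hypothetical, chosen for illustration; they are not from the RAG paper or any library.

```python
class ExternalMemory:
    """Toy non-parametric store: a dict standing in for an external index."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        # Overwrites in place; the model's weights are never touched.
        self._facts[key] = value

    def delete(self, key: str) -> None:
        self._facts.pop(key, None)

    def retrieve(self, keys: list[str]) -> list[str]:
        # Return only the facts that exist for the requested keys.
        return [self._facts[k] for k in keys if k in self._facts]


def build_prompt(memory: ExternalMemory, keys: list[str], user_message: str) -> str:
    """Fetch facts at request time and place them in front of the user message."""
    context = "\n".join(f"- {fact}" for fact in memory.retrieve(keys))
    return f"Known about the user:\n{context}\n\nUser: {user_message}"


memory = ExternalMemory()
memory.write("name", "The user's name is Sam.")
memory.write("editor", "The user prefers Vim.")
memory.write("editor", "The user prefers VS Code.")  # edited without retraining
memory.delete("name")                                # deleted without retraining
prompt = build_prompt(memory, ["name", "editor"], "Set up my dev environment.")
print(prompt)
```

Everything that changes lives in the store. The model only ever sees whatever `build_prompt` fetched for this turn.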

So the storage is the easy half. The hard half is the machinery: deciding what is worth keeping, resolving the new fact against the old one, and surfacing only what matters for the current turn without burying it. Get that wrong and "memory" makes the model worse, not better, by feeding it stale or irrelevant context.

#Memory is not one thing

The most common mistake is treating memory as a single bucket you dump everything into. Builders keep rediscovering, independently, that it splits into distinct kinds with different storage, different write triggers, and different ways of breaking. The cleanest shared vocabulary comes from the cognitive-memory split that frameworks like LangMem adopted (semantic, episodic, procedural), extended with the two kinds the community names in practice: working memory and document memory.

Type	Storage primitive	Write trigger	Read path	Decay	Failure mode
Working	The live context window; a KV store, log, or sqlite for task state	Every turn or step	Already in context, or a direct lookup	Cleared when the task ends; evicted when context fills	Overflow; running an extraction model on it is wasteful and slow
Episodic	Append-only log or "memory stream" of past events	After an event or turn	Ranked recall by recency, importance, and relevance	Grows unbounded; consolidated into facts; recency-decayed	The D&D bot forgets the NPC from five sessions ago
Semantic	Fact records in a vector store, or one structured profile	Extracted from conversation, or user-approved	Semantic search, usually hybrid with keyword	Should supersede, not just append	Stale facts surface as still true
Procedural	Behavioural rules, encoded in the system prompt	Prompt optimisation from past trajectories	Always present; not retrieved	Rarely changes; updated by re-tuning the prompt	Agent re-derives how to use knowledge every run
Document	Vector index of document chunks, plus keyword	Ingest and index a corpus, offline	Retrieve top-K, stuff into the prompt	Re-index when the corpus changes	Returns the right-looking chunk, not the right state

Figure 1. The five kinds of AI memory. Each row stores, surfaces, and fails differently, which is why no single store serves all of them.
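The episodic row's read path, ranked recall by recency, importance, and relevance, can be sketched as below. This is a simplified take on the scoring idea from Generative Agents, not that paper's implementation: the decay rate, the equal weighting, the 0-to-1 importance scale, and the word-overlap relevance function (a stand-in for embedding similarity) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    text: str
    hours_ago: float
    importance: float  # assumed 0..1, assigned when the episode is written


def relevance(query: str, text: str) -> float:
    """Jaccard word overlap; a stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def score(episode: Episode, query: str, decay: float = 0.99) -> float:
    """Sum of recency, importance, and relevance with equal (assumed) weights."""
    recency = decay ** episode.hours_ago  # exponential decay per hour
    return recency + episode.importance + relevance(query, episode.text)


def recall(stream: list[Episode], query: str, k: int = 2) -> list[Episode]:
    """Return the k highest-scoring episodes for this query."""
    return sorted(stream, key=lambda ep: score(ep, query), reverse=True)[:k]


stream = [
    Episode("the party met the innkeeper mara in oakvale", hours_ago=500, importance=0.9),
    Episode("the party bought rations", hours_ago=48, importance=0.1),
    Episode("the party argued about loot", hours_ago=24, importance=0.2),
]
best = recall(stream, "who is the innkeeper mara", k=1)[0]
print(best.text)
```

In this toy stream the old NPC encounter still wins because importance and relevance outweigh its near-zero recency. Drop the importance term and the most recent chatter would outrank it, which is the forgetting failure the table describes.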

A few of these deserve a closer look, because the differences are where systems go wrong.

Working memory is the live scratchpad: the current task, tool outputs, variables in flight. Its natural home is the context window itself, and for anything that spills over, plain lossless storage (a key-value store, an append-only log, a sqlite file) beats anything clever. One builder benchmarked memory frameworks against a plain long-context baseline and reported that they cost 14 to 77 times more and were roughly 30% less accurate at recall, because they ran a model to extract and normalise facts on every single message. That cost is the right shape for durable preferences and the wrong shape for execution state, which just needs to be stored and read back verbatim.
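A minimal sketch of that "store it and read it back" approach, using Python's built-in `sqlite3` and `json` modules. The `TaskState` class and its table layout are hypothetical, and the sketch assumes task state is JSON-serialisable.

```python
import json
import sqlite3


class TaskState:
    """Lossless key-value scratchpad for one task; no model runs on write."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def put(self, key: str, value: object) -> None:
        # Store the value as JSON text, replacing any previous value for the key.
        self._db.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._db.commit()

    def get(self, key: str, default: object = None) -> object:
        row = self._db.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else default

    def clear(self) -> None:
        # Working memory is cleared when the task ends.
        self._db.execute("DELETE FROM state")
        self._db.commit()


state = TaskState()
state.put("step", 3)
state.put("tool_output", {"status": "ok", "rows": [1, 2, 3]})
print(state.get("tool_output"))
state.clear()
```

Writes here cost one SQL statement each, and reads are direct lookups by key. Nothing is summarised, extracted, or normalised along the way.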

Document memory is RAG, and it is the type most often mistaken for the whole of memory. Indexing your books or a knowledge base so the model can look things up is genuinely useful, but it answers "what do the documents say?" rather than "what do I remember about you?" A vector index returns the chunk that looks most similar to the query, which is not the same as tracking how a fact changed over time. That gap is the entire subject of RAG vs memory.
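The gap between "looks most similar" and "is currently true" shows up even in a toy retriever. The sketch below uses bag-of-words cosine similarity as a stand-in for an embedding model; the chunks, the query, and the function names are made up for illustration.

```python
import math
from collections import Counter


def vectorise(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = vectorise(query)
    return sorted(chunks, key=lambda c: cosine(q, vectorise(c)), reverse=True)[:k]


chunks = [
    "the refund window is 30 days",                    # older document
    "policy update: refunds now close after 14 days",  # newer document
    "shipping takes 5 business days",
]
result = top_k("what is the refund window", chunks)
print(result[0])
```

The retriever picks the older chunk because it shares the most words with the query. Nothing in the index knows that the newer chunk replaced it.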

Semantic memory is where the field's hardest problem lives: a fact you stored last month may be wrong today. The classic example: a user loved Adidas, the shoes broke, they switched to Puma. A flat store that only appends keeps all three statements and happily returns "I love Adidas" because it scores highest against a question about shoe preferences. Real semantic memory has to supersede, not just accumulate. See smart extraction and temporal memory.
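A minimal sketch of supersede-on-write, assuming an upstream extraction step has already normalised each statement into a (subject, attribute, value) triple. That normalisation is the hard part and is not shown. `Fact`, `FactStore`, and the attribute name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    recorded_at: datetime
    superseded_at: Optional[datetime] = None  # None means "still current"


class FactStore:
    """Keeps history, but only one live value per (subject, attribute)."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def write(self, subject: str, attribute: str, value: str, at: datetime) -> None:
        # Retire any live fact for the same slot before adding the new one.
        for fact in self._facts:
            same_slot = (fact.subject, fact.attribute) == (subject, attribute)
            if same_slot and fact.superseded_at is None:
                fact.superseded_at = at
        self._facts.append(Fact(subject, attribute, value, at))

    def current(self, subject: str, attribute: str) -> Optional[str]:
        for fact in self._facts:
            same_slot = (fact.subject, fact.attribute) == (subject, attribute)
            if same_slot and fact.superseded_at is None:
                return fact.value
        return None

    def history(self, subject: str, attribute: str) -> list[str]:
        # Values in write order, including superseded ones.
        return [
            f.value for f in self._facts
            if (f.subject, f.attribute) == (subject, attribute)
        ]


store = FactStore()
store.write("user", "favourite_shoe_brand", "Adidas", datetime(2024, 1, 5))
store.write("user", "favourite_shoe_brand", "Puma", datetime(2024, 3, 20))
print(store.current("user", "favourite_shoe_brand"))
print(store.history("user", "favourite_shoe_brand"))
```

The old value is retired rather than deleted, so the store can still answer "what did they prefer before?" while never surfacing Adidas as the current preference.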

Procedural memory is the odd one out: it is usually not stored as data rows at all. It is the behavioural rules and skills you encode in the system prompt and improve over time. One recurring community insight is that teams over-invest in storing facts and under-invest here, so the agent re-derives how to use what it knows on every run.
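One way to picture this: procedural memory is a short list of standing rules rendered into the system prompt on every run, never retrieved by search. The sketch below is illustrative only. The rule wording and function names are hypothetical, and how new rules get learned from past trajectories is out of scope here.

```python
BASE_PROMPT = "You are a helpful assistant."


def add_rule(rules: list[str], rule: str) -> list[str]:
    """Return a new rule list with the rule appended, skipping exact duplicates."""
    return rules if rule in rules else [*rules, rule]


def render_system_prompt(base: str, rules: list[str]) -> str:
    """Procedural memory is always present: every rule goes into every prompt."""
    if not rules:
        return base
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"{base}\n\nStanding rules:\n{numbered}"


rules: list[str] = []
rules = add_rule(rules, "Check stored preferences before recommending a product.")
rules = add_rule(rules, "When a stored fact conflicts with the user, trust the user.")
rules = add_rule(rules, "Check stored preferences before recommending a product.")
system_prompt = render_system_prompt(BASE_PROMPT, rules)
print(system_prompt)
```

Because the rules sit in the prompt on every run, the agent does not have to re-derive how to use what it knows each time.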

#The human-memory analogy is useful, and dangerous

The split above is borrowed honestly from cognitive science. Humans really do have distinct memory systems: short-term versus long-term, episodic (what happened) versus semantic (what you know). Recent surveys bridge that vocabulary directly into AI design, and it is the reason the categories feel intuitive. Reaching for "episodic" and "semantic" is good, because it forces you to ask what kind of thing you are storing before you pick a database.

The danger is taking the metaphor literally. Human memory is associative, lossy, and self-modifying: recalling something can quietly rewrite it. Those are features for a person and bugs for a system you want to trust for reliable recall. As one practitioner put it, mapping LLM memory onto human memory one-to-one is a mistake, because AI memory has to be intentionally managed in a way the brain never is. And note that humans do not have a single general-purpose memory layer either, which is the whole point: even the thing you are copying is not one bucket. Use the analogy for vocabulary and intuition. Do not use it to justify a mechanism.

#What follows from this

Once you accept that memory is plural, the hard work is no longer "which database?" It is the curation and ranking around the store: deciding what is salient enough to keep, reconciling a new fact against a contradicting old one, forgetting what has gone stale, and ordering what you surface so the model actually uses it. Storage is largely solved. The rest of this guide is about the parts that are not. Start with context vs memory if you are wondering whether a bigger context window makes all of this moot.

#References

Lewis et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (the parametric vs non-parametric memory distinction). https://arxiv.org/abs/2005.11401
LangMem documentation, LangChain (the semantic / episodic / procedural memory taxonomy and the profile-vs-collection design lens). https://langchain-ai.github.io/langmem/
Zhang et al., 2024. A Survey on the Memory Mechanism of LLM-based Agents (memory sources and the write → consolidate → read operation view). https://arxiv.org/abs/2404.13501
Wu et al., 2025. From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs (the human-to-AI memory bridge and its taxonomy). https://arxiv.org/abs/2504.15965
Park et al., 2023. Generative Agents: Interactive Simulacra of Human Behavior (the episodic memory stream and recency + importance + relevance ranking). https://arxiv.org/abs/2304.03442
Community discussion: "Universal 'LLM memory' is mostly a marketing term" and "Universal LLM Memory Doesn't Exist," r/LLMDevs and r/LocalLLaMA (the working / episodic / semantic / document split and the cost-of-LLM-on-write claim, reported by builders, not peer-reviewed).