What Is AI Memory?
AI memory is an external, editable store a model retrieves from across sessions, and it is not one thing: working, episodic, semantic, procedural, and document memory each store, surface, and fail differently.
Close the tab and the model forgets you. Open it again tomorrow and it reintroduces itself, re-asks what you already told it, and offers advice you corrected last week. The fix gets called "memory," which makes it sound like a single feature you switch on. It is not.
A model's own knowledge lives in its weights, frozen at training time. You cannot edit it, and it knows nothing about you. AI memory is the layer you add on top: an external, editable store the model can retrieve from across sessions, plus the machinery that decides what to write down and what to surface at the right moment. The 2020 RAG paper drew this line precisely. Parametric memory is baked into the weights and cannot change without retraining. Non-parametric memory is an external index you can update, delete, and reorder at any time. Every memory system since is a descendant of that second idea: keep the facts outside the model, fetch them when they help.
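To make the non-parametric idea concrete, here is a minimal sketch of "keep the facts outside the model, fetch them when they help." The `ExternalMemory` class, the `build_prompt` helper, and the word-overlap scoring are all illustrative assumptions, not any library's API; a real system would typically use embeddings for retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalMemory:
    """Hypothetical non-parametric store: plain records kept outside the model."""
    records: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, text: str) -> None:
        # Writing or overwriting needs no retraining; it is a dictionary update.
        self.records[key] = text

    def delete(self, key: str) -> None:
        self.records.pop(key, None)

    def retrieve(self, query: str, limit: int = 3) -> list[str]:
        # Naive relevance (an assumption): count query words found in the record.
        words = set(query.lower().split())
        scored = [
            (len(words & set(text.lower().split())), text)
            for text in self.records.values()
        ]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in matches[:limit]]


def build_prompt(memory: ExternalMemory, user_message: str) -> str:
    """Fetch relevant records and place them ahead of the user's message."""
    recalled = memory.retrieve(user_message)
    context = "\n".join(f"- {line}" for line in recalled) or "- (nothing relevant stored)"
    return f"Known about this user:\n{context}\n\nUser: {user_message}"


memory = ExternalMemory()
memory.write("diet", "The user is vegetarian")
memory.write("city", "The user lives in Lisbon")
prompt = build_prompt(memory, "Suggest a vegetarian dinner")
```

The model's weights never change here. Everything it "remembers" about the user arrives through the prompt, and a single `delete` call removes it.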
Storage, then, is the easy half. The hard half is the machinery: deciding what is worth keeping, resolving a new fact against the old one it contradicts, and surfacing only what matters for the current turn instead of burying it under everything else. Get that wrong and "memory" makes the model worse, not better, by feeding it stale or irrelevant context.
# Memory is not one thing
The most common mistake is treating memory as a single bucket you dump everything into. Builders keep rediscovering, independently, that it splits into distinct kinds with different storage, different write triggers, and different ways of breaking. The cleanest shared vocabulary comes from the cognitive-memory split that frameworks like LangMem adopted (semantic, episodic, procedural), extended with the two kinds the community names in practice: working memory and document memory.
| Type | Storage primitive | Write trigger | Read path | Decay | Failure mode |
|---|---|---|---|---|---|
| Working | The live context window; a KV store, log, or SQLite file for task state | Every turn or step | Already in context, or a direct lookup | Cleared when the task ends; evicted when context fills | Overflow; running an extraction model on it is wasteful and slow |
| Episodic | Append-only log or "memory stream" of past events | After an event or turn | Ranked recall by recency, importance, and relevance | Grows unbounded; consolidated into facts; recency-decayed | The D&D bot forgets the NPC from five sessions ago |
| Semantic | Fact records in a vector store, or one structured profile | Extracted from conversation, or user-approved | Semantic search, usually hybrid with keyword | Should supersede, not just append | Stale facts surface as still true |
| Procedural | Behavioural rules, encoded in the system prompt | Prompt optimisation from past trajectories | Always present; not retrieved | Rarely changes; updated by re-tuning the prompt | Agent re-derives how to use knowledge every run |
| Document | Vector index of document chunks, plus keyword | Ingest and index a corpus, offline | Retrieve top-K, stuff into the prompt | Re-index when the corpus changes | Returns the right-looking chunk, not the right state |
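The episodic row's read path (ranked recall by recency, importance, and relevance) can be sketched as a simple additive score, loosely following the Generative Agents idea. Several things here are assumptions made for illustration: the `Episode` fields, the 0.995 hourly decay, the word-overlap stand-in for embedding similarity, the equal weighting, and the NPC name in the sample data.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    text: str
    hours_ago: float     # time since the event was recorded or last recalled
    importance: float    # 0.0 to 1.0, assumed to be assigned at write time


def relevance(query: str, text: str) -> float:
    # Stand-in for embedding similarity: Jaccard overlap of lowercase words.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def score(episode: Episode, query: str, decay: float = 0.995) -> float:
    recency = decay ** episode.hours_ago   # decays exponentially toward 0
    return recency + episode.importance + relevance(query, episode.text)


def recall(episodes: list[Episode], query: str, k: int = 2) -> list[Episode]:
    return sorted(episodes, key=lambda e: score(e, query), reverse=True)[:k]


log = [
    Episode("the party met the innkeeper Mirela in session one", hours_ago=800, importance=1.0),
    Episode("the bard ordered soup", hours_ago=48, importance=0.1),
    Episode("the party crossed a river", hours_ago=30, importance=0.1),
]
top = recall(log, "who is the innkeeper Mirela", k=1)
```

In this sample the old NPC episode still ranks first because importance and relevance outweigh its decayed recency. If importance is dropped from the sum, the recent but trivial episodes win instead, which is the failure mode in the table.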
A few of these deserve a closer look, because the differences are where systems go wrong.
Working memory is the live scratchpad: the current task, tool outputs, variables in flight. Its natural home is the context window itself, and for anything that spills over, plain lossless storage (a key-value store, an append-only log, an SQLite file) beats anything clever. One builder benchmarked memory frameworks against a plain long-context baseline and reported that they cost 14 to 77 times more and were roughly 30% less accurate at recall, because they ran a model to extract and normalise facts on every single message. That cost is the right shape for durable preferences and the wrong shape for execution state, which just needs to be stored and read back verbatim.
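As a rough illustration of "stored and read back verbatim," the sketch below keeps task state in SQLite using only Python's standard library. The `TaskState` class, its table layout, and the method names are hypothetical; the point is that no model runs on the write path.

```python
import json
import sqlite3


class TaskState:
    """Hypothetical scratchpad: values are stored verbatim, with no extraction model."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS state ("
            "task_id TEXT, key TEXT, value TEXT, PRIMARY KEY (task_id, key))"
        )

    def put(self, task_id: str, key: str, value: object) -> None:
        # JSON keeps the round trip lossless for plain data (dicts, lists, numbers).
        self.db.execute(
            "INSERT OR REPLACE INTO state VALUES (?, ?, ?)",
            (task_id, key, json.dumps(value)),
        )
        self.db.commit()

    def get(self, task_id: str, key: str, default: object = None) -> object:
        row = self.db.execute(
            "SELECT value FROM state WHERE task_id = ? AND key = ?",
            (task_id, key),
        ).fetchone()
        return json.loads(row[0]) if row else default

    def clear(self, task_id: str) -> None:
        # Working memory is cleared when the task ends.
        self.db.execute("DELETE FROM state WHERE task_id = ?", (task_id,))
        self.db.commit()


state = TaskState()
state.put("task-1", "tool_output", {"rows": 3, "files": ["a.csv"]})
restored = state.get("task-1", "tool_output")
```

Each write is a single local insert and each read is a primary-key lookup, which is why this shape suits execution state better than a pipeline that calls a model on every message.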
Document memory is RAG, and it is the type most often mistaken for memory as a whole. Indexing your books or a knowledge base so the model can look things up is genuinely useful, but it answers "what do the documents say?" rather than "what do I remember about you?" A vector index returns the chunk that looks most similar to the query, which is not the same as tracking how a fact changed over time. That gap is the entire subject of RAG vs memory.
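The gap between "most similar chunk" and "current state" shows up even in a toy retriever. The sketch below uses bag-of-words cosine similarity as a stand-in for an embedding index; the chunk texts and function names are invented for illustration.

```python
import math
from collections import Counter


def vectorise(text: str) -> Counter:
    # Bag-of-words vector; a real index would use learned embeddings.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(chunks: list[str], query: str, k: int = 1) -> list[str]:
    q = vectorise(query)
    return sorted(chunks, key=lambda c: cosine(q, vectorise(c)), reverse=True)[:k]


chunks = [
    "our office is in berlin",             # older document
    "we relocated to munich last spring",  # newer document, the current state
]
best = top_k(chunks, "where is our office")
```

The retriever returns the Berlin chunk because it shares the most words with the query. Nothing in the index records that the Munich chunk superseded it.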
Semantic memory is where the field's hardest problem lives: a fact you stored last month may be wrong today. The classic example: a user loved Adidas, the shoes broke, they switched to Puma. A flat store that only appends keeps all three statements and happily returns "I love Adidas" because it scores highest on similarity to the query. Real semantic memory has to supersede, not just accumulate. See smart extraction and temporal memory.
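One way to picture "supersede, not just accumulate" is a store that closes the old fact's validity window when a new value arrives for the same slot. This is a sketch under a large assumption: that something upstream has already decided the new statement targets the same `(subject, attribute)` slot, which is the genuinely hard part. The `Fact` and `SemanticStore` names and the dates are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None   # None means "still believed true"


@dataclass
class SemanticStore:
    facts: list[Fact] = field(default_factory=list)

    def assert_fact(self, subject: str, attribute: str, value: str, on: date) -> None:
        # Close any open fact for the same slot instead of appending beside it.
        for fact in self.facts:
            if (fact.subject, fact.attribute) == (subject, attribute) and fact.valid_to is None:
                fact.valid_to = on
        self.facts.append(Fact(subject, attribute, value, valid_from=on))

    def current(self, subject: str, attribute: str) -> Optional[str]:
        for fact in self.facts:
            if (fact.subject, fact.attribute) == (subject, attribute) and fact.valid_to is None:
                return fact.value
        return None

    def history(self, subject: str, attribute: str) -> list[str]:
        return [f.value for f in self.facts if (f.subject, f.attribute) == (subject, attribute)]


store = SemanticStore()
store.assert_fact("user", "preferred_shoe_brand", "Adidas", on=date(2025, 1, 10))
store.assert_fact("user", "preferred_shoe_brand", "Puma", on=date(2025, 3, 2))
```

Here `current` returns only the open fact, while `history` still shows that Adidas came first, so the old preference is retired without being erased.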
Procedural memory is the odd one out: it is usually not stored as data rows at all. It is the behavioural rules and skills you encode in the system prompt and improve over time. One recurring community insight is that teams over-invest in storing facts and under-invest here, so the agent re-derives how to use what it knows on every run.
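Because procedural memory lives in the prompt rather than in retrieved rows, the simplest sketch is a function that renders a curated rule list into the system prompt on every run. The rule texts, `BASE_PROMPT`, and both helper names are made up for illustration, and how rules get proposed (by a person or by a prompt optimiser) is left out.

```python
BASE_PROMPT = "You are a support assistant."


def add_rule(rules: list[str], rule: str) -> list[str]:
    """Return a new rule list, skipping exact duplicates (case-insensitive)."""
    normalised = rule.strip()
    if any(normalised.lower() == existing.lower() for existing in rules):
        return list(rules)
    return [*rules, normalised]


def render_system_prompt(base: str, rules: list[str]) -> str:
    """Rules are always present in the prompt; nothing is retrieved per turn."""
    if not rules:
        return base
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"{base}\n\nStanding rules:\n{numbered}"


rules: list[str] = []
rules = add_rule(rules, "Check the user's stored plan before quoting prices.")
rules = add_rule(rules, "Ask before overwriting a saved preference.")
rules = add_rule(rules, "ask before overwriting a saved preference.")  # duplicate, ignored
system_prompt = render_system_prompt(BASE_PROMPT, rules)
```

The agent then starts each run already knowing how to use its stored facts, instead of working that out again.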
# The human-memory analogy is useful, and dangerous
The split above is borrowed honestly from cognitive science. Humans really do have distinct memory systems: short-term versus long-term, episodic (what happened) versus semantic (what you know). Recent surveys bridge that vocabulary directly into AI design, and it is the reason the categories feel intuitive. Reaching for "episodic" and "semantic" is good, because it forces you to ask what kind of thing you are storing before you pick a database.
The danger is taking the metaphor literally. Human memory is associative, lossy, and self-modifying: recalling something can quietly rewrite it. Those are features for a person and bugs for a system you want to trust for reliable recall. As one practitioner put it, mapping LLM memory onto human memory one-to-one is a mistake, because AI memory has to be intentionally managed in a way the brain never is. And note that humans do not have a single general-purpose memory layer either, which is the whole point: even the thing you are copying is not one bucket. Use the analogy for vocabulary and intuition. Do not use it to justify a mechanism.
# What follows from this
Once you accept that memory is plural, the hard work is no longer "which database?" It is the curation and ranking around the store: deciding what is salient enough to keep, reconciling a new fact against a contradicting old one, forgetting what has gone stale, and ordering what you surface so the model actually uses it. Storage is largely solved. The rest of this guide is about the parts that are not. Start with context vs memory if you are wondering whether a bigger context window makes all of this moot.
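To give the curation step some shape, here is a deliberately crude sketch of the last two jobs: dropping stale candidates and ordering what is left within a budget. The `Candidate` fields, the age cutoff, and the character budget are assumptions chosen for brevity; real systems use richer staleness signals than age alone.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float     # relevance to the current turn, from whatever ranker you use
    age_days: int


def select_for_turn(
    candidates: list[Candidate],
    max_age_days: int = 180,
    budget_chars: int = 200,
) -> list[str]:
    # Forget: drop anything older than the cutoff (a crude stand-in for staleness).
    fresh = [c for c in candidates if c.age_days <= max_age_days]
    # Order: highest-scoring memories first, so they sit where the model sees them.
    ranked = sorted(fresh, key=lambda c: c.score, reverse=True)
    chosen: list[str] = []
    used = 0
    for candidate in ranked:
        if used + len(candidate.text) > budget_chars:
            continue  # skip items that would overflow the budget
        chosen.append(candidate.text)
        used += len(candidate.text)
    return chosen


pool = [
    Candidate("Loved Adidas", score=0.95, age_days=400),
    Candidate("Prefers Puma running shoes", score=0.90, age_days=20),
    Candidate("Lives in Lisbon", score=0.20, age_days=10),
]
surfaced = select_for_turn(pool)
tight = select_for_turn(pool, budget_chars=30)
```

The highest-scoring candidate never reaches the prompt because it is stale, and under the tighter budget only the single most relevant fresh memory is surfaced.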
# References
- Lewis et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (the parametric vs non-parametric memory distinction). https://arxiv.org/abs/2005.11401
- LangMem documentation, LangChain (the semantic / episodic / procedural memory taxonomy and the profile-vs-collection design lens). https://langchain-ai.github.io/langmem/
- Zhang et al., 2024. A Survey on the Memory Mechanism of LLM-based Agents (memory sources and the write → consolidate → read operation view). https://arxiv.org/abs/2404.13501
- Wu et al., 2025. From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs (the human-to-AI memory bridge and its taxonomy). https://arxiv.org/abs/2504.15965
- Park et al., 2023. Generative Agents: Interactive Simulacra of Human Behavior (the episodic memory stream and recency + importance + relevance ranking). https://arxiv.org/abs/2304.03442
- Community discussion: "Universal 'LLM memory' is mostly a marketing term" and "Universal LLM Memory Doesn't Exist," r/LLMDevs and r/LocalLLaMA (the working / episodic / semantic / document split and the cost-of-LLM-on-write claim, reported by builders, not peer-reviewed).