
The Memory Tool Landscape

A neutral, use-case-keyed map of the real AI-memory tools, split into corpus and agent memory families, with the DIY markdown floor and an honest caveat about the benchmark wars.

You have read how the pieces work. The question that follows is the one every builder actually asks: which of these tools should I use? The honest answer is that it depends on the question you need answered, and nothing here wins everywhere. A system tuned to summarise a 10,000-page corpus is not the system you want tracking that a user switched from Adidas to Puma last week. This page is a map, not a ranking. It sorts the field into two families, places the real tools on a structure axis, and is blunt about where the leaderboards mislead.

#Two families that the field keeps conflating

The word "memory" covers two overlapping jobs. Corpus memory turns a static document set into a structured index you can reason over: ingest a corpus once, build a graph or a tree, then answer questions no single chunk contains. GraphRAG, HippoRAG, RAPTOR, and LightRAG live here. Agent memory accumulates, updates, and recalls facts about a user or agent across many sessions: it has to handle a fact changing, being superseded, or needing to be forgotten. mem0, Zep/Graphiti, Letta, LangMem, and A-MEM live here. The two families share machinery (embeddings, chunking, hybrid retrieval) but optimise for different failure modes, which is why a corpus tool feels wrong as a personalisation layer and vice versa. Cognee straddles the line: it builds a knowledge graph out of documents, but with the incremental updating and ontology grounding that agent-style memory needs.

The second axis is how much structure a tool imposes at write time, from flat vectors at the bottom to a full graph at the top. More structure buys multi-hop and temporal reasoning; it costs more to build and keep in sync.

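[Figure: quadrant chart. Vertical axis: flat / vector (bottom) to graph / structured (top). Horizontal axis: corpus memory to agent memory. Quadrants: corpus · graph, agent · graph, corpus · vector, agent · vector. Tools plotted: GraphRAG, HippoRAG, RAPTOR, Zep/Graphiti, A-MEM, supermemory, Letta, mem0, LangMem, MemoryPlugin.]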
Figure 1. The landscape on two axes: corpus versus agent memory (horizontal) and flat versus graph structure (vertical). Plain vector RAG and DIY markdown anchor the flat corners; temporal graphs sit at the top.

#A use-case-keyed comparison

Read this by your use case, not top to bottom. The "approach" column is what the tool actually does; the "best fit" column is the question it answers well. None of this implies a quality ordering.

| Tool | Family | Approach | Best fit |
| --- | --- | --- | --- |
| mem0 | agent | Lean memory CRUD; v3 is single-pass ADD-only with conflict resolution moved to read-time ranking | User-level personalisation and session memory, not deep document corpora |
| Zep / Graphiti | agent | Bi-temporal knowledge graph; contradictions invalidate edges rather than delete them | Evolving facts where "what was true when" matters (conversational KGs) |
| Letta / MemGPT | agent | OS-style tiered memory the agent self-edits via tool calls | Autonomous agents that curate their own working memory |
| LangMem | agent | Primitives, not a store: semantic / episodic / procedural memory on LangGraph | Building your own memory layer with composable parts |
| A-MEM | agent | Self-organising Zettelkasten notes; a new note can rewrite its neighbours | Emergent, self-evolving note memory (research-stage) |
| supermemory | agent (+ RAG) | Memory API plus a transparent proxy; auto-maintained static/dynamic user profiles | Drop-in memory for an app, via SDK or a base-URL swap |
| GraphRAG | corpus | Entity graph plus Leiden communities, each summarised by an LLM | Global "sensemaking" over a fixed corpus ("what are the themes?") |
| HippoRAG | corpus | OpenIE triples plus Personalized PageRank for one-step multi-hop | Cheap multi-hop QA over a corpus, far lighter to index than GraphRAG |
| RAPTOR | corpus | Recursive embed/cluster/summarise into a tree; query every level | Hierarchical document QA needing both detail and theme |
| LightRAG | corpus | Dual-level (entity + theme) graph RAG with incremental updates | Graph RAG over documents that change without full re-indexing |
| Cognee | both | Extract/Cognify/Load pipeline; triples validated against an ontology | Automated KG construction from heterogeneous documents |
| MemoryPlugin | agent | Cross-tool consumer memory plus background chat-history sync, surfaced over MCP | Portable memory across ChatGPT, Claude, and Gemini, with chat history searchable on demand |

A few honest contrasts behind the table. mem0 and Graphiti sit at opposite ends of the conflict-resolution fork: mem0 concluded that diffing every new fact against old memory at write time was slower and lower-quality than appending and ranking at read time, while Graphiti keeps a full bi-temporal history so it can answer time-travel queries (see the updates and conflicts page and the temporal memory page). Letta hands curation to the model itself, so memory quality tracks the model's judgment; the background-pipeline tools are more predictable but more rigid. On the corpus side, GraphRAG's per-community summaries unlock global questions but inject text that can hurt simple factual recall, which is exactly what HippoRAG was built to avoid (the knowledge graphs page develops this).
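The conflict-resolution fork is easier to see in miniature. The sketch below is an illustration only, not mem0's or Graphiti's actual API: `AppendOnlyMemory`, `TemporalMemory`, and every field name are hypothetical, the "newest wins" ranking rule is an assumption, and the temporal store tracks a single validity interval per edge, which is a simplification of a full bi-temporal model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    recorded_at: datetime


class AppendOnlyMemory:
    """Append at write time; settle conflicts when reading (hypothetical)."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def add(self, fact: Fact) -> None:
        self.facts.append(fact)  # no diff against older facts on the write path

    def recall(self, subject: str, attribute: str) -> Optional[str]:
        matches = [f for f in self.facts
                   if f.subject == subject and f.attribute == attribute]
        if not matches:
            return None
        # Assumed ranking rule: the most recently recorded fact wins.
        return max(matches, key=lambda f: f.recorded_at).value


@dataclass
class Edge:
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    invalid_at: Optional[datetime] = None  # None means still considered true


class TemporalMemory:
    """Contradictions close the old edge instead of deleting it (hypothetical)."""

    def __init__(self) -> None:
        self.edges: list[Edge] = []

    def add(self, subject: str, attribute: str, value: str, when: datetime) -> None:
        for e in self.edges:
            if (e.subject, e.attribute) == (subject, attribute) and e.invalid_at is None:
                if e.value == value:
                    return  # already known and still valid
                e.invalid_at = when  # invalidate, keep for history
        self.edges.append(Edge(subject, attribute, value, when))

    def as_of(self, subject: str, attribute: str, when: datetime) -> Optional[str]:
        for e in self.edges:
            if ((e.subject, e.attribute) == (subject, attribute)
                    and e.valid_from <= when
                    and (e.invalid_at is None or when < e.invalid_at)):
                return e.value
        return None


# Made-up example data: the user switches brands between January and March.
jan, mar = datetime(2024, 1, 1), datetime(2024, 3, 1)

flat = AppendOnlyMemory()
flat.add(Fact("user", "shoe_brand", "Adidas", jan))
flat.add(Fact("user", "shoe_brand", "Puma", mar))

graph = TemporalMemory()
graph.add("user", "shoe_brand", "Adidas", jan)
graph.add("user", "shoe_brand", "Puma", mar)
```

Both stores can say what the user prefers now. Only the temporal one can also answer `graph.as_of("user", "shoe_brand", datetime(2024, 2, 1))`, and it pays for that with a scan over existing edges on every write.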

#The markdown floor

Before any of these, there is a floor worth taking seriously: plain files. A popular pattern (Andrej Karpathy's "LLM wiki" is the usual anchor) is a MEMORY.md for durable facts, a TASKS.md for current priorities, and dated episodic logs, all read at session start and all git-tracked. It is genuinely good for a reason builders repeat: with fifty to a hundred key facts, you can read the memory, debug it, and version-control it, and a vector database is overkill. A common refinement is two tiers, raw daily logs distilled periodically into the main file, with a light embedding search added only once you pass twenty-odd files.
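As a rough sketch of what "read at session start" can look like, the loader below concatenates the durable facts, the current tasks, and the newest few daily logs into one block of context. The `logs/YYYY-MM-DD.md` layout, the `load_memory` name, and the `max_logs` parameter are assumptions made for illustration; the pattern itself does not prescribe them.

```python
from pathlib import Path


def load_memory(root: Path, max_logs: int = 3) -> str:
    """Build a session preamble from plain markdown files under `root`."""
    parts: list[str] = []

    # Durable facts and current priorities, if the files exist.
    for name in ("MEMORY.md", "TASKS.md"):
        path = root / name
        if path.is_file():
            parts.append(f"## {name}\n{path.read_text(encoding='utf-8').strip()}")

    # Assumed layout: logs/YYYY-MM-DD.md, so lexical order is date order.
    logs_dir = root / "logs"
    if logs_dir.is_dir():
        for path in sorted(logs_dir.glob("*.md"))[-max_logs:]:
            parts.append(f"## {path.name}\n{path.read_text(encoding='utf-8').strip()}")

    return "\n\n".join(parts)
```

Calling `load_memory(Path("."))` at the start of a session and prepending the result to the system prompt is the whole mechanism, which is why the approach stays easy to read, debug, and keep in git.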

The floor breaks in four predictable places, and they map onto what the products sell: portability across tools (files live in one client), cross-device sync, scale beyond a few hundred facts (linear reads stop being free), and auto-capture (you have to remember to write everything down, which defeats the point). That last one is the wedge for chat-history memory: you cannot reliably predict in advance which conversation will matter later, so capturing it automatically beats hand-curated notes.

#On the benchmark wars

The leaderboards are where credibility goes to die, so treat them carefully.

There are two further reasons to discount single-number claims. The benchmarks themselves are contested: a Letta engineer's much-quoted line is that "memory is not retrieval, memory is active management of context, and LoCoMo is simply not designed for that," and vendors have publicly accused each other of unfair benchmark setups. And the cost axis is usually omitted: one builder reported that, for working-state recall, running an extraction LLM on every message made memory systems roughly 14 to 77 times more expensive and about 30% less accurate than just passing full history (a community claim, not a peer-reviewed result, and it depends entirely on whether you run an LLM on every write). The right move is to report the triple of accuracy, latency, and context tokens together, which the evaluating page covers.
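Reporting the triple can be as simple as refusing to return accuracy on its own. The sketch below is one possible shape, not a standard harness: the `Trial` record, the `summarise` function, and the choice of median latency and mean context tokens as summary statistics are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Trial:
    correct: bool          # did the system answer this question correctly?
    latency_ms: float      # end-to-end time for the answer
    context_tokens: int    # tokens placed in the prompt to produce it


def summarise(trials: list[Trial]) -> dict[str, float]:
    """Return accuracy, latency, and context cost together, never one alone."""
    if not trials:
        raise ValueError("need at least one trial")
    return {
        "accuracy": mean(1.0 if t.correct else 0.0 for t in trials),
        "median_latency_ms": median(t.latency_ms for t in trials),
        "mean_context_tokens": mean(t.context_tokens for t in trials),
    }
```

Running the same question set through a memory system and through a plain full-history baseline, then comparing the two dictionaries side by side, makes the cost trade-off visible instead of hiding it behind one score.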

#How to choose

Start from the symptom. If facts about a user change over time and stale ones keep resurfacing, you want agent memory with real supersession, which at the structured end is a temporal graph (Zep/Graphiti) and at the lean end is mem0 or supermemory for straightforward personalisation. If you need to reason over a fixed body of documents, you want corpus memory: GraphRAG for whole-corpus themes, HippoRAG for cheap multi-hop, RAPTOR for mixed detail-and-theme QA, LightRAG when the corpus keeps changing, Cognee when you want an ontology-grounded graph built for you. If you want the agent to manage its own memory as it works, that is Letta. If you are assembling your own stack, LangMem gives you primitives. And if the goal is to own one memory that follows you across ChatGPT, Claude, and Gemini, with past chats captured automatically, that is the cross-tool consumer case that MemoryPlugin targets (see the across tools page). Most teams need less structure than the demos suggest; match the tool to the question, and let the failure modes you actually hit pull you up the structure axis.
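The same mapping can be written down as a lookup, which is a handy way to keep a team honest about which symptom it is actually solving. This is only a restatement of the paragraph above in code; the `Need` names and the `shortlist` helper are hypothetical, and the result is a starting shortlist, not a ranking.

```python
from enum import Enum, auto


class Need(Enum):
    EVOLVING_FACTS = auto()          # stale user facts keep resurfacing
    SIMPLE_PERSONALISATION = auto()  # lean user-level memory
    CORPUS_THEMES = auto()           # whole-corpus "what are the themes?"
    CHEAP_MULTI_HOP = auto()         # multi-hop QA over a corpus
    DETAIL_AND_THEME = auto()        # mixed detail-and-theme document QA
    CHANGING_CORPUS = auto()         # documents that keep changing
    ONTOLOGY_GRAPH = auto()          # ontology-grounded graph built for you
    SELF_MANAGED = auto()            # the agent curates its own memory
    BUILD_YOUR_OWN = auto()          # composable primitives
    CROSS_TOOL = auto()              # one memory across chat clients


SHORTLIST: dict[Need, list[str]] = {
    Need.EVOLVING_FACTS: ["Zep/Graphiti"],
    Need.SIMPLE_PERSONALISATION: ["mem0", "supermemory"],
    Need.CORPUS_THEMES: ["GraphRAG"],
    Need.CHEAP_MULTI_HOP: ["HippoRAG"],
    Need.DETAIL_AND_THEME: ["RAPTOR"],
    Need.CHANGING_CORPUS: ["LightRAG"],
    Need.ONTOLOGY_GRAPH: ["Cognee"],
    Need.SELF_MANAGED: ["Letta"],
    Need.BUILD_YOUR_OWN: ["LangMem"],
    Need.CROSS_TOOL: ["MemoryPlugin"],
}


def shortlist(needs: list[Need]) -> list[str]:
    """Collect candidate tools for the given needs, in order, without duplicates."""
    tools: list[str] = []
    for need in needs:
        for tool in SHORTLIST[need]:
            if tool not in tools:
                tools.append(tool)
    return tools
```

For example, `shortlist([Need.EVOLVING_FACTS, Need.CORPUS_THEMES])` yields one agent-memory candidate and one corpus-memory candidate, which is a reminder that a product with both symptoms may need two tools rather than one.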
