
Evaluating Memory Systems

How to tell whether a memory system actually works, why the popular benchmarks mislead, and the accuracy-latency-tokens triple to report instead of a single score.

A vendor tells you their memory layer scores 91% on LoCoMo. Another claims to be number one on LongMemEval at 81.6%. A third reports 93.4% on the same LongMemEval. All three numbers are real, all three are published, and they cannot be ranked consistently against one another. That is the first thing to understand about evaluating memory: the headline score is usually the least informative number in the report.

#What the standard benchmarks measure

Three benchmarks come up again and again, and they test progressively harder problems.

DMR (Deep Memory Retrieval) comes from the MemGPT line of work. It checks whether an agent can pull a specific fact out of a long conversation. The trouble is that scores have bunched up near the ceiling: MemGPT reported 93.4%, and Zep reported 94.8%, rising to 98.2% with a different backbone model. When everyone clusters in the mid-90s, the benchmark has stopped telling systems apart. It is close to saturated.

LoCoMo is the conversational one everybody quotes. It poses question-answer pairs over long, multi-session dialogues, split into single-hop, multi-hop, and temporal categories. mem0's results are typical of how it gets cited: their V3 rewrite reported a jump from 71.4 to 91.6, and their paper claimed gains of roughly +5% single-hop, +11% temporal, and +7% multi-hop over prior methods, with p95 latency cut more than 91% versus stuffing the whole conversation into context.

LongMemEval (Wu et al., 2025) is the hard one, and the most useful. It is 500 curated questions stressing five distinct long-term-memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention (knowing when the answer is not in memory and declining to invent it). The paper found that commercial assistants lose 30% or more in accuracy over sustained interaction. Because it breaks the score out by ability, it tells you where a system fails, not just what it scored overall. Zep's published breakdown is a good example of what that buys you: large gains on temporal reasoning (+38.4%) and multi-session questions (+30.7%), and a loss of 17.7% on single-session-assistant questions. A system can be excellent at one ability and worse than the base model at another.
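A minimal sketch of why the per-ability breakdown matters, assuming your evaluation harness emits one `(ability, is_correct)` pair per question. The function names and the sample rows are illustrative, not taken from LongMemEval's own tooling, and the data is made up to show how an aggregate can hide a collapse in one ability.

```python
from collections import defaultdict


def accuracy_by_ability(results):
    """Group graded questions by ability and return accuracy per ability.

    `results` is an iterable of (ability, is_correct) pairs. The ability
    labels are whatever your harness assigns; none are hard-coded here.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ability, is_correct in results:
        totals[ability] += 1
        correct[ability] += int(is_correct)
    return {ability: correct[ability] / totals[ability] for ability in totals}


def overall_accuracy(results):
    """The single headline number: correct over total, abilities ignored."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Made-up sample: strong on temporal questions, failing every abstention case.
sample = [
    ("temporal_reasoning", True),
    ("temporal_reasoning", True),
    ("knowledge_update", True),
    ("knowledge_update", False),
    ("abstention", False),
    ("abstention", False),
]

headline = overall_accuracy(sample)
breakdown = accuracy_by_ability(sample)
print(headline)
print(breakdown)
```

On this sample the headline is 0.5, which says nothing about the fact that abstention sits at 0.0 while temporal reasoning sits at 1.0. The breakdown is what tells you where to look.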

#Why the headline number is suspect

The single most clarifying data point here is a community report that Gemini 2.5 Flash, with no memory system at all, scores 72.8% on LoCoMo just by reading the transcript. If a strong base model gets into the seventies for free, then a dedicated memory system reporting the eighties has a much smaller real margin than its marketing implies. The honest baseline for any memory layer is not zero. It is whatever the underlying model already does when you hand it the raw history, and that baseline is high and climbing.
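The arithmetic is worth doing explicitly. The sketch below uses the 72.8% no-memory figure quoted above as the baseline and a hypothetical memory system at 85.0% (an invented number, chosen only to stand in for "the eighties"). Expressing the gain as a share of the remaining headroom is one possible framing, not a standard metric.

```python
def margin_over_baseline(system_acc, baseline_acc):
    """Return (absolute gain in points, share of remaining headroom closed).

    Both inputs are percentages on a 0-100 scale. `baseline_acc` is what the
    underlying model scores when handed the raw history with no memory layer.
    """
    if not 0.0 <= baseline_acc < 100.0:
        raise ValueError("baseline_acc must be in [0, 100)")
    gain = system_acc - baseline_acc
    headroom = 100.0 - baseline_acc
    return gain, gain / headroom


BASELINE = 72.8        # reported no-memory LoCoMo score quoted in the text
HYPOTHETICAL = 85.0    # illustrative memory-system score, not a real result

gain, share = margin_over_baseline(HYPOTHETICAL, BASELINE)
print(f"naive margin over zero: {HYPOTHETICAL:.1f} points")
print(f"real margin over baseline: {gain:.1f} points ({share:.0%} of headroom)")
```

Read against zero, the hypothetical system looks like an 85-point achievement. Read against the baseline, it contributes about 12 points.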

This is why cross-vendor comparisons in marketing should be read as adversarial. Each vendor benchmarks under the configuration that flatters it. mem0 was publicly accused of benchmarking both Zep and Letta unfairly, with one rebuttal titled along the lines of "lies, damn lies, statistics: is Mem0 really SOTA." The 81.6% and 93.4% LongMemEval claims that opened this page come from two different vendors measuring under setups you cannot line up. Neither is lying. They are just not measuring the same thing.

There is a deeper objection, voiced by a Letta engineer in the most-quoted definition in this corner of the internet:

"Memory is not retrieval. Memory is active management of direct context or things that could be brought into context, and LoCoMo is simply not designed for that."

LoCoMo and its cousins reward retrieving the right-looking chunk in answer to a question. But a memory system's real job is deciding what enters the context window in the first place, what gets superseded, and what to leave out so it does not crowd the prompt. A benchmark built around question-answer pairs cannot see most of that work. It is measuring a proxy.

#Report the triple, not a single score

Supermemory's MemScore is the cleanest answer to the single-number problem, and it is worth adopting whether or not you use their tooling. Instead of collapsing quality to one figure, MemScore reports three together: accuracy, latency, and context-tokens, written like 85% / 120ms / 1500tok.

Figure 1. The MemScore triple: a memory system's quality is accuracy, latency, and injected context-tokens read together, not any one of them alone.

The third axis is the one most evaluations miss. Their own framing puts it well: a provider with 90% accuracy at 5,000 tokens is a very different thing from one with 90% accuracy at 500 tokens. The first matches the second's accuracy only by dumping ten times as much into the prompt, which costs money on every call, eats the context budget, and (per the "Lost in the Middle" finding) can actively degrade the model by burying the good material. A system that hits the same accuracy with fewer tokens has more precise retrieval. Accuracy alone hides that completely.

One detail makes the token axis honest: MemScore counts only the tokens in the retrieved context string the memory layer contributed, not the base prompt and question. You are measuring what the memory provider added, isolated from everything else, with a real tokenizer rather than a character estimate. Without that isolation, the number drowns in the rest of the prompt and tells you nothing.
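A rough sketch of how the triple and the token isolation could be computed, under several assumptions: the `Record` and `Triple` types are hypothetical, averaging each axis over questions is a guess at the aggregation rather than MemScore's documented method, and the whitespace tokenizer is a stand-in you would replace with the real tokenizer for your model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Record:
    correct: bool            # graded verdict for this question
    latency_ms: float        # time spent in the memory layer
    base_prompt: str         # system prompt plus question; NOT counted
    retrieved_context: str   # what the memory layer injected; counted


@dataclass(frozen=True)
class Triple:
    accuracy: float          # fraction correct, 0..1
    latency_ms: float
    context_tokens: float

    def __str__(self):
        return (f"{self.accuracy:.0%} / {self.latency_ms:.0f}ms / "
                f"{self.context_tokens:.0f}tok")


def summarize(records: Sequence[Record],
              count_tokens: Callable[[str], int]) -> Triple:
    """Average each axis; tokens are counted on retrieved_context only."""
    n = len(records)
    accuracy = sum(int(r.correct) for r in records) / n
    latency = sum(r.latency_ms for r in records) / n
    tokens = sum(count_tokens(r.retrieved_context) for r in records) / n
    return Triple(accuracy, latency, tokens)


def whitespace_tokens(text: str) -> int:
    """Stand-in tokenizer for the example; swap in a real one in practice."""
    return len(text.split())


records = [
    Record(True, 100.0, "You are a helpful assistant. Which brand?",
           "user switched from Adidas to Puma"),
    Record(False, 140.0, "You are a helpful assistant. Where does Alice work?",
           "Alice works at Stripe"),
]

triple = summarize(records, whitespace_tokens)
print(triple)  # 50% / 120ms / 5tok
```

Because `base_prompt` never reaches the tokenizer, making the system prompt longer leaves the token axis unchanged, which is the property the isolation is meant to guarantee.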

#LLM-as-judge, and its catch

For open-ended answers there is no exact-match key, so correctness is scored by an LLM judge: feed the question, the reference answer, and the system's answer to a model and ask whether it is right, then compute correct over total. MemScore's quality figure works exactly this way, and GraphRAG's evaluation of comprehensiveness and diversity does too. It is the only practical way to grade free-form responses at scale.

The catch is that your metric now contains a second model's opinion. Judge models can be inconsistent run to run, can reward fluent or verbose answers over terse correct ones, and can drift if you change judge versions mid-evaluation. None of that makes the technique unusable, but pin the judge model and version, spot-check its verdicts against human judgment, and prefer deterministic checks (exact dates, numbers, identifiers) wherever the answer admits one.
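The grading loop and its safeguards can be sketched as follows. Everything here is an assumption for illustration: the prompt wording, the `deterministic` flag on each item, the `JUDGE` metadata, and the sample questions and dates are all invented. The `judge` argument is any callable that takes a prompt string and returns the model's reply, so no particular provider API is implied, and the stub below only imitates one.

```python
# Hypothetical pinned judge identity; store it alongside every result set.
JUDGE = {"model": "example-judge", "version": "2025-01-01"}

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "System answer: {answer}\n"
    "Reply with exactly CORRECT or INCORRECT."
)


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def grade(item, answer, judge):
    """Use a deterministic check where the answer admits one, else the judge."""
    if item["deterministic"]:
        # Dates, numbers, identifiers: look for the reference string itself.
        return normalize(item["reference"]) in normalize(answer)
    prompt = JUDGE_PROMPT.format(question=item["question"],
                                 reference=item["reference"],
                                 answer=answer)
    # Strict equality, so "INCORRECT" can never be read as a pass.
    return judge(prompt).strip().upper() == "CORRECT"


def accuracy(items, answers, judge):
    """Correct over total across all graded items."""
    graded = [grade(item, ans, judge) for item, ans in zip(items, answers)]
    return sum(graded) / len(graded)


def stub_judge(prompt):
    """Stand-in for a real model call, used only to make the example run."""
    system_answer = prompt.split("System answer:")[1]
    return "CORRECT" if "Puma" in system_answer else "INCORRECT"


items = [
    {"question": "When did Alice join Stripe?",
     "reference": "2024-03-01", "deterministic": True},
    {"question": "Which brand does the user prefer now?",
     "reference": "Puma", "deterministic": False},
    {"question": "Which brand does the user prefer now?",
     "reference": "Puma", "deterministic": False},
]
answers = [
    "She joined on 2024-03-01.",
    "They switched to Puma.",
    "They love Adidas.",
]

score = accuracy(items, answers, stub_judge)
print(JUDGE, score)
```

Routing date-like items through the string check means the judge's opinion only enters the metric where nothing deterministic is available, and the pinned `JUDGE` record makes it possible to notice when two runs were graded by different judges.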

The abilities that separate real memory from retrieval are the ones the timeline examples test: a fact that changed (the user loved Adidas, then switched to Puma), a fact bound to a time ("Alice now works at Stripe"), and a question whose answer was never stored at all. If your evaluation does not include knowledge-update and abstention cases drawn from your own traffic, a high LoCoMo number is telling you about somebody else's conversations, not yours. For why those cases are hard in the first place, see the pages on updates and conflicts and on temporal memory.
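As a sketch of what such cases can look like in a test suite, the example below encodes the three situations named above against a toy store. `ToyMemory` is a hypothetical stand-in, not any vendor's API: it keeps the most recent value per key and returns `None` for anything it never saw. The keys and timestamps are invented, and in practice you would draw the cases from your own traffic and call your real memory layer instead.

```python
class ToyMemory:
    """Hypothetical latest-write-wins store, used only to make cases runnable."""

    def __init__(self):
        self._facts = {}  # key -> (timestamp, value)

    def write(self, key, value, t):
        current = self._facts.get(key)
        # An older write must not overwrite a newer fact, even if it arrives late.
        if current is None or t >= current[0]:
            self._facts[key] = (t, value)

    def read(self, key):
        entry = self._facts.get(key)
        return entry[1] if entry else None  # None means "abstain"


# (ability, key to query, expected answer); None expects abstention.
CASES = [
    ("knowledge_update", "favorite_brand", "Puma"),
    ("temporal", "alice_employer", "Stripe"),
    ("abstention", "user_shoe_size", None),
]


def run_cases(memory, cases):
    """Return pass/fail per ability so failures are visible by category."""
    return {ability: memory.read(key) == expected
            for ability, key, expected in cases}


memory = ToyMemory()
memory.write("favorite_brand", "Puma", t=2)
memory.write("favorite_brand", "Adidas", t=1)   # stale fact arriving out of order
memory.write("alice_employer", "Stripe", t=3)

outcome = run_cases(memory, CASES)
print(outcome)
```

The out-of-order write is deliberate: a store that simply keeps the last value written would answer "Adidas" and fail the knowledge-update case, which is the kind of failure a question-answer benchmark over static transcripts tends not to surface.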

#References

Wu et al., "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory," arXiv 2410.10813 (the five long-term-memory abilities and the ~30%+ degradation of commercial assistants over sustained interaction).
Packer et al., "MemGPT: Towards LLMs as Operating Systems," arXiv 2310.08560 (DMR and the 93.4% figure).
Rasmussen et al., "Zep: A Temporal Knowledge Graph Architecture for Agent Memory," arXiv 2501.13956, and the Zep "state of the art agent memory" writeup (blog.getzep.com) (DMR 94.8%, the LongMemEval per-ability breakdown, and the honest -17.7% single-session result).
mem0, "OSS v2 to v3 migration" and repository, github.com/mem0ai/mem0 (LoCoMo 71.4 to 91.6, LongMemEval 67.8 to 93.4, and the latency reductions).
Supermemory, MemoryBench and the MemScore docs, github.com/supermemoryai/supermemory (the accuracy / latency / context-tokens triple, the token-isolation method, and the "90% at 5,000 tokens vs 90% at 500 tokens" rationale).
snap-research/locomo, the LoCoMo benchmark repository, github.com/snap-research/locomo (the reported 72.8% LoCoMo result for Gemini 2.5 Flash with no memory system, cited as a community claim).
r/LocalLLaMA, "Woah. Letta vs Mem0" (the "memory is not retrieval" definition, the vendor benchmark-war accusations, and the Gemini-on-LoCoMo data point) and "Universal LLM Memory Doesn't Exist" (the 14 to 77 times cost and ~30% accuracy claims, reported as practitioner experience).