
How Memory Systems Break in Production

A field catalog of how AI memory systems fail, each with its symptom, root cause, the fix that worked, and the lesson, from fan-out amplification and score-fusion bugs to hallucinated IDs and confabulation.

Most memory bugs do not announce themselves. A retrieval system that returns the wrong chunk looks identical to one returning the right chunk: text comes back, the model answers, no stack trace. A background extraction job dead for two weeks looks the same as a system with nothing to remember. The interesting failures are silent, plausible, or both.

What follows is a catalogue drawn from production memory systems (MemoryPlugin and AskLibrary among them) and the open-source field. Each entry has the same shape: the symptom you observe, the cause underneath, the fix that worked, and the lesson. They fall into three families: infrastructure that fails quietly, retrieval that surfaces the wrong thing, and an LLM corrupting the store it maintains.

#Fan-out amplification: one flaky call, multiplied

A feature passes every test, then fails for real users at a rate nobody can explain. In one production knowledge-graph builder, 8 of 16 runs failed over a month, and a user with a large bucket failed six times out of six. The job fanned out roughly 14 structured-output LLM calls in parallel and waited on all of them. Each call failed independently about 5% of the time, returning JSON that would not validate against the schema. With N independent calls at per-call failure rate p, the whole job fails with probability 1 - (1 - p)^N, which is about 51% at p = 0.05 and N = 14, in line with the observed 8 of 16. A job-level retry did not help: it re-ran all 14 calls and re-rolled every die. The fix was a bounded retry around each call (three attempts, exponential backoff, bailing on non-transient errors like a bad key or quota) plus partial degradation, so one dead batch costs a few entities instead of the whole graph.
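The per-call retry and partial degradation described above can be sketched roughly as follows. This is a minimal illustration, not the production code: `TransientError`, `FatalError`, `call_with_retry`, and `run_fan_out` are hypothetical names, and the thread pool and delay values are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TransientError(Exception):
    """Hypothetical: worth retrying, e.g. JSON that fails schema validation."""


class FatalError(Exception):
    """Hypothetical: retrying cannot help, e.g. a bad key or exhausted quota."""


def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run fn up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except FatalError:
            raise  # non-transient: bail immediately, do not burn retries
        except TransientError:
            if attempt == attempts - 1:
                raise  # retries exhausted
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...


def run_fan_out(tasks, max_workers=8, **retry_kwargs):
    """Retry each task on its own; return (results, failures) instead of all-or-nothing."""
    def guarded(task):
        try:
            return ("ok", call_with_retry(task, **retry_kwargs))
        except (TransientError, FatalError) as exc:
            # Partial degradation: record the dead batch and keep the rest.
            # A real job might choose to abort entirely on a FatalError.
            return ("failed", exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(guarded, tasks))
    results = [value for status, value in outcomes if status == "ok"]
    failures = [value for status, value in outcomes if status == "failed"]
    return results, failures
```

The caller then builds the graph from `results` and logs `failures`, so a single batch that never validates costs only its own entities.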

Lesson: failure probability compounds across fan-out. Harden the individual call, not just your choice of model. Cost and latency works through the arithmetic.
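To make the compounding concrete, here is a small worked calculation. It assumes failures are independent both across calls and across retries, which a correlated problem such as a provider outage would violate; the function name is illustrative.

```python
def job_failure_rate(p: float, n: int) -> float:
    """Probability that at least one of n independent calls fails."""
    return 1 - (1 - p) ** n


# No per-call retry: 14 calls, each failing 5% of the time.
print(f"{job_failure_rate(0.05, 14):.1%}")       # 51.2%

# Three attempts per call: a call only fails if all three attempts fail,
# so the effective per-call rate is 0.05 ** 3 (assuming independent retries).
print(f"{job_failure_rate(0.05 ** 3, 14):.2%}")  # 0.17%
```

Under those assumptions, retrying the individual call moves the job from roughly a coin flip to well under one percent, whereas a job-level retry leaves each attempt at about 51%.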

#Score fusion across mismatched scales

A hybrid search configured as "70 percent semantic" behaves like a keyword engine. An audit found results fused as a weighted sum: dense score times 0.7 plus sparse score times 0.3. But the dense score is a cosine similarity that sits roughly between 0 and 1, while raw BM25 is unbounded and usually lands between 5 and 30. A perfect semantic hit with no keyword overlap (cosine 0.92, BM25 0) fuses to 0.64; a mediocre keyword match (cosine 0.55, BM25 14) fuses to 4.59, about seven times higher. The fix is to fuse by rank, not raw score: Reciprocal Rank Fusion sums 1/(k + rank) for each document across the ranked lists, so the scale mismatch cannot arise, and it needs no re-embedding.
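A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document IDs. The default k = 60 is the constant used in the Cormack et al. paper; the function name and the tie-break on ID are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; only positions matter, never raw scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b"]   # ordered by cosine similarity
sparse = ["c", "b"]  # ordered by BM25
print(reciprocal_rank_fusion([dense, sparse]))  # ['b', 'a', 'c']
```

Because each list can contribute at most 1/(k + 1) to a document, a BM25 score of 14 has no more pull than a cosine of 0.92; what counts is where each document landed in its own list.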

Lesson: never linearly combine scores that live on different scales. Hybrid retrieval covers fusion in depth.

#Garbage memories suppress recall

A safety-trained model refuses to use injected memory at all, sometimes saying out loud that the memories would "just add more clutter." The injected list in one case held entries like "A cat is better than a bat", a bare "...", and four near-identical copies of a casual greeting. Low-value and duplicate memories do not merely waste tokens; they hand the model a concrete reason to reject the whole recall protocol, and a delivery format that reads as prompt injection compounds it (see privacy and pitfalls). The fix has three parts: score quality at save time (refusing greetings, small talk, and short fragments), dedupe server-side before injection, and cap the injected set at a few strong examples rather than a dozen mixed ones.
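The three steps might look something like the sketch below. Everything here is an assumption for illustration: the greeting list, the word-count threshold, and the exact-match dedupe are crude stand-ins for a real quality scorer and near-duplicate detection, and a heuristic this simple would still let "A cat is better than a bat" through.

```python
import re

# Hypothetical stop-list; a real system would more likely use a scoring model.
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "good morning"}


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", text.lower()).split())


def worth_saving(text, min_words=4):
    """Save-time gate: refuse empty fragments, greetings, and very short text."""
    norm = normalize(text)
    if not norm or norm in GREETINGS:
        return False
    return len(norm.split()) >= min_words


def select_for_injection(memories, limit=5):
    """memories: list of (text, score). Dedupe on normalized text, keep the top few."""
    best = {}
    for text, score in memories:
        key = normalize(text)
        if key not in best or score > best[key][1]:
            best[key] = (text, score)
    ranked = sorted(best.values(), key=lambda item: -item[1])
    return [text for text, _ in ranked[:limit]]
```

The cap matters as much as the filter: even clean memories read as clutter when a dozen of them arrive for a question that needs two.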

Lesson: hygiene is a recall feature, not housekeeping. Garbage degrades the model's willingness to use any of it.

#Hallucinated IDs

A curation job that mutates memory starts throwing ownership errors about missing IDs, because the LLM proposed operations on records that do not exist. Models invent identifiers: ask one for the IDs of memories to merge or delete and it will sometimes return a UUID it never saw, or one already deleted mid-run. Two defences stack. First, never let an unvalidated ID reach a write: check every returned ID against the exact input set and assert the user owns it before any deletion fires. Second, do not show the model raw UUIDs at all. mem0 labels existing memories with small integers ("0", "1", "2") and maps them back afterwards, instructing the model to reuse the input IDs only.
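Both defences can be sketched in a few lines. This is loosely modelled on the integer-label idea attributed to mem0 above, not on mem0's actual code; the function names and the operation shape (`{"op": ..., "id": ...}`) are hypothetical.

```python
def label_memories(memories):
    """memories: dict of real_id -> text. Show the model small integer labels only."""
    label_to_id = {}
    prompt_lines = []
    for index, (real_id, text) in enumerate(memories.items()):
        label = str(index)
        label_to_id[label] = real_id
        prompt_lines.append(f"[{label}] {text}")
    return prompt_lines, label_to_id


def resolve_operations(proposed, label_to_id, owned_ids):
    """Accept an operation only if its label was shown AND the user owns the record."""
    accepted, rejected = [], []
    for op in proposed:
        real_id = label_to_id.get(str(op.get("id")))
        # owned_ids should be read fresh, so records deleted mid-run drop out.
        if real_id is None or real_id not in owned_ids:
            rejected.append(op)  # invented, stale, or not this user's record
        else:
            accepted.append({**op, "id": real_id})
    return accepted, rejected
```

Only `accepted` ever reaches the write path; `rejected` is worth logging, since a rising count of invented labels is an early signal that the curation prompt is drifting.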

Lesson: an LLM that mutates a store cannot be trusted with identifiers. Validate every ID against input and ownership. Memory suggestions shows the full admission flow.

#Silent infrastructure outage

A feature produces nothing for weeks and no alarm fires, because "no output" is indistinguishable from "nothing to do." In one case, conversation summaries failed 100 percent of the time for two weeks. The cause: the API key for the background model provider had never been added to the job runner's production secrets, so every task failed instantly on the missing key with no loud signal. From the outside, an empty result and a total outage look the same. Backfilling recovered the stuck jobs, but the real fix is operational: alert on task failure rate and on a feature emitting zero output when it should emit some, and treat secret provisioning as part of every deploy.
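As a rough sketch of those two alerts plus a deploy-time secret check: the field names, thresholds, and the `SUMMARY_MODEL_API_KEY` variable below are all hypothetical, and in practice the counts would come from whatever metrics store the job runner already writes to.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    attempted: int        # background tasks started in the window
    failed: int           # tasks that ended in error
    outputs: int          # e.g. summaries actually written
    eligible_inputs: int  # e.g. conversations that should have been summarised


def check_health(stats, max_failure_rate=0.2, min_eligible=10):
    """Return alert names for a high failure rate or for suspicious silence."""
    alerts = []
    if stats.attempted and stats.failed / stats.attempted > max_failure_rate:
        alerts.append("failure_rate")
    # Zero output is only alarming when there was clearly work to do.
    if stats.eligible_inputs >= min_eligible and stats.outputs == 0:
        alerts.append("suspicious_silence")
    return alerts


def missing_secrets(required, env):
    """Deploy-time check: names of required secrets that are absent or empty."""
    return [name for name in required if not env.get(name)]
```

The silence check compares output against expected work rather than against zero alone, which is what separates "nothing to do" from "nothing is running".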

Lesson: an un-provisioned secret is a silent total outage. Monitor failure rates and suspicious silence, because background work will not raise its hand.

#First-topic dominance

Extraction captures the opening topic of a multi-topic message in rich detail, then quietly drops everything after it. This is a named failure mode in mem0's extraction prompt: the model handles the first subject thoroughly and treats the rest of a long turn as filler, so later facts never become memories. There is no error; the facts are simply not there when you go looking later. The fix is to bias toward over-extraction. mem0's prompt tells the model "when in doubt, extract" (a redundant memory is cheaper than a missing one) and adds a coverage check: for conversations of ten or more messages, expect five to fifteen memories, and if you found fewer than three, re-read.
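mem0 expresses this coverage check inside the prompt itself. A code-side analogue, for pipelines that prefer to enforce it outside the model, might look like the sketch below; the `extract` callable and its `reread` flag are hypothetical, and only the thresholds (ten messages, fewer than three memories) come from the text above.

```python
def needs_reread(message_count, memories_found, long_conversation=10, floor=3):
    """A long conversation that yields almost nothing suggests truncated extraction."""
    return message_count >= long_conversation and memories_found < floor


def extract_with_coverage(messages, extract):
    """Run extraction once, then a single re-read pass if coverage looks too thin.

    `extract(messages, reread=...)` is a hypothetical callable that returns a list
    of memories; with reread=True it is expected to scan every topic, not just the first.
    """
    memories = extract(messages, reread=False)
    if needs_reread(len(messages), len(memories)):
        second_pass = extract(messages, reread=True)
        if len(second_pass) > len(memories):
            memories = second_pass
    return memories
```

Bounding the loop to one extra pass keeps the cost predictable; a conversation that really does contain only two facts just pays for one redundant call.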

Lesson: extraction silently truncates. Bias toward recall and verify coverage. Smart extraction goes deeper.

#Stale and resolved-versus-relevant over-injection

The agent keeps re-answering questions that were already settled and resurfaces facts that are no longer true. To a vector database, a closed ticket and an open one are the same object: both are semantically similar, and both compete for injection. Similarity has no notion of whether a fact is current or its work is done. One team reported that a single resolution-state field fixed most of their noisy recall: once a chunk is marked resolved, it stops competing even when similarity would pull it in. A self-inflicted variant is post-multiplying a reranker's relevance by a recency boost, so for a recall tool meant to surface old context, this week's chatter beats last year's decisive answer. The fix is explicit state and time fields the retriever can filter on (see temporal memory), not recency heuristics overriding relevance for a tool that exists to remember.
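The difference between filtering on state and multiplying by recency can be shown with a small sketch. The `Chunk` fields (`resolved`, `valid_until`) and the function name are assumptions for illustration; the source only says that a resolution-state field and explicit time fields were what fixed recall.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Chunk:
    text: str
    relevance: float                       # reranker score, left untouched
    resolved: bool = False                 # explicit resolution state
    valid_until: Optional[datetime] = None # None means still valid


def filter_then_rank(chunks, now, limit=5, include_resolved=False):
    """Drop resolved or expired chunks first, then order purely by relevance."""
    live = [
        c for c in chunks
        if (include_resolved or not c.resolved)
        and (c.valid_until is None or c.valid_until > now)
    ]
    return sorted(live, key=lambda c: -c.relevance)[:limit]
```

Age never enters the ordering, so an old but still-valid decision keeps its rank over recent chatter, while a closed ticket is removed outright however similar it is.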

Lesson: similarity is not the same as relevant-right-now. Encode resolution and recency as fields the retriever filters on, not multipliers over the score.

#Confabulation: trusting your own logs

A self-writing agent records actions it never performed, then treats those records as fact in a later session. One builder watched a run "confidently log three actions it never took, then use those logs as context the next session." The entries are not stale; they were never true, inferred from a plausible but wrong read of what happened. Because similarity search has no opinion on whether text describes something real, a confabulated memory passes every retrieval check. The fix is to separate proposing a memory from blessing it: an agent may suggest, but a durable write should be tied to a verifiable ground-truth signal with a real timestamp (a tool-grounded event, a user confirmation, a deterministic validator), never the agent's own narration.
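One way to separate proposing from blessing is an admission function that refuses any write lacking external evidence. The sketch below is an assumption about shape, not a reference design: the source labels, the `Evidence` and `Proposal` types, and the returned record format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical labels for signals that originate outside the agent's own narration.
TRUSTED_SOURCES = {"tool_event", "user_confirmation", "validator"}


@dataclass
class Evidence:
    source: str            # where the signal came from
    reference: str         # e.g. a tool-call ID or message ID that can be checked
    observed_at: datetime  # timestamp taken from the event, not from the model


@dataclass
class Proposal:
    text: str
    evidence: Optional[Evidence] = None


def admit(proposal):
    """Return a durable record only when the proposal carries verifiable evidence."""
    ev = proposal.evidence
    if ev is None or ev.source not in TRUSTED_SOURCES or not ev.reference:
        return None  # stays a suggestion; never written as fact
    return {
        "text": proposal.text,
        "source": ev.source,
        "reference": ev.reference,
        "observed_at": ev.observed_at.isoformat(),
    }
```

An agent log line such as "I emailed the client" would arrive with no evidence and be held back, while the same claim tied to a real tool-call ID would be admitted with that ID attached.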

Lesson: the dangerous bug is not forgetting but letting unverified output harden into trusted memory. RAG versus memory and memory suggestions cover the admission gate.

#References

mem0, repository and prompts, github.com/mem0ai/mem0 (the named "first topic dominance" extraction failure and the "when in doubt, extract" coverage check; integer-ID indirection against UUID hallucination; conflict resolution moved to read-time ranking).
r/LLMDevs, "RAG has not felt like enough for agent memory" (the storage-versus-admission split, confabulation in self-writing agents, and the resolved-versus-relevant field fix; reported as practitioner experience, not peer-reviewed).
r/LocalLLaMA, "Are vector databases fundamentally insufficient for long-term LLM memory?" (flat embeddings cannot update or supersede a fact, the root of stale over-injection; community discussion).
Cormack, G. V., Clarke, C. L. A., and Buettcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods." doi.org/10.1145/1571941.1572114 (the rank-based fix for cross-scale score fusion).
Anthropic (2024). "Introducing Contextual Retrieval." anthropic.com/news/contextual-retrieval (the hybrid dense-plus-BM25-plus-rerank pipeline that rank fusion slots into).