IV — Building it for real · 9 min

Anatomy of a Memory Pipeline

How chunking, extraction, retrieval, and ranking assemble into one real system: two pipelines sharing a store, where reads must be fast, writes belong off the critical path, and the whole thing must fail open.

A user types a message and hits enter. Two different jobs fire. One reads: it has to pull whatever the system already knows about this person and put it in front of the model before the reply starts streaming. The other writes: it has to decide whether anything in this new message is worth keeping. Run both inline and the user watches a spinner while an LLM decides that "thanks, that worked" contains no durable facts. That is the design problem of a memory pipeline in a single frame. The read cannot be slow, and the write should not be on the clock at all.

Everything in the earlier pages (chunking, embeddings, extraction, dedup, conflict resolution, ranking) is a part. The system is what you get when you see there are only two pipelines, pointed in opposite directions, joined by the store in the middle.

#The two paths

The write path turns activity into something storable: capture the raw input, extract the facts worth keeping, embed them, dedup or resolve against what already exists, then store. Each step has its own page. Capture and extraction decide what becomes a fact, embeddings and chunking decide how it is indexed, and the dedup-or-resolve step is the write-time-versus-read-time fork plus the admission gate that keeps unverified output from hardening into memory.
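As a rough sketch of that sequence, the stages can be wired together in a few lines. Everything here is an assumption made for illustration: `extract` is a keyword heuristic standing in for an extraction model, `embed` returns a set of words rather than a real vector, and the `0.8` dedup threshold is arbitrary. It shows the shape of the path, not how any of the systems named on this page implement it.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    tokens: frozenset


@dataclass
class Store:
    memories: list = field(default_factory=list)


def extract(message):
    """Hypothetical extractor: keep sentences that open like durable statements."""
    sentences = [s.strip() for s in message.split(".") if s.strip()]
    return [s for s in sentences if s.lower().startswith(("i prefer", "i use", "my "))]


def embed(text):
    """Toy stand-in for an embedding: the set of lowercased words."""
    return frozenset(text.lower().split())


def similarity(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def write_path(message, store, dedup_threshold=0.8):
    """Capture -> extract -> embed -> dedup -> store. Returns the facts newly stored."""
    stored = []
    for fact in extract(message):
        tokens = embed(fact)
        # Dedup: skip anything close enough to a memory we already hold.
        if any(similarity(tokens, m.tokens) >= dedup_threshold for m in store.memories):
            continue
        store.memories.append(Memory(fact, tokens))
        stored.append(fact)
    return stored
```

With this toy extractor, "Thanks, that worked." stores nothing, and repeating "I prefer dark mode." stores it once.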

The read path turns a question back into context: take the query, expand it into the phrasings the user likely used, run hybrid recall (dense plus keyword), rerank the candidates down to a precise few, assemble them within a token budget, inject them into the prompt, and generate.
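A minimal sketch of the middle of that path (recall, rerank, assemble) follows. It rests on loud assumptions: word overlap stands in for hybrid search, a length-normalised overlap stands in for a reranker, and whitespace word count stands in for a tokenizer. Query expansion, injection, and generation are left out. All function names are hypothetical.

```python
def recall(query, memories, k=30):
    """Toy recall: rank by shared words. Stands in for dense plus keyword search."""
    q = set(query.lower().split())
    scored = [(len(q & set(m.lower().split())), m) for m in memories]
    hits = [(s, m) for s, m in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in hits[:k]]


def rerank(query, candidates, keep=3):
    """Toy reranker: shared words divided by memory length, so tight matches win."""
    q = set(query.lower().split())

    def score(m):
        words = m.lower().split()
        return len(q & set(words)) / max(len(words), 1)

    return sorted(candidates, key=score, reverse=True)[:keep]


def assemble(memories, budget_tokens):
    """Greedy packing in rank order; word count approximates token count."""
    chosen, used = [], 0
    for m in memories:
        cost = len(m.split())
        if used + cost <= budget_tokens:
            chosen.append(m)
            used += cost
    return "\n".join(chosen)


def read_path(query, memories, budget_tokens=50):
    """Over-fetch broadly, cut to a precise few, then fit the budget."""
    candidates = recall(query, memories)
    top = rerank(query, candidates)
    return assemble(top, budget_tokens)
```

The budget is the last gate on purpose: a memory that ranks well but does not fit is skipped, and a smaller one further down the list may still get in.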

Figure 1. A memory system is two pipelines sharing a store: a write path (off the critical path) and a read path (on it), with a fail-open bypass so a memory outage degrades the answer instead of breaking the chat.

They share one store and almost nothing else. The write path optimises for not losing anything and not corrupting what is already there. The read path optimises for latency and for putting the right handful of facts in front of the model. Their priorities pull against each other, which is why the rest of this page is mostly about keeping them apart.

#Reads block; writes do not

The read path sits on the critical path of the response, so it has a hard latency budget. The usual moves are precompute, cache, and over-fetch-then-trim. Supermemory precomputes a per-user profile so basic context is one call of roughly 50 to 100 milliseconds instead of several searches, and it keeps a per-turn cache so a tool-call loop reuses the memory string it already fetched rather than hitting the API again on every iteration. The recall step over-fetches a few dozen candidates, then a reranker cuts them to the final few, which is cheaper than trying to get ranking perfect in one pass.
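The per-turn cache is the simplest of those moves to picture. The sketch below is an assumption about the general shape, not Supermemory's implementation: `TurnCache` and its `fetch` callable are hypothetical names, and the cache is simply dropped when the turn id changes.

```python
class TurnCache:
    """Reuse a fetched memory string for the duration of one turn."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable: query -> memory string
        self._turn_id = None
        self._entries = {}

    def get(self, turn_id, query):
        # A new turn invalidates everything cached for the previous one.
        if turn_id != self._turn_id:
            self._turn_id = turn_id
            self._entries = {}
        if query not in self._entries:
            self._entries[query] = self._fetch(query)
        return self._entries[query]
```

Inside a tool-call loop, every iteration of the same turn calls `get` with the same `turn_id`, so only the first iteration pays for the fetch.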

The write path has no business being there. Supermemory accepts a document, returns in milliseconds with status: queued, extracts in the background, and writes new memories after the response has already gone out. Searches are never blocked by ingestion. MemoryPlugin runs the same way: chat-history ingestion and the suggestions curator are background jobs, not inline steps. mem0 batches its write path (batch embed, batch insert, batch history rows) for throughput rather than per-message responsiveness. The rule is simple and load-bearing: a write that does not have to finish before the user sees their reply should not run before the user sees their reply.
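The accept-then-process pattern can be sketched with a queue and one background worker. This is a generic illustration using the standard library, not any vendor's ingestion code; `AsyncIngest`, `process`, and the `wait` helper are hypothetical names, and a production system would likely use a durable queue rather than an in-process one.

```python
import queue
import threading


class AsyncIngest:
    """Accept a document immediately; do the slow work on a background thread."""

    def __init__(self, process):
        self._process = process          # hypothetical slow step, e.g. extraction
        self._jobs = queue.Queue()
        self.done = []                   # results of finished jobs
        self.errors = []                 # failures are recorded, never swallowed
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, document):
        self._jobs.put(document)
        return {"status": "queued"}      # returns before any processing happens

    def _worker(self):
        while True:
            document = self._jobs.get()
            try:
                self.done.append(self._process(document))
            except Exception as exc:
                self.errors.append((document, exc))
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until the queue drains. Useful in tests, not on the request path."""
        self._jobs.join()
```

The caller's latency is the cost of one `put`, whatever `process` costs.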

#What it costs to think on every write

The most expensive thing you can do is run an extraction LLM on every message. A builder benchmarked Mem0 and Zep against a plain long-context baseline on a conversational QA set and reported that the memory systems were roughly 14 to 77 times more expensive over a full conversation, and about 30 percent less accurate at recalling facts, than just passing the full history. The root cause they named is the shared "LLM-on-write" pattern: a background model normalising facts on every turn. A separate commenter made the counter-case: an approach that does no per-turn extraction costs only about 1.001 times the baseline, because the stored memory tokens themselves are tiny. Both can be true. The cost is dominated almost entirely by whether you run an LLM on every write, not by storage.

The design consequence is to extract selectively. Working and execution state (tool outputs, logs, file paths, variables) wants simple lossless storage: a key-value store, an append-only log, a SQLite file. It does not need an LLM to reword it, and under roughly 200K tokens it is often cheaper and more accurate to just keep it in context. Reserve the expensive extraction for semantic memory, the durable preferences and profile facts that genuinely benefit from being distilled, and keep even that off the critical path. This is the same logic behind mem0 collapsing two write-time LLM calls into one, and behind MemoryPlugin's choice not to silently run an extractor over every chat: its chat-history wedge stores the raw conversation and synthesises the relevant bits at read time instead, so the per-message cost is an embedding, not a reasoning call. The Cost and latency page takes the economics apart in detail.
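One way to picture selective extraction is a router in front of the write path. The sketch assumes each item arrives already tagged with a `kind`, which is itself a simplification, and the kind names and return strings are made up for the example.

```python
# Kinds treated as execution state: stored verbatim, never sent to an LLM.
LOSSLESS_KINDS = {"tool_output", "log", "file_path", "variable"}


def route(item, raw_log, extraction_queue):
    """Send execution state to lossless storage and the rest to background extraction."""
    if item["kind"] in LOSSLESS_KINDS:
        raw_log.append(item)             # append-only, no rewording
        return "stored_raw"
    extraction_queue.append(item)        # distilled later, off the critical path
    return "queued_for_extraction"
```

The interesting design question is where the tag comes from. If it takes a model call to decide the kind, the saving is gone.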

#Fail open, but do not fail silent

Memory is an enhancement, not a dependency. When the memory layer breaks, the chat has to continue without it. Supermemory's router makes this the explicit goal: if Supermemory errors, the request passes through to the LLM unmodified, with an x-supermemory-error header so you can see it happened. MemoryPlugin degrades the same way at every shaky step. A rerank failure falls back to the raw vector order, a failed query-expansion call falls back to the literal query, and a retrieval timeout (enforced with an abort controller) caps how long the pre-LLM fetch can block before the response proceeds with no injected memory at all. An outage should cost you a less-informed answer, never a broken one.
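A minimal sketch of the timeout-and-pass-through behaviour, using `asyncio.wait_for` as a Python analogue of the abort-controller cap described above. The function name, the 300 ms default, and the `(context, error)` return shape are assumptions for illustration. The second element plays the role the error header plays in a proxy: the caller proceeds either way but can see that memory was skipped.

```python
import asyncio


async def fetch_memory_fail_open(fetch, query, timeout_s=0.3):
    """Return (context, error). On any failure the context is empty, never an exception."""
    try:
        context = await asyncio.wait_for(fetch(query), timeout=timeout_s)
        return context, None
    except asyncio.TimeoutError:
        # The fetch is cancelled; the response proceeds without memory.
        return "", "timeout"
    except Exception as exc:
        return "", f"error: {exc}"
```

A caller would build the prompt from `context` whatever its value, and log or surface `error` when it is not `None`.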

There is a sharp edge here, though. Fail-open is not the same as fail-silent.

From the trenches

MemoryPlugin's quality recall mode fans a relevance-and-summarisation pass out across many memories in parallel, routed to a strong provider that, under wide fan-out, would start returning "model busy, retry later." The per-call failures were being swallowed, so a saturated provider looked exactly like "no relevant context found." The system returned nothing and reported itself healthy. The fix was to stop conflating the two: if more than half the relevance calls fail, raise an explicit saturation error that tells the caller to retry or drop to the faster mode, rather than handing back an empty result. The provider also got a bounded concurrency limit and more retries so it saturated less often in the first place. Degrade gracefully, but make a failure look like a failure.
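The distinction is easier to see in code. This is a sketch of the idea, not MemoryPlugin's implementation: `relevance_fanout`, `judge`, and `SaturationError` are hypothetical names, the concurrency limit of 4 is arbitrary, and retries are omitted. The more-than-half threshold follows the text above.

```python
import asyncio


class SaturationError(RuntimeError):
    """Too many relevance calls failed to trust an empty result."""


async def relevance_fanout(memories, judge, max_concurrency=4):
    """Run judge(memory) -> bool over all memories with bounded concurrency."""
    limit = asyncio.Semaphore(max_concurrency)

    async def one(memory):
        async with limit:
            return await judge(memory)

    results = await asyncio.gather(*(one(m) for m in memories), return_exceptions=True)
    failures = [r for r in results if isinstance(r, Exception)]

    # "Provider saturated" and "nothing relevant" must not look the same.
    if memories and len(failures) > len(memories) / 2:
        raise SaturationError(
            f"{len(failures)}/{len(memories)} relevance calls failed; "
            "retry or drop to the faster mode"
        )
    return [m for m, r in zip(memories, results) if r is True]
```

An empty list now means the judge looked and found nothing. A saturated provider raises instead.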

#Keep the receipts

Every durable change should leave a trail. mem0 keeps a SQLite history database, separate from the vector store, that records every ADD, UPDATE, and DELETE with the old and new text. MemoryPlugin soft-deletes memory rows (each deleted row keeps a deletion reason and a merged_into pointer to the survivor) while hard-deleting the corresponding vector, so the relational record of what happened to a fact outlives its embedding. This append-only log is not bookkeeping for its own sake. It lets you trace a fact back to where it came from, which is the only real defence against confabulated memories that pass every retrieval check because similarity search has no opinion on whether the text describes something that actually happened. It is also what makes "invalidate, do not delete" possible, and what lets you answer, auditably, what you store and why a fact changed.
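A history log of this kind fits in a single table. The schema below is a hypothetical illustration built on Python's `sqlite3` module. It is not mem0's actual schema, and the table, column, and function names are invented for the example.

```python
import sqlite3


def open_history(path=":memory:"):
    """Create (if needed) an append-only history table and return the connection."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS history (
               id         INTEGER PRIMARY KEY AUTOINCREMENT,
               memory_id  TEXT NOT NULL,
               event      TEXT NOT NULL CHECK (event IN ('ADD', 'UPDATE', 'DELETE')),
               old_text   TEXT,
               new_text   TEXT,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return db


def record(db, memory_id, event, old_text, new_text):
    """Append one change. Rows are only ever inserted, never updated or removed."""
    db.execute(
        "INSERT INTO history (memory_id, event, old_text, new_text) VALUES (?, ?, ?, ?)",
        (memory_id, event, old_text, new_text),
    )
    db.commit()


def trail(db, memory_id):
    """Return every change to one memory, oldest first."""
    rows = db.execute(
        "SELECT event, old_text, new_text FROM history WHERE memory_id = ? ORDER BY id",
        (memory_id,),
    )
    return rows.fetchall()
```

Because the log lives apart from the vector store, the trail for a memory survives even after its embedding has been deleted.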

Once the skeleton is right, the open questions are no longer architectural. They are how to tell whether the memory is actually helping (evaluating), what the layer costs and how to route around the expensive parts (cost and latency), and the catalogue of ways the whole thing breaks in production (failure modes).

#References

mem0, repository and "OSS v2 to v3 migration," github.com/mem0ai/mem0 (the batched single-pass write path, the separate SQLite history log of every ADD/UPDATE/DELETE, and the move of conflict resolution from write time to read-time ranking).
Supermemory, architecture and Memory Router concept docs, github.com/supermemoryai/supermemory (queued async ingestion that never blocks search, precomputed profiles, per-turn caching, and the fail-open proxy that passes a request through unmodified with an x-supermemory-error header).
r/LocalLLaMA, "Universal LLM Memory Doesn't Exist" (a builder's report of memory systems running 14 to 77 times more expensive and about 30 percent less accurate than long context, attributed to the LLM-on-write pattern, with the counter-claim that no-per-turn-extraction designs add roughly 1.001 times the cost; reported as practitioner experience, not peer-reviewed).
r/LLMDevs, "Universal 'LLM memory' is mostly a marketing term" (the working/episodic/semantic/document split and the case for lossless storage of execution state rather than running extraction over everything).