IV — Building it for real · 8 min

Cost and Latency

Why embeddings are cheap and LLM calls dominate a memory bill, and how to route models by task, survive fan-out, and pin latency without breaking privacy or the critical path.

The first instinct when a memory system's bill arrives is to blame the vector database. It's almost always wrong. Storing and searching vectors is cheap, and embedding text is cheaper still. What costs money, and what costs seconds, is the reasoning you wrap around the storage: the LLM calls that extract facts, dedup them, summarise a bucket, build a graph, or judge whether a retrieved chunk is relevant. Get the routing of those calls right and a memory layer is affordable. Get it wrong and a single flaky model on a fan-out can fail half your jobs, or a privacy-correct default can quietly multiply your latency by ten, without a line of the storage code changing.

#The cost is the thinking, not the storing

In one production system (MemoryPlugin), embeddings run about $0.00002 per 1,000 tokens, roughly $200 for ten billion tokens. At that rate, embedding every memory and every chat chunk is a rounding error next to everything else. The vector search is cheap too: approximate nearest-neighbour over an HNSW index is CPU-bound and never scans the whole set.
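The arithmetic behind that figure is easy to check. A minimal sketch, assuming the flat per-1,000-token rate quoted above; the constant and function names are illustrative, not taken from any provider's SDK.

```typescript
// Rate quoted above: $0.00002 per 1,000 tokens.
const EMBEDDING_USD_PER_1K_TOKENS = 0.00002;

// Hypothetical helper: dollars to embed `tokens` tokens at a flat per-1K rate.
function embeddingCostUsd(
  tokens: number,
  usdPer1k: number = EMBEDDING_USD_PER_1K_TOKENS,
): number {
  return (tokens / 1000) * usdPer1k;
}

// Ten billion tokens at the quoted rate.
console.log(embeddingCostUsd(1e10).toFixed(2)); // "200.00"
```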

The bill comes from the reasoning calls. Extraction turns a turn into facts. A dedup curator runs an LLM over each cluster of similar memories. Summaries compress a bucket. Knowledge-graph generation fans a model out over memory batches. At read time, a relevance pass judges and rewrites each retrieved chunk. Every one of those is an inference call, often many per item, and that is where both the dollars and the milliseconds live.

#Route by what gets reviewed

The highest-leverage decision isn't which single cheap model to standardise on. It's matching each task's model to whether its output gets checked downstream.

Take two tasks in the same system, decided in opposite directions. Knowledge-graph generation fires many parallel extraction calls and then feeds two quality stages: an entity-review pass, and a stronger model reviewing the assembled graph. Because those stages clean up after it, the extractor can be a fast, lean model. A benchmark in that system found a lean model and a much larger one produced the same entity count, so paying for the larger extractor bought nothing. Conversation summaries are the opposite shape. A summary is injected into the prompt as written, and nothing reviews it. So that system deliberately keeps summaries on the stronger, higher-quality model, precisely because the output is unverified. Same org, same week, opposite call, decided entirely by what reads the result next.

Figure 1. Route by what reviews the output. A task whose result is cleaned up by downstream stages can use a lean, cheap model; a task whose output nothing checks should pay for the stronger one.
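The routing rule can be written down as a one-line policy. This is a sketch under stated assumptions: the task names, the two-tier split, and the `reviewedDownstream` flag are all hypothetical labels for illustration, not identifiers from the system described.

```typescript
// Hypothetical model tiers; real deployments would map these to concrete model IDs.
type ModelTier = "lean" | "strong";

interface TaskProfile {
  name: string;
  // True when a later stage checks or cleans up this task's output.
  reviewedDownstream: boolean;
}

// Route by what reads the result next: reviewed output can use the lean tier,
// unreviewed output pays for the strong tier.
function pickTier(task: TaskProfile): ModelTier {
  return task.reviewedDownstream ? "lean" : "strong";
}

const graphExtraction: TaskProfile = { name: "graph-extraction", reviewedDownstream: true };
const conversationSummary: TaskProfile = { name: "conversation-summary", reviewedDownstream: false };

console.log(pickTier(graphExtraction));     // "lean"
console.log(pickTier(conversationSummary)); // "strong"
```

Keeping the flag on the task rather than hard-coding a model per call site makes the decision auditable: adding or removing a review stage is what changes the tier.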

#Fan-out multiplies failure

A cost that shows up on no pricing page: parallelism multiplies a model's flakiness. If a single structured-output call fails to validate with probability p, and you run N of them in parallel and require all to succeed, the job fails with probability 1 - (1 - p)^N, assuming the failures are independent.

The numbers from one such job make it concrete. A preview model returned schema-invalid structured output about 5% of the time. The graph job ran roughly 14 of those calls in a Promise.all, which rejects if any single call fails. At p ≈ 0.05 and N = 14, the formula predicts about a 50% job-failure rate, and the system saw 8 of 16 runs fail over a month. One user's bucket failed 6 times out of 6 and never once succeeded. A task-level retry didn't help: retrying re-ran all 14 calls and re-rolled every die.
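The prediction can be reproduced in a few lines. A minimal sketch, assuming each call fails independently with the same probability; the function name is illustrative.

```typescript
// Probability that a job fails when it needs all n calls to succeed
// and each call fails independently with probability p: 1 - (1 - p)^n.
function jobFailureProbability(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Figures quoted above: p ≈ 0.05, N = 14.
console.log(jobFailureProbability(0.05, 14).toFixed(3)); // "0.512"
```

A whole-job retry does not change p or N, so every attempt faces the same roughly even odds.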

The fix is resilience, not a model swap. Wrap each call in its own bounded retry (exponential backoff, bailing immediately on non-transient errors like a bad key or an exhausted quota), and degrade partially so one dead batch costs a few entities rather than the whole graph. A model swap alone wasn't trusted, because any model occasionally misses a nested schema. A later twist proved the point a second time: a stronger but slower model, swapped in under the same 14-way fan-out, timed the whole job out. The landing spot was the lean extractor plus the review stages downstream of it, which generates each user's graph in under a minute.
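Both halves of that fix, per-call retry and partial degradation, fit in a short sketch. Everything named here (`withRetry`, `extractAll`, `RetryOptions`, the default-free options object) is hypothetical; the source describes the behaviour, not this code. The transient-versus-fatal decision is left to a caller-supplied `isTransient` predicate because how a bad key or quota error looks depends on the provider.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  // Caller-supplied classifier: return false for errors retrying cannot fix.
  isTransient: (err: unknown) => boolean;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry one call with exponential backoff; rethrow non-transient errors at once.
async function withRetry<T>(call: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!opts.isTransient(err)) throw err;
      lastError = err;
      const isLastAttempt = attempt === opts.maxAttempts - 1;
      if (!isLastAttempt) await sleep(opts.baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Give every batch its own retry, then keep whatever succeeded.
// Promise.allSettled never rejects, so one dead batch cannot sink the job.
async function extractAll<T>(
  batches: Array<() => Promise<T[]>>,
  opts: RetryOptions,
): Promise<{ entities: T[]; failedBatches: number }> {
  const settled = await Promise.allSettled(batches.map((batch) => withRetry(batch, opts)));
  const entities: T[] = [];
  let failedBatches = 0;
  for (const result of settled) {
    if (result.status === "fulfilled") entities.push(...result.value);
    else failedBatches++;
  }
  return { entities, failedBatches };
}
```

If failures are independent (an assumption), three attempts per call turn p = 0.05 into 0.05^3 = 0.000125 per call, and 1 - (1 - 0.000125)^14 comes to about 0.17% per job. Returning `failedBatches` alongside the entities lets the caller log or alert on partial graphs instead of hiding them.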

#The latency traps: throughput, saturation, and silent outages

Three production scars, all latency, none about the model's quality.

Throughput routing. Summaries in that system route through an aggregator with a privacy filter that restricts to providers which do not retain data. With the filter on but no throughput preference, the aggregator round-robined across slow compliant providers, and one 136K-token conversation took 8.9 minutes. Adding a throughput sort while keeping the same compliance filter pinned it to a fast compliant provider: 8.9 minutes dropped to about 44 seconds, roughly 12 times faster, at one tenth the cost, same model and quality. A privacy-correct default had quietly added an order of magnitude of latency. (A tempting follow-up, filtering by numeric precision, backfires: the most-degraded precision, fp4, also tended to be the slowest, so the throughput sort already routed around it, while fp8 was that model's native training precision and no downgrade.)
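In request terms, the change is one extra field next to the privacy filter. The sketch below builds a request body using the two provider-routing controls named in this section's OpenRouter reference, `data_collection: "deny"` and `sort: "throughput"`. The interfaces, the `buildSummaryRequest` helper, and the prompt text are assumptions for illustration, and the exact field semantics should be checked against the current OpenRouter documentation.

```typescript
// Subset of provider-routing preferences relevant here.
interface ProviderPreferences {
  data_collection: "allow" | "deny";
  sort?: "price" | "throughput" | "latency";
}

interface ChatRequestBody {
  model: string;
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  provider: ProviderPreferences;
}

// Build a summary request that keeps the privacy filter and adds the throughput sort.
// `model` is a placeholder supplied by the caller, not the model used in the source system.
function buildSummaryRequest(model: string, conversation: string): ChatRequestBody {
  return {
    model,
    messages: [
      { role: "system", content: "Summarise the conversation." },
      { role: "user", content: conversation },
    ],
    provider: {
      data_collection: "deny", // compliance filter: unchanged
      sort: "throughput",      // among compliant providers, prefer the fastest
    },
  };
}
```

The object would be serialised as the JSON body of a chat-completions request. The point of the sketch is that the compliance filter and the speed preference are independent settings, so adding the second does not weaken the first.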

Saturation as a first-class error. Under wide read-time fan-out, the strong provider used for relevance judging starts returning "model busy, retry later." Those per-call failures were being swallowed, so a saturated provider looked exactly like "no relevant context found." The repair had two halves: bound concurrency for that one provider and raise its retry count so it saturates less, and when more than half the calls in a batch fail, surface an explicit saturation error that tells the caller to retry or switch to the faster mode, instead of returning a false empty.
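A sketch of the second half of that repair, with bounded concurrency included and per-call retries omitted for brevity. The names `SaturationError`, `runBounded`, and `judgeRelevance`, and the default concurrency of 4, are hypothetical; only the "more than half failed" threshold comes from the description above.

```typescript
// Hypothetical error the caller can catch to retry later or switch to a faster mode.
class SaturationError extends Error {
  constructor(public readonly failed: number, public readonly total: number) {
    super(`provider saturated: ${failed} of ${total} relevance calls failed`);
    this.name = "SaturationError";
  }
}

type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run fn over items with at most `limit` calls in flight, recording every outcome.
async function runBounded<I, O>(
  items: I[],
  limit: number,
  fn: (item: I) => Promise<O>,
): Promise<Outcome<O>[]> {
  const outcomes: Outcome<O>[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        outcomes[index] = { ok: true, value: await fn(items[index]) };
      } catch (error) {
        outcomes[index] = { ok: false, error };
      }
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcomes;
}

// Keep the chunks judged relevant; if more than half the calls failed,
// report saturation instead of returning a false empty.
async function judgeRelevance<C>(
  chunks: C[],
  judge: (chunk: C) => Promise<boolean>,
  concurrency = 4,
): Promise<C[]> {
  const outcomes = await runBounded(chunks, concurrency, judge);
  const failed = outcomes.filter((o) => !o.ok).length;
  if (failed * 2 > outcomes.length) throw new SaturationError(failed, outcomes.length);
  return chunks.filter((_, i) => {
    const o = outcomes[i];
    return o.ok && o.value;
  });
}
```

Below the threshold, failed calls are still dropped silently in this sketch; a production version would likely count and log them as well.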

Silent outages. The most expensive failure in this set cost no compute at all. A summary feature failed 100% for two weeks because an API key was never added to the background-jobs environment, and a missing key fails the task in a few seconds with no loud signal. Provisioning is part of cost. A feature that instant-fails doesn't bill you; it just quietly stops working.
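One inexpensive guard is to check required configuration when the worker starts, so a missing key stops the process loudly instead of failing each task quietly. A minimal sketch; the helper names and the `SUMMARY_API_KEY` variable are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Return the names of required keys that are missing or blank in `env`.
function missingEnvKeys(env: Env, required: string[]): string[] {
  return required.filter((key) => (env[key] ?? "").trim() === "");
}

// Fail loudly at startup instead of letting each task instant-fail later.
function assertEnv(env: Env, required: string[]): void {
  const missing = missingEnvKeys(env, required);
  if (missing.length > 0) {
    throw new Error(`missing required environment variables: ${missing.join(", ")}`);
  }
}

// Example with an explicit object; in a Node worker this would be process.env.
const exampleEnv: Env = { SUMMARY_API_KEY: "" };
console.log(missingEnvKeys(exampleEnv, ["SUMMARY_API_KEY"])); // [ 'SUMMARY_API_KEY' ]
```

Running the same check in every environment that executes the feature, including the background-jobs one, is what closes the gap described above.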

#Keep memory off the critical path

The cheapest expensive call is the one you never make. Writes (extraction, dedup, summarisation) belong in async jobs that run after the response, not inline, and reads get a latency budget and degrade gracefully when they blow it (see pipeline anatomy). Under roughly 200K tokens, the honest answer is often to skip the memory machinery entirely and put the material in context: a builder community reported memory systems running 14 to 77 times more expensive than plain long context on a conversational QA set, driven by running an extraction LLM on every single message. Treat that as a practitioner claim, not a benchmark, but the direction is right: the per-message reasoning call, not the storage, is what blows up the bill.
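A read-side latency budget can be as small as a timer raced against the lookup. The sketch below is one way to do it, assuming an empty or stale result is an acceptable fallback; `withLatencyBudget` is a hypothetical helper, and it does not cancel the underlying lookup, it only stops waiting for it.

```typescript
// Resolve with the lookup result if it settles within `budgetMs`, otherwise with `fallback`.
// A failed lookup also degrades to `fallback` rather than failing the response.
function withLatencyBudget<T>(lookup: Promise<T>, budgetMs: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve) => {
    // Whichever of the timer or the lookup calls resolve first wins; later calls are ignored.
    const timer = setTimeout(() => resolve(fallback), budgetMs);
    lookup
      .then(
        (value) => resolve(value),
        () => resolve(fallback),
      )
      .then(() => clearTimeout(timer));
  });
}
```

A caller would wrap its memory read, for example `withLatencyBudget(searchMemories(query), 300, [])` with a hypothetical `searchMemories`, and build the prompt from whatever comes back.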

When memory is a service, the economics are explicit. Supermemory's router writes new memories asynchronously after the response, claims up to 70% token savings on long conversations by re-injecting only relevant memories once a thread passes about 20K tokens, and prices usage near $1 per million tokens. Whatever the provider, the cost question reduces to the same triple, which is worth measuring anyway: accuracy, latency, and the context tokens the memory layer actually contributes (see evaluating). A system that hits 90% accuracy at 500 injected tokens is cheaper to run forever than one that hits 90% at 5,000.
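The 500-versus-5,000 comparison is linear in request volume, which a short calculation makes visible. This is an illustration only: the $1 per million tokens price and the one million requests are assumed round numbers, and the function name is hypothetical.

```typescript
// Dollars spent on memory-injected context across `requests` requests.
function injectedContextCostUsd(
  injectedTokens: number,
  requests: number,
  usdPerMillionTokens: number,
): number {
  return (injectedTokens * requests * usdPerMillionTokens) / 1000000;
}

// Assumed for illustration: $1 per million tokens, one million requests.
console.log(injectedContextCostUsd(500, 1000000, 1));  // 500
console.log(injectedContextCostUsd(5000, 1000000, 1)); // 5000
```

At equal accuracy, the tenfold gap in injected tokens is a tenfold gap in that line of the bill at any volume and any price.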

#References

MemoryPlugin (one production system), engineering notes on knowledge-graph generation, conversation summaries, and read-time relevance routing. Source of the fan-out failure math (about 5% per call, 14 parallel, 8 of 16 jobs failing), the throughput-sort speedup (8.9 minutes to roughly 44 seconds at one tenth the cost), and saturation-as-error handling described above.
OpenRouter, Provider Routing documentation (the sort: "throughput" and data_collection: "deny" controls behind the privacy-versus-latency tradeoff).
Supermemory, Memory Router and pricing concepts, github.com/supermemoryai/supermemory (async memory creation off the critical path, the up-to-70% token-savings claim past ~20K tokens, and the $1-per-million-tokens usage model).
r/LocalLLaMA, "Universal LLM Memory Doesn't Exist" (a builder's report of memory systems running 14 to 77 times more expensive than long context, attributed to the LLM-on-write pattern; reported as practitioner experience, not peer-reviewed).
Voyage AI, embeddings pricing and documentation (the order-of-magnitude basis for "embeddings are cheap": fractions of a cent per thousand tokens).