AI Memory
III — Architectures · 10 min

Hybrid Retrieval and Rank Fusion

The read path that actually works: dense plus keyword recall, fused by rank instead of raw score, reranked as the cheap last-mile win, and placed where the model will read it.

A user mentions error code TS-999 three sessions ago, then asks about it today. You stored the conversation, but dense vector search embeds the new question, lands near generic chunks about error handling, and never surfaces the exact one: an embedding blurs a rare token like TS-999 into a cloud of similar-looking strings. A plain keyword search nails it instantly. The reverse is true too: ask "what running shoes did I switch to" and keyword search misses the memory that says "my new Pumas." Neither retriever alone is enough, and most of the quality you lose is not in either retriever. It is in how you combine them.

#Two retrievers, because they fail differently

Dense retrieval compares embeddings (see embeddings): it matches meaning and shrugs off exact wording. Sparse retrieval, almost always BM25 (Best Match 25, a lexical ranking function over term frequencies), matches the actual tokens and catches what embeddings smear: error codes, proper nouns, version numbers, identifiers, anything where the literal string is the point. They miss in opposite directions, so running both and merging their hits recovers what either would drop alone.
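To make the lexical side concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not any product's implementation: the whitespace tokenizer, the `k1` and `b` defaults, and the three toy memories are assumptions chosen for illustration. The point is that a rare literal token such as TS-999 scores only the memory that contains it.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace split, which keeps "ts-999" as one token.
    return text.lower().split()

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a high inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "general advice on error handling and retries",
    "the export failed with error code ts-999 yesterday",
    "my new pumas arrived",
]
scores = bm25_scores("what was ts-999", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here only the second memory contains the literal token, so it is the only one with a non-zero score. The same scorer gives "what running shoes did I switch to" nothing to match against "my new pumas arrived", which is the gap dense retrieval covers.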

This is the dominant pattern in production memory. MemoryPlugin runs a BM25 function alongside a dense HNSW index in one vector collection; Graphiti (the engine behind Zep) combines embeddings, BM25, and graph traversal; mem0 fuses a semantic signal with BM25 and an entity signal. The numbers back it up: Anthropic's Contextual Retrieval study measured top-20 retrieval failure dropping from a 5.7% baseline to 3.7% with contextual embeddings alone, and to 2.9% once contextual BM25 was added, before any reranking.

So a real read path is a funnel: cast a wide net with both retrievers, merge the candidates into one list, rerank that list with a more accurate model, then place the survivors carefully in the prompt. Each stage trades a little compute for a lot of precision.

[Diagram: Query → Dense (semantic) and BM25 (keyword; catches exact strings such as TS-999, names, IDs) → candidates from each → RRF fuse → Rerank → Top results]
Figure 1. The hybrid read path as a funnel: recall wide with two retrievers, fuse the lists, rerank, then place the few survivors where the model will read them.

#Fuse by rank, not raw score

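This is where the most expensive bug hides, and it is almost invisible because nothing errors. The bug is adding raw scores from the two retrievers: cosine similarity lives in roughly 0 to 1 while raw BM25 is unbounded and routinely lands in the tens, so a linear sum lets BM25 dominate and the dense ranking barely registers.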

The fix is Reciprocal Rank Fusion (RRF). Throw the scores away and fuse on rank position instead. Each retriever returns an ordered list; a document at rank r contributes 1 / (k + r) to its fused score, summed across every list it appears in (k is a small constant, often 60, that dampens the advantage of the very top ranks). A document ranked high in both lists wins; one that only a single retriever loves still gets credit. Because RRF only looks at position, the cosine-versus-BM25 scale mismatch cannot happen, and adopting it needs no re-embedding. Milvus and most vector stores support rank fusion natively; Graphiti uses RRF as its default reranker.
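The formula is short enough to write out. The sketch below is a minimal, library-free version of RRF; the function name `rrf_fuse` and the two toy rankings are illustrative assumptions, and ranks are 1-based to match the 1 / (k + r) description.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ordered lists of document ids by summing 1 / (k + rank)."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Only the position matters; raw retriever scores never enter.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

dense = ["docA", "docB", "docC"]  # ordered by cosine similarity
bm25 = ["docB", "docD", "docA"]   # ordered by BM25 score
fused = rrf_fuse([dense, bm25])
fused_ids = [doc_id for doc_id, _ in fused]
```

In this toy case docB (ranks 2 and 1) edges out docA (ranks 1 and 3), and both beat docC and docD, which each appear in only one list. Those two still receive a score, so a pure-lexical or pure-semantic hit is not discarded.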

[Diagram: ranked lists on different scales. Dense (cosine, 0..1): 1. docA 0.92, 2. docB 0.55. BM25 (raw, 5..30): 1. docB 22.4, 2. docA 6.10. Linear sum panel: the BM25 scale dominates (docA 0.92 → 0.64, docB 0.55 → 4.59) and the keyword list wins. RRF panel: scale-free merge, score = Σ 1/(k + rank), using ranks rather than raw scores, k = 60, balanced.]
Figure 2. RRF fuses by position, not score. Summing raw cosine and BM25 lets the larger scale dominate; summing one-over-rank contributions is scale-free.

If you would rather keep scores, normalise first. mem0 pushes raw BM25 through a logistic curve whose midpoint scales with query length, since longer queries produce higher totals and a fixed normaliser would drift. Either discipline works; adding raw scores from different scales never does. The same trap reappears across query expansions: dedupe by keeping each document's best score, not the first list's, or a chunk that matched three expansions gets credit for one.
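As a rough illustration of the score-keeping route, the sketch below squashes raw BM25 through a logistic curve whose midpoint grows with the number of query terms, then dedupes across result lists by best score. This is not mem0's actual formula: `per_term_midpoint`, `steepness`, and both function names are hypothetical values and names picked for the example.

```python
import math

def normalise_bm25(raw: float, query_terms: int,
                   per_term_midpoint: float = 2.0, steepness: float = 0.5) -> float:
    # Assumption: the midpoint scales linearly with query length, so a longer
    # query needs a higher raw total to reach the same normalised score.
    midpoint = per_term_midpoint * max(query_terms, 1)
    return 1.0 / (1.0 + math.exp(-steepness * (raw - midpoint)))

def merge_best(result_lists: list[dict[str, float]]) -> dict[str, float]:
    """Dedupe across result lists, keeping each document's best score."""
    best: dict[str, float] = {}
    for results in result_lists:
        for doc_id, score in results.items():
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return best

short_query = normalise_bm25(6.0, query_terms=1)
long_query = normalise_bm25(6.0, query_terms=3)
merged = merge_best([{"a": 0.2, "b": 0.9}, {"a": 0.7}])
```

The same raw total of 6.0 normalises higher for a one-term query than for a three-term one, which is the drift a fixed normaliser would miss. In `merged`, document `a` keeps the 0.7 from the second list rather than the 0.2 it happened to get first.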

#Boost, or expand recall? Pick on purpose

There is a real fork in how keyword matching feeds the funnel. One school treats BM25 as a genuine recall source, returning its own candidate list fused with the dense list as equals (Anthropic, Graphiti, MemoryPlugin). The other treats keyword and entity signals as re-rankers only: mem0 gates candidates on the semantic score first, then lets BM25 and entity matches reorder the survivors, "a boost signal, not a recall expander," and dampens an entity's boost by how many memories it links to so a ubiquitous one does not dominate. Both are defensible: keyword-as-recall catches the pure-lexical hit dense missed but lets in lexical noise; boost-only keeps recall tight but can miss the document only the keyword would find. MemoryPlugin splits the difference with a keyword-coverage boost that nudges results by the fraction of query terms they contain, so a rare named entity (the "lawyer" in "Ayush Sunaina lawyer") is not buried under chunks that are semantically similar but about the wrong person.
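A keyword-coverage boost of the kind described above can be sketched in a few lines. This is an illustrative guess at the shape, not MemoryPlugin's code: the multiplicative form, the `weight` of 0.2, the whitespace tokenizer, and the toy semantic scores are all assumptions.

```python
def keyword_coverage(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear in the text."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return len(terms & words) / len(terms)

def boosted(semantic: float, query: str, text: str, weight: float = 0.2) -> float:
    # The semantic score stays in charge; coverage only nudges it upward.
    return semantic * (1.0 + weight * keyword_coverage(query, text))

query = "Ayush Sunaina lawyer"
right_person = boosted(0.80, query, "ayush sunaina met her lawyer on friday")
wrong_person = boosted(0.84, query, "priya hired a new lawyer")
```

The chunk about the wrong person starts with the higher semantic score, but it covers only one of three query terms, so the full-coverage chunk overtakes it after the boost. Because the boost only reorders candidates that dense retrieval already returned, this sits on the boost side of the fork and does not expand recall.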

#Search for what they said, not what they asked

Expansion is the cheapest recall you can buy: rewrite the query a few ways, search all of them, merge. AskLibrary (a chat-with-your-books RAG product) generates four reformulations with a small model, biased toward the user's actual library titles, then searches the original plus all four.

The subtler lever is intent. MemoryPlugin's chat-history search does not expand toward the answer; it expands toward what the user likely said in a past conversation. For "what should I do this weekend," it generates "had fun on the weekend" and "was a relaxing weekend," not "fun weekend activities," because the goal is recalling past context, and the assistant can find generic information itself.

One rule survives every expansion scheme: always re-add the verbatim original query. Rewriting paraphrases away the rare proper nouns BM25 depends on, so the literal query is your anchor for the lexical side. The cost is linear: every extra query is another fan-out and more latency, which is why supermemory's query rewriting is off by default and documented as adding about 400ms.
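The rule can be enforced in the fan-out itself. In the sketch below, `rewrite` and `search` are hypothetical callables standing in for a small rewriting model and a retriever; the toy versions exist only to show that the literal query is what finds the rare token.

```python
from typing import Callable

def expand_and_search(
    query: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str], list[tuple[str, float]]],
    max_rewrites: int = 4,
) -> tuple[list[str], list[tuple[str, float]]]:
    # The verbatim original always goes first, so rare tokens survive rewriting.
    queries = [query]
    for candidate in rewrite(query)[:max_rewrites]:
        if candidate not in queries:
            queries.append(candidate)
    best: dict[str, float] = {}
    for q in queries:  # one fan-out per query: cost grows linearly
        for doc_id, score in search(q):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return queries, ranked

def toy_rewrite(q: str) -> list[str]:
    # Paraphrases that drop the error code, as rewriting tends to do.
    return ["had an error in the export", "export kept failing"]

def toy_search(q: str) -> list[tuple[str, float]]:
    # Pretend lexical index: only the literal token finds the exact memory.
    return [("m1", 0.9)] if "TS-999" in q else [("m2", 0.4)]

queries, hits = expand_and_search("what was TS-999 about", toy_rewrite, toy_search)
```

Without the original in `queries`, neither paraphrase would reach `m1`. Each extra entry in `queries` is another call to `search`, which is where the added latency comes from.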

#Reranking is the cheap last-mile win

A reranker is a cross-encoder: it reads the query and a candidate together and scores their actual relevance, which is far more accurate than comparing two independently made embeddings. In the Anthropic study, fetching the top 150 candidates and reranking down to 20 took failures from 2.9% to 1.9%, the largest cheap gain in the pipeline. Voyage rerank-2.5, Jina's multilingual v2, and bge-reranker-base (supermemory reports roughly +100ms for it) are all in production use.

There are two ways to waste it. First, starve it: cap recall at a few dozen candidates and the reranker never sees the document it would have promoted, so fetch 150 or more. Second, feed it junk: with one-message chunks, a reranker sees "yes let's do that," which has no rerankable signal. Prepend the conversation title and a neighbouring turn so there is something to judge.
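The second failure is easy to see with a toy example. The sketch below builds the text a reranker would score by prepending the conversation title and the previous turn; the `Chunk` fields, the bracketed layout, and the word-overlap `toy_score` are assumptions standing in for a real cross-encoder.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:
    title: str          # conversation title
    previous_turn: str  # neighbouring message for context
    text: str           # the one-message chunk itself

def rerank_input(chunk: Chunk) -> str:
    # Give the reranker something to judge beyond the bare message.
    return f"[{chunk.title}] {chunk.previous_turn}\n{chunk.text}"

def rerank(query: str, candidates: list[Chunk],
           score: Callable[[str, str], float], keep: int = 20) -> list[Chunk]:
    scored = [(score(query, rerank_input(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]

def toy_score(query: str, passage: str) -> float:
    # Stand-in for a cross-encoder: count of shared lowercase tokens.
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

candidates = [
    Chunk("Grocery list", "What do we need?", "yes let's do that"),
    Chunk("Weekend trip to Lisbon", "Should we book the Lisbon flights?", "yes let's do that"),
]
query = "did we book lisbon flights"
top = rerank(query, candidates, toy_score, keep=1)
bare_scores = [toy_score(query, c.text) for c in candidates]
```

Scored on the bare message, both candidates are identical and tie at zero. With the title and neighbouring turn attached, the Lisbon conversation is clearly the better match. In a real pipeline `candidates` would hold the 150 or more fused hits and `score` would call the reranking model.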

When your memory is a graph, the reranker can use structure vectors cannot see. Graphiti offers node-distance reranking (facts closer to the entity you care about rank higher), episode-mention frequency (mentioned more often means more important), a cross-encoder for precision, and Maximal Marginal Relevance (MMR, default lambda 0.5) to suppress near-duplicate hits and keep the top results diverse.
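MMR is the one item in that list that is simple to sketch without a graph. The version below is a generic greedy MMR, not Graphiti's implementation; `mmr_select`, the similarity values, and the `toy_pair_sim` function are illustrative assumptions, with `lam` set to the 0.5 default mentioned above.

```python
from typing import Callable

def mmr_select(query_sim: dict[str, float],
               pair_sim: Callable[[str, str], float],
               k: int, lam: float = 0.5) -> list[str]:
    """Greedy MMR: trade relevance to the query against redundancy."""
    selected: list[str] = []
    remaining = sorted(query_sim)  # sorted for deterministic tie-breaking
    while remaining and len(selected) < k:
        def score(doc_id: str) -> float:
            # Redundancy is the closest match among items already picked.
            redundancy = max((pair_sim(doc_id, s) for s in selected), default=0.0)
            return lam * query_sim[doc_id] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def toy_pair_sim(a: str, b: str) -> float:
    # Two near-duplicate facts, everything else mostly unrelated.
    return 0.95 if {a, b} == {"fact_a", "fact_a_dup"} else 0.1

picked = mmr_select(
    {"fact_a": 0.9, "fact_a_dup": 0.88, "fact_b": 0.6},
    toy_pair_sim,
    k=2,
)
```

The near-duplicate has the second-highest relevance, but once `fact_a` is chosen its redundancy penalty pushes it below `fact_b`, so the two slots go to different facts.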

#Place the winners where the model reads them

Winning the ranking is not the end. Models read the start and end of a long context far better than the middle, the U-shaped "Lost in the Middle" curve (see context vs memory), so burying a relevant fact mid-prompt can score worse than omitting it. Put your best memories at the top and bottom of the injected block, keep the set small, and resist dumping every candidate in: more context is not better when the good stuff lands in the dead zone.
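One simple way to act on this is to alternate ranked memories between the front and back of the injected block so the weakest land in the middle. The sketch below is one possible ordering under that assumption, not a prescribed layout; `place_at_edges` is a hypothetical helper name.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder best-first items so the strongest sit at the start and end."""
    front: list[str] = []
    back: list[str] = []
    for index, item in enumerate(ranked):
        # Even positions fill from the top, odd positions from the bottom.
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]

ordered = place_at_edges(["m1", "m2", "m3", "m4", "m5"])
```

With five memories ranked m1 (best) to m5, the block reads m1, m3, m5, m4, m2: the best opens it, the second best closes it, and the weakest sits in the middle. Keeping the input list short matters more than the exact interleaving.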

#References

  • Anthropic (2024). "Introducing Contextual Retrieval" (hybrid dense plus BM25 fused with reciprocal rank fusion, plus reranking; the 5.7% to 1.9% failure-rate ladder). anthropic.com/news/contextual-retrieval.
  • Liu, N. F. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.
  • Cormack, G. V., Clarke, C. L. A., and Buettcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods." (origin of RRF). doi.org/10.1145/1571941.1572114.
  • Graphiti (Zep). Hybrid search with RRF, MMR, node-distance, episode-mention, and cross-encoder rerankers. github.com/getzep/graphiti.
  • mem0. Multi-signal scoring with adaptive BM25 normalisation and entity boosting (BM25 as a re-rank signal, not a recall expander). github.com/mem0ai/mem0.
  • Robertson, S. and Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." doi.org/10.1561/1500000019.