RAG, Properly Explained

Retrieval-augmented generation as the foundation memory builds on: parametric vs non-parametric memory, the embed-store-retrieve-generate loop, a worked chat-with-your-books example, and Self-RAG's lesson about retrieving only when you need to.

A model's facts are fixed the day its training ends. Ask it about something that happened last week, or about a document only you have, and it either guesses or tells you it does not know. You could retrain it on the new facts, but that is slow and expensive, and you would have to do it again tomorrow when the facts change. Retrieval-augmented generation (RAG) is the practical answer: keep the knowledge outside the model, look up the relevant pieces at question time, and pass them into the prompt.

#Parametric vs non-parametric memory

Lewis et al. (2020) named the split the whole field now takes for granted. A model has parametric memory: everything baked into its weights during training, frozen, and costly to change. RAG adds non-parametric memory: an external store you can read, edit, and replace without touching a single weight. In the original paper that store was a dense vector index of Wikipedia, queried by a neural retriever (DPR) and feeding a BART generator. They tried two couplings: RAG-Sequence (the same retrieved passage conditions the whole answer) and RAG-Token (different passages can inform different tokens). The retriever's query encoder and the generator were fine-tuned together, with the document encoder and index held fixed, and the retrieved passage was treated as a latent variable marginalized over the top-K passages.
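For reference, the two couplings can be written out roughly as follows. The notation is an assumption chosen to mirror the paper: x is the input, y the output of length N, z a retrieved passage, p_η the retriever, and p_θ the generator.

```latex
% RAG-Sequence: one passage conditions the whole output; sum over passages last.
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: sum over passages separately for every output token.
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N}
  \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```

The only difference is where the sum over passages sits: outside the product over tokens for RAG-Sequence, inside it for RAG-Token.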

The architecture is not the lasting part. The principle is: do not fine-tune facts into weights, retrieve them at inference. Because non-parametric memory is editable, you can correct a wrong fact, add yesterday's meeting notes, or delete something stale, all by writing to a database, with no retraining. That property, updatable knowledge without retraining, is why nearly every AI memory system is a descendant of RAG. Long-term memory in an LLM app is almost always an external store the model reads from, not something it has memorised.

#The loop

Every RAG system, from the 2020 paper to today's memory products, runs the same five steps. Embed a corpus into vectors. Store those vectors. At question time, embed the query and retrieve the top-K most similar pieces. Stuff them into the prompt. Generate an answer grounded in what you pulled. Everything else (chunking strategy, hybrid search, reranking, conflict resolution) is a refinement on this skeleton.
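The five steps can be sketched end to end in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, `VectorStore` is an in-memory list standing in for a vector database, and `generate` is a placeholder for the LLM call. All names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.rows.append((embed(text), text))  # steps 1-2: embed, store

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)  # step 3: embed the query, rank by similarity
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    # Step 4: stuff the retrieved passages into the prompt.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"


def generate(prompt: str) -> str:
    # Step 5: placeholder for the LLM call.
    return f"(model output for a {len(prompt)}-character prompt)"


store = VectorStore()
for doc in [
    "The library opens at nine on weekdays.",
    "Returns are accepted within thirty days.",
    "The reading room closes at six.",
]:
    store.add(doc)

question = "When does the library open?"
passages = store.search(question, k=2)
prompt = build_prompt(question, passages)
answer = generate(prompt)
```

Adding or correcting a fact is just another `store.add` (or a row edit); nothing about the model changes.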

[Figure: two-lane pipeline diagram. Indexing (offline): Documents → Chunk → Embed → Vector store (embeddings + metadata). Query (online): Query → Embed → Retrieve top-k (similarity search) → Rerank → LLM → Answer.]
Figure 1. The canonical RAG loop: embed and store a corpus, then at query time retrieve the top-K, rerank, and generate a cited answer. AskLibrary's stages are shown as the worked example.

#AskLibrary: a worked example

It helps to walk through a real one. AskLibrary (a chat-with-your-books RAG product) lets you ask questions across your personal library and get answers cited back to the exact page. The pipeline:

  • Ingest. A book arrives as a PDF, or as a webpage or document converted into one. A cheap per-page model classifies pages as real content versus front and back matter (title page, table of contents, index) so the junk never gets indexed.
  • Chunk. The cleaned text is split into roughly 500-token chunks with about 25% overlap, using a recursive splitter that prefers to break on a paragraph, then a line, then a space. Each chunk records the page numbers it spans, so a citation can point to "pages 12, 13" even when a passage crosses a page break. Chunking quietly decides retrieval quality; it gets its own page in chunking.
  • Embed. Each chunk becomes a 1536-dimension vector via OpenAI's text-embedding-3-small. The vectors live in a vector database; the actual text and citation metadata live in a separate document store and are rehydrated after the search returns. More on why in embeddings.
  • Retrieve. The question is expanded into a few alternative phrasings (a small model writes four "perspectives," biased toward the titles in the user's library so the rewrites stay on topic), and the original plus the rewrites fan out into the index, pulling tens of candidate chunks. Re-including the original query matters: rewrites can quietly drop a rare name the user actually typed.
  • Rerank. A cross-encoder reranker (Jina's multilingual reranker) reorders that candidate pool and keeps the top two dozen. Reranking is the cheap last-mile precision win: the first-stage retriever is fast but blurs fine distinctions, and the reranker fixes the ordering before anything reaches the generator.
  • Cite. The answer is generated from the packed chunks, and the interface shows each source as "Title (pages 12, 13, 40)," clickable straight to that page in a viewer. The citation is the product. A confident answer you cannot trace is worse than a hedge you can.
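The Chunk step is the easiest one to get subtly wrong, so here is a minimal sketch of overlap plus page tracking. It is not AskLibrary's code. It assumes whitespace-separated words stand in for tokens, and the names `split_units`, `chunk_pages`, and `Chunk` are hypothetical. Joining units with a space also flattens paragraph breaks, which a production splitter would preserve.

```python
from dataclasses import dataclass

SEPARATORS = ("\n\n", "\n", " ")  # paragraph, then line, then space


def n_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_units(text: str, max_tokens: int, seps=SEPARATORS) -> list[str]:
    """Split recursively, trying the coarsest separator first, until each unit fits."""
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens or not seps:
        return [text]
    units: list[str] = []
    for part in text.split(seps[0]):
        units.extend(split_units(part, max_tokens, seps[1:]))
    return units


@dataclass
class Chunk:
    text: str
    pages: list[int]  # every page the chunk touches, for citations


def _emit(units: list[tuple[str, int]]) -> Chunk:
    return Chunk(
        text=" ".join(u for u, _ in units),
        pages=sorted({p for _, p in units}),
    )


def chunk_pages(pages: list[str], chunk_tokens: int = 500, overlap_ratio: float = 0.25) -> list[Chunk]:
    overlap = int(chunk_tokens * overlap_ratio)
    # Units are kept no larger than the overlap so a whole unit can be carried over.
    units = [
        (unit, page_no)
        for page_no, page_text in enumerate(pages, start=1)
        for unit in split_units(page_text, max(overlap, 1))
    ]
    chunks: list[Chunk] = []
    current: list[tuple[str, int]] = []
    size = 0
    for unit, page_no in units:
        t = n_tokens(unit)
        if current and size + t > chunk_tokens:
            chunks.append(_emit(current))
            # Start the next chunk with trailing units worth up to `overlap` tokens.
            carried: list[tuple[str, int]] = []
            carried_size = 0
            for u, p in reversed(current):
                if carried_size + n_tokens(u) > overlap:
                    break
                carried.insert(0, (u, p))
                carried_size += n_tokens(u)
            current, size = carried, carried_size
        current.append((unit, page_no))
        size += t
    if current:
        chunks.append(_emit(current))
    return chunks


pages = [
    "a0 a1 a2 a3 a4 a5",  # page 1
    "b0 b1 b2 b3 b4 b5",  # page 2
]
demo = chunk_pages(pages, chunk_tokens=8, overlap_ratio=0.25)
for c in demo:
    print(c.pages, c.text)
```

With these tiny settings the first chunk crosses the page break and records pages 1 and 2, and the second chunk starts with the two-token overlap `b0 b1`. That is the same mechanism that lets a citation read "pages 12, 13".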

One honest detail. AskLibrary has a full hybrid (dense plus keyword) search path built and tested, but production chat runs pure dense retrieval followed by reranking, because that was good enough for this corpus. Hybrid search is not free, and "we built it and left it off" is a legitimate outcome. When you do need exact-string matching (error codes, identifiers), see hybrid retrieval.
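The production path described here, dense fan-out followed by reranking, might look like the sketch below. Everything in it is a stand-in: `toy_rewrite` replaces the small model that writes alternative phrasings, `toy_search` replaces the vector index, and `toy_score` replaces the cross-encoder. The default `keep=24` mirrors the "top two dozen" above; the other numbers are illustrative.

```python
import re

CORPUS = {
    "c1": "Ged sails to the island of Roke to study.",
    "c2": "A young wizard travels to a school of magic.",
    "c3": "The index lists every chapter by page.",
}


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_search(query: str, k: int) -> list[tuple[str, str]]:
    # Stand-in for dense retrieval: rank chunks by shared words with the query.
    q = tokens(query)
    scored = [(len(q & tokens(text)), cid, text) for cid, text in CORPUS.items()]
    ranked = sorted(scored, key=lambda row: row[0], reverse=True)
    return [(cid, text) for score, cid, text in ranked if score > 0][:k]


def toy_rewrite(question: str) -> list[str]:
    # Stand-in for the rewriting model. Note that it drops the rare name "Ged".
    return ["Where does the young wizard study magic?"]


def toy_score(question: str, text: str) -> int:
    # Stand-in for a cross-encoder relevance score.
    return len(tokens(question) & tokens(text))


def fan_out_retrieve(question, rewrite, search, per_query_k=10):
    """Search with the original question plus its rewrites, then de-duplicate."""
    queries = [question] + [q for q in rewrite(question) if q != question]
    seen, pool = set(), []
    for q in queries:
        for chunk_id, text in search(q, per_query_k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                pool.append((chunk_id, text))
    return pool


def rerank(question, pool, score, keep=24):
    """Reorder the candidate pool by relevance to the original question."""
    ranked = sorted(pool, key=lambda c: score(question, c[1]), reverse=True)
    return ranked[:keep]


question = "Where does Ged study?"
pool = fan_out_retrieve(question, toy_rewrite, toy_search, per_query_k=1)
top = rerank(question, pool, toy_score, keep=1)
```

In this toy, the rewrite alone surfaces only `c2`; the chunk that actually names Ged (`c1`) enters the pool because the original question is searched too, and the reranker then puts it first.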

#Retrieve only when you need it, and check it landed

Vanilla RAG retrieves on every turn whether the question needs it or not. Asai et al. (2023), in Self-RAG, showed why that is wasteful and sometimes harmful: always retrieving burns tokens and can inject irrelevant passages that pull the answer off course. Self-RAG trains a model to emit small control signals inline: whether to retrieve at all, whether a retrieved passage is relevant, and whether a generated claim is actually supported by it. Two lessons hold regardless of whether you adopt their exact method. First, retrieval should be conditional: gate it on whether the turn needs outside knowledge. Second, verify grounding: check that the answer is supported by what you retrieved rather than invented around it. In production these become a cheap "do I need to retrieve for this?" gate and an "is this grounded?" pass, and both directly cut hallucination.
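A rough sketch of those two production checks follows. It is not Self-RAG's method, which trains the model to emit the signals itself. Here the gate is a hypothetical small-talk filter and the grounding pass is a crude word-overlap test; a real system would more likely use a small classifier or an LLM judge for both. The names, stopword list, and threshold are assumptions.

```python
import re

SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "bye"}
STOPWORDS = {"the", "a", "an", "is", "was", "it", "in", "at", "on", "of", "to", "and"}


def needs_retrieval(turn: str) -> bool:
    # Gate: skip retrieval when the turn is plain small talk.
    normalized = re.sub(r"[^a-z ]", "", turn.lower()).strip()
    return normalized not in SMALL_TALK


def content_tokens(text: str) -> set[str]:
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def unsupported_sentences(answer: str, passages: list[str], threshold: float = 0.6) -> list[str]:
    """Return answer sentences whose content words are mostly absent from the passages."""
    evidence = set().union(*(content_tokens(p) for p in passages))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        toks = content_tokens(sentence)
        if not toks:
            continue
        support = len(toks & evidence) / len(toks)
        if support < threshold:
            flagged.append(sentence)
    return flagged


def answer_turn(turn, retrieve, generate):
    # Retrieve only when the gate says so, then check what came back.
    passages = retrieve(turn) if needs_retrieval(turn) else []
    answer = generate(turn, passages)
    flagged = unsupported_sentences(answer, passages) if passages else []
    return answer, flagged


passages = ["The library opens at nine on weekdays."]
draft = "The library opens at nine. It was founded in 1850."
flagged = unsupported_sentences(draft, passages)
```

Here `flagged` contains only the second sentence, because nothing in the retrieved passage mentions a founding date. What to do with a flagged sentence (drop it, hedge it, or retrieve again) is a product decision this sketch leaves open.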

This is also the seam where RAG ends and memory begins. RAG answers "what does my corpus say?" Pulling the highest-similarity chunk is enough for that. Memory has to answer "what is true about you, now?" and that requires tracking change over time, not just similarity. That distinction gets its own page in RAG vs memory.

#References