Chunking
How chunking quietly sets the ceiling on retrieval quality, covering size and overlap, semantic versus fixed splitting, AST-aware code chunks, embedding hygiene, and Anthropic's Contextual Retrieval numbers.
Chunking is the step nobody demos and everybody underestimates. Before a reranker runs, before a single similarity score gets computed, your splitter has already decided what a "unit of memory" is. Cut too coarse and one vector has to stand for five different ideas, so it matches everything weakly and nothing strongly. Cut too fine and the fact you needed gets stranded from the context that made it meaningful. Many retrieval problems that look like embedding problems are really chunking problems.
#Size and overlap
The two knobs you reach for first are chunk size and overlap. Overlap means repeating a tail of one chunk at the head of the next, so a sentence that straddles a boundary survives intact in at least one chunk.
AskLibrary (our chat-with-your-books RAG product) chunks at roughly 500 tokens, which it approximates as 2,000 characters using the four-characters-per-token rule of thumb, with 25% overlap (about 125 tokens, or 500 characters). It uses LangChain's RecursiveCharacterTextSplitter, which tries to break on a paragraph boundary first, then a line, then a space, then a raw character. MemoryPlugin's chat-history index goes much smaller: 256 tokens with 32-token overlap for multi-turn chunks, and zero overlap for single-turn, where each message becomes its own chunk. The gap is deliberate. Books reward larger passages that carry a continuous argument. Chat arrives pre-chopped into short turns, where the useful unit is a message or two.
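To make the size and overlap arithmetic concrete, here is a minimal sketch of a fixed character window with overlap. It is not LangChain's splitter and not AskLibrary's code; the function name `chunk_with_overlap` and the constants are illustrative assumptions built from the numbers quoted above.

```python
def chunk_with_overlap(text: str, chunk_chars: int, overlap_chars: int) -> list[str]:
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    step = chunk_chars - overlap_chars  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


TOKENS_PER_CHUNK = 500
CHARS_PER_TOKEN = 4        # rule of thumb, not a real tokenizer
OVERLAP_FRACTION = 0.25

chunk_chars = TOKENS_PER_CHUNK * CHARS_PER_TOKEN       # 2000 characters
overlap_chars = int(chunk_chars * OVERLAP_FRACTION)    # 500 characters, about 125 tokens

sample = "".join(str(i % 10) for i in range(5000))     # stand-in for book text
chunks = chunk_with_overlap(sample, chunk_chars, overlap_chars)
```

With these settings the window advances 1,500 characters at a time, so the last 500 characters of one chunk reappear at the start of the next.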
supermemory's published guidance lines up with this range: 256 to 512 tokens when you want precise citations, 512 to 1024 for question answering, 1024 to 2048 for long-form analysis. The tradeoff is mechanical, not mysterious. Smaller chunks give you more memories per document and sharper retrieval, but each result carries less surrounding context. Bigger chunks carry context but blur, because the embedding becomes an average of several topics and stops being a crisp match for any one of them.
#Fixed versus semantic, and code that needs an AST
A fixed splitter cuts every N characters regardless of meaning. It's fast, and it routinely slices through the middle of a sentence or a table. The recursive splitter is the cheap improvement most people actually ship: it prefers natural boundaries (blank line, then newline, then space) and only falls back to a hard character cut when nothing better is available. Semantic chunking goes one step further and splits on logical structure, like markdown headings, article sections, and paragraph breaks.
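The recursive idea fits in a few lines. The sketch below is a simplified illustration, not LangChain's implementation: it has no overlap handling, and the name `recursive_split` and the separator list are assumptions chosen to mirror the boundary order described above.

```python
SEPARATORS = ["\n\n", "\n", " "]  # blank line, then newline, then space


def recursive_split(text: str, max_chars: int, separators=SEPARATORS) -> list[str]:
    """Prefer coarse natural boundaries; hard-cut only when no separator is left."""
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_chars:
                current = candidate            # keep packing pieces into this chunk
                continue
            if current:
                chunks.append(current)         # flush what fits so far
            if len(piece) <= max_chars:
                current = piece
            else:
                # This piece alone is too big: retry with finer separators only.
                chunks.extend(recursive_split(piece, max_chars, separators[i + 1:]))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator present at all: fall back to a raw character cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


text = (
    "Para one is short.\n\n"
    "Para two is a bit longer and has more words in it.\n\n"
    "Three."
)
pieces = recursive_split(text, max_chars=40)
```

In this example the second paragraph is too long for one chunk, so it is split between words rather than mid-word, while the short paragraphs stay whole.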
Code breaks all of these heuristics. Splitting source by token count will sever a function from its signature or orphan a closing brace, and the resulting vector is close to meaningless. AST-aware chunking parses the code first, then keeps each function or method intact, chunks a class by its methods, groups imports together, and attaches a comment to the code it documents. supermemory open-sourced a library for exactly this (code-chunk), so a search for "authentication middleware" returns the whole function instead of a random slice through it.
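As a rough illustration of the same idea for Python source, the sketch below uses the standard-library `ast` module. It is not the code-chunk library or its API; the function `ast_chunks` and the chunk dictionary shape are assumptions, and it only handles top-level functions, class methods, and top-level imports.

```python
import ast


def ast_chunks(source: str) -> list[dict]:
    """Split Python source into import, function, and method chunks."""
    tree = ast.parse(source)
    lines = source.splitlines()

    def start_line(node) -> int:
        # Begin at the first decorator, then pull in comment lines directly above.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        while start > 1 and lines[start - 2].lstrip().startswith("#"):
            start -= 1
        return start

    def segment(first: int, last: int) -> str:
        return "\n".join(lines[first - 1:last])

    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []
    imports = [n for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))]
    if imports:
        text = "\n".join(segment(n.lineno, n.end_lineno) for n in imports)
        chunks.append({"kind": "imports", "name": None, "text": text})
    for node in tree.body:
        if isinstance(node, funcs):
            chunks.append({"kind": "function", "name": node.name,
                           "text": segment(start_line(node), node.end_lineno)})
        elif isinstance(node, ast.ClassDef):
            for item in node.body:  # chunk a class by its methods
                if isinstance(item, funcs):
                    chunks.append({"kind": "method", "name": f"{node.name}.{item.name}",
                                   "text": segment(start_line(item), item.end_lineno)})
    return chunks


SOURCE = '''import os
from pathlib import Path

# Checks the bearer token on every request.
def auth_middleware(request):
    token = request.get("token")
    return token == os.environ.get("API_TOKEN")

class Store:
    def load(self, path):
        return Path(path).read_text()

    def save(self, path, data):
        Path(path).write_text(data)
'''

code_chunks = ast_chunks(SOURCE)
```

Each chunk here is a whole function or method with its leading comment attached, which is the unit you want a query like "authentication middleware" to land on.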
#Embedding hygiene
Decide what text actually goes into the embedding, separately from what you store. A chat message has a role and a timestamp. It's tempting to embed "user (2026-05-15 14:02): let's switch to Puma" verbatim. Don't. The role label and the timestamp are scaffolding. Embedded, they add noise that pulls the vector toward every other message carrying the same kind of prefix. MemoryPlugin strips role and timestamp prefixes out of the embedded text and keeps them as structured columns and vector-database scalar fields, used for filtering and display rather than matching. Its own build notes give the plain rationale: removing the prefixes reduces embedding noise and improves retrieval. AskLibrary makes a related split a different way, keeping the vectors in one store and the citeable text and metadata in another, then rehydrating the text at query time.
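A minimal sketch of that separation follows. The prefix format is an assumption taken from the example message above, and `MessageRecord` and `split_for_embedding` are hypothetical names, not MemoryPlugin's actual schema.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed prefix shape: "<role> (<timestamp>): <message text>"
PREFIX = re.compile(r"^(?P<role>user|assistant|system)\s*\((?P<ts>[^)]*)\):\s*")


@dataclass
class MessageRecord:
    role: Optional[str]        # stored as a filterable field, never embedded
    timestamp: Optional[str]   # stored as a filterable field, never embedded
    embed_text: str            # the only text sent to the embedding model


def split_for_embedding(raw: str) -> MessageRecord:
    """Separate scaffolding (role, timestamp) from the text worth embedding."""
    match = PREFIX.match(raw)
    if match is None:
        return MessageRecord(None, None, raw.strip())
    return MessageRecord(match.group("role"), match.group("ts"), raw[match.end():].strip())


record = split_for_embedding("user (2026-05-15 14:02): let's switch to Puma")
```

Here `record.embed_text` holds only the message body, while the role and timestamp remain available for filtering and display.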
#Contextual Retrieval
A chunk that reads "the company's revenue grew 3% over the previous quarter" is nearly useless on its own. Which company? Which quarter? The embedding has no way to know, so it won't reliably match a query about "ACME's Q2 2023 performance." Anthropic's Contextual Retrieval fixes this by using a cheap model to write a short, chunk-specific context (usually 50 to 100 tokens) that situates the chunk inside its document, then prepending that context before the chunk is embedded and before it is added to the keyword index.
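The prepend step itself is small. In the sketch below, `write_context` stands in for the model call that writes the situating context; it is a hypothetical callable, not Anthropic's API, and the canned string returned by `fake_context_writer` is illustrative only.

```python
from typing import Callable


def contextualize(document: str, chunk: str,
                  write_context: Callable[[str, str], str]) -> str:
    """Return the text to embed and to index for keywords: context first, then the chunk."""
    context = write_context(document, chunk).strip()
    return f"{context}\n\n{chunk}" if context else chunk


def fake_context_writer(document: str, chunk: str) -> str:
    # Stand-in for a cheap LLM call that sees the whole document plus the chunk.
    return "This chunk is from ACME's Q2 2023 report."


document = "(full report text would go here)"
chunk = "The company's revenue grew 3% over the previous quarter."
contextual_chunk = contextualize(document, chunk, fake_context_writer)
```

The same `contextual_chunk` string would feed both the embedding model and the keyword index, so the two views of the chunk stay consistent.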
The measured results are the reason to care, and they're among the most actionable in this guide. On a top-20 retrieval task with a 5.7% baseline failure rate:
- Contextual embeddings alone bring it to 3.7%, a 35% reduction.
- Adding a contextual BM25 keyword index, fused by reciprocal rank, reaches 2.9%, a 49% reduction.
- Adding a reranker on top reaches 1.9%, a 67% reduction.
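The percentages above are relative reductions against the 5.7% baseline, not percentage-point drops. A quick check of the arithmetic, using only the failure rates quoted in the list:

```python
BASELINE = 5.7  # top-20 retrieval failure rate, in percent


def relative_reduction(failure_rate: float, baseline: float = BASELINE) -> int:
    """Percent of the baseline failures that were eliminated, rounded to a whole number."""
    return round((baseline - failure_rate) / baseline * 100)


reductions = {
    "contextual embeddings": relative_reduction(3.7),
    "plus contextual BM25": relative_reduction(2.9),
    "plus reranker": relative_reduction(1.9),
}
```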
The cost stays modest because the context-writing call runs through prompt caching. Anthropic reports roughly $1.02 per million document tokens, using Claude 3 Haiku to generate the context, with the reranking pipeline fetching the top 150 candidates and cutting down to 20. The same write-up draws an honest boundary worth repeating: under about 200K tokens (roughly 500 pages), skip retrieval entirely and put the whole corpus in the prompt. Contextual Retrieval earns its complexity only above that scale.
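For the fusion step, a common choice is reciprocal rank fusion, sketched below. The constant `k = 60` is the conventional default from the RRF literature and is an assumption here, not a value taken from Anthropic's write-up; `rerank` is a hypothetical callable standing in for whatever reranker you use.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids: each list adds 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def retrieve(rankings: list[list[str]], rerank: Callable[[list[str]], list[str]],
             candidates: int = 150, final_k: int = 20) -> list[str]:
    """Fuse, keep a wide shortlist, then let the reranker pick the final few."""
    shortlist = reciprocal_rank_fusion(rankings)[:candidates]
    return rerank(shortlist)[:final_k]


embedding_hits = ["c3", "c1", "c7", "c2"]   # ranked by vector similarity
bm25_hits = ["c1", "c9", "c3", "c4"]        # ranked by keyword score
fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
```

Chunks that rank well in both lists (here `c1` and `c3`) rise to the top, which is the point of running the keyword index alongside the embeddings.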
#Provenance: a chunk has to point home
The moment you cite a source ("Title, pages 12 to 13"), you need a reliable map from the chunk back to exact pages or messages. AskLibrary builds this by concatenating a book's content pages into one long string, recording each page's character span, then tagging every chunk with the page numbers whose spans overlap it. That is how a single chunk can cite a multi-page range. There's a genuine footgun in the approach: it locates each chunk's span by searching for the first occurrence of the chunk text, so if identical text repeats in the book, the page mapping can point at the wrong copy. The general lesson is to carry explicit character offsets through chunking rather than recovering them afterward with a string search.
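Here is a minimal sketch of the offset-carrying approach, with a repeated-text case that shows the footgun. It is not AskLibrary's code; the function names and the tuple layout are assumptions made for illustration.

```python
def build_page_spans(pages: dict[int, str], joiner: str = "\n"):
    """Concatenate pages and record each page's (page_no, start, end) character span."""
    spans, parts, cursor = [], [], 0
    for page_no in sorted(pages):
        text = pages[page_no]
        spans.append((page_no, cursor, cursor + len(text)))
        parts.append(text)
        cursor += len(text) + len(joiner)
    return joiner.join(parts), spans


def chunk_with_offsets(text: str, chunk_chars: int, overlap_chars: int):
    """Yield (start, end, chunk_text) so offsets never need to be recovered later."""
    step = chunk_chars - overlap_chars
    out = []
    for start in range(0, len(text), step):
        end = min(start + chunk_chars, len(text))
        out.append((start, end, text[start:end]))
        if end == len(text):
            break
    return out


def pages_for(start: int, end: int, spans) -> list[int]:
    """Pages whose character span overlaps the half-open range [start, end)."""
    return [page for page, s, e in spans if s < end and start < e]


# Two pages with identical text: the worst case for a first-occurrence search.
pages = {12: "alpha " * 10, 13: "alpha " * 10}
full_text, spans = build_page_spans(pages)
chunks = chunk_with_offsets(full_text, chunk_chars=50, overlap_chars=10)

last_start, last_end, last_text = chunks[-1]
by_offset = pages_for(last_start, last_end, spans)            # uses the carried offsets
found_at = full_text.find(last_text)                          # first occurrence only
by_search = pages_for(found_at, found_at + len(last_text), spans)
```

In this example the last chunk lies entirely on page 13, but the string search finds the same text earlier on page 12 and cites the wrong page.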
Conversation trees add their own trap. A chat with regenerated answers and tool-call branches is not linear. Chunk across branches and you mix alternate replies into one vector, producing what our build notes call "muddy vectors" that recall worse. The fix is to chunk along a single coherent root-to-leaf path and attach path-aware metadata (conversation id, path id, turn range, message ids) so any slice can be highlighted back in the original thread.
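A sketch of path-wise chunking, assuming the conversation is available as a parent-to-children map of message ids. The metadata keys follow the list above; the function names, the `turns_per_chunk` parameter, and the sample messages are hypothetical.

```python
def root_to_leaf_paths(children: dict[str, list[str]], root: str) -> list[list[str]]:
    """Enumerate every root-to-leaf path through the message tree (depth-first)."""
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            paths.append(path)          # reached a leaf: this is one coherent thread
        for kid in reversed(kids):      # reversed so the first child is explored first
            stack.append(path + [kid])
    return paths


def chunk_path(conversation_id: str, path_id: int, path: list[str],
               messages: dict[str, str], turns_per_chunk: int = 2) -> list[dict]:
    """Chunk along one path only, attaching path-aware metadata to each chunk."""
    chunks = []
    for i in range(0, len(path), turns_per_chunk):
        ids = path[i:i + turns_per_chunk]
        chunks.append({
            "conversation_id": conversation_id,
            "path_id": path_id,
            "turn_range": (i, i + len(ids) - 1),
            "message_ids": ids,
            "text": "\n".join(messages[m] for m in ids),
        })
    return chunks


# m1 has two alternate replies (a regeneration); only m2a was continued.
children = {"m1": ["m2a", "m2b"], "m2a": ["m3"]}
messages = {"m1": "Which server?", "m2a": "Use Puma.", "m2b": "Use Unicorn.", "m3": "OK, Puma."}

paths = root_to_leaf_paths(children, "m1")
all_chunks = [c for pid, p in enumerate(paths) for c in chunk_path("conv-1", pid, p, messages)]
```

Because chunks are built per path, the alternate replies `m2a` and `m2b` never land in the same vector. The shared prefix `m1` does appear once per path, which is a tradeoff you may want to deduplicate at storage time.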
#Stop guessing, measure
None of these choices has a universal answer, so treat chunking as an experiment rather than a constant. MemoryPlugin runs the same conversations through two strategies side by side: multi-turn (256 tokens, 32-token overlap) and single-turn (one chunk per message), tagged with a chunking_mode field on a single vector collection and filtered at the vector layer so a search never mixes the two. That harness is how it surfaced a non-obvious cost: a single-turn chunk is often one short message ("yes, let's do that") with no rerankable signal, which starves the reranker downstream. You only catch a regression like that by comparing modes on real queries, not by reasoning about it in the abstract.
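A toy version of such a harness is sketched below. Only the `chunking_mode` field comes from the description above; the mode values, the in-memory index, the word-overlap scorer (a stand-in for real vector search), and the labeled queries are all assumptions for illustration.

```python
def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: shared lowercase words. Stand-in for vector similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def search(index: list[dict], query: str, mode: str, k: int = 3) -> list[dict]:
    """Filter by chunking_mode first, so results never mix the two strategies."""
    pool = [r for r in index if r["chunking_mode"] == mode]
    return sorted(pool, key=lambda r: word_overlap(query, r["text"]), reverse=True)[:k]


def hit_rate(index: list[dict], labeled_queries: list[tuple[str, str]],
             mode: str, k: int = 3) -> float:
    """Fraction of queries whose expected source shows up in the top k for this mode."""
    hits = 0
    for query, expected_source in labeled_queries:
        results = search(index, query, mode, k)
        hits += any(r["source_id"] == expected_source for r in results)
    return hits / len(labeled_queries)


index = [
    {"chunking_mode": "multi_turn", "source_id": "conv1",
     "text": "should we switch the web server to puma yes let's do that"},
    {"chunking_mode": "single_turn", "source_id": "conv1",
     "text": "should we switch the web server to puma"},
    {"chunking_mode": "single_turn", "source_id": "conv1", "text": "yes let's do that"},
    {"chunking_mode": "multi_turn", "source_id": "conv2",
     "text": "the deploy failed on friday we rolled back"},
    {"chunking_mode": "single_turn", "source_id": "conv2", "text": "the deploy failed on friday"},
    {"chunking_mode": "single_turn", "source_id": "conv2", "text": "we rolled back"},
]
queries = [("puma web server", "conv1"), ("deploy rolled back", "conv2")]

report = {mode: hit_rate(index, queries, mode, k=1) for mode in ("multi_turn", "single_turn")}
```

On real data you would swap the scorer for your actual retrieval stack and run the same labeled queries through each mode, then compare the numbers rather than the intuitions.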
#References
- Anthropic, "Introducing Contextual Retrieval" (2024). https://www.anthropic.com/news/contextual-retrieval (cookbook: https://platform.claude.com/cookbook/capabilities-contextual-embeddings-guide)
- LangChain, RecursiveCharacterTextSplitter documentation. https://python.langchain.com/docs/how_to/recursive_text_splitter/
- supermemory, code-chunk (AST-aware code chunking, MIT). https://github.com/supermemoryai/code-chunk
- supermemory, chunking and customization docs. https://github.com/supermemoryai/supermemory
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023). https://arxiv.org/abs/2307.03172