
Hierarchical Memory

Summaries-first memory that keeps cheap overviews resident in the prompt and the full detail one fetch away, covering RAPTOR's recursive summary tree, MemGPT's tiered blocks, and a production load-on-demand category index.

You have ten thousand memories, or a five-hundred-page document, and a context window that fits a small fraction of either. You get two bad options. Stuff in everything you can and pay for tokens you mostly will not use, while the model loses the relevant fact in the middle of the pile (the Lost in the Middle effect: models use a fact buried mid-context less reliably than one near the start or end). Or retrieve a handful of chunks by similarity and hope the answer lives in one of them, which fails the moment the question is broad ("what are the themes here?") rather than narrow.

Hierarchical memory takes a third path. Keep cheap summaries resident in the prompt at all times, and keep the expensive detail one fetch away. The model reads an overview and pulls the raw content only when it decides it needs to. Three designs implement this idea at different scopes: a recursive summary tree over a corpus (RAPTOR), a temperature-tiered working set for an agent (MemGPT/Letta), and a category index over a user's accumulated memories that loads on demand (one production system).

#RAPTOR: a tree of summaries over a corpus

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds the hierarchy bottom-up at index time. Start with leaf chunks of the source text and embed them. Reduce those embeddings to a low dimension with UMAP, then soft-cluster them with a Gaussian Mixture Model, where a single chunk can belong to more than one cluster and the Bayesian Information Criterion picks how many clusters to form. An LLM writes one summary per cluster, that summary becomes a parent node, and the process repeats on the parents. You end up with a tree whose leaves are raw passages and whose upper nodes are progressively broader summaries.
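The bottom-up loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RAPTOR's code: `Node`, `build_tree`, `cluster_adjacent`, and `summarize_stub` are hypothetical names, the clustering step is a plain positional grouping standing in for UMAP plus a soft GMM with BIC, and the summariser is a string join standing in for an LLM call. Only the recursion (cluster, summarise, promote to parent, repeat) is the point.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    text: str
    level: int  # 0 = raw leaf chunk; higher levels are broader summaries
    children: list["Node"] = field(default_factory=list)


def build_tree(
    chunks: list[str],
    cluster: Callable[[list[Node]], list[list[Node]]],
    summarize: Callable[[list[str]], str],
    max_levels: int = 5,  # cap on the number of summary layers
) -> list[list[Node]]:
    """Build summary layers bottom-up and return them, leaves first."""
    layer = [Node(text=c, level=0) for c in chunks]
    levels = [layer]
    while len(layer) > 1 and len(levels) <= max_levels:
        groups = cluster(layer)
        # Stop if clustering no longer shrinks the layer, so the loop terminates.
        if not groups or len(groups) >= len(layer):
            break
        layer = [
            Node(text=summarize([n.text for n in g]), level=len(levels), children=list(g))
            for g in groups
        ]
        levels.append(layer)
    return levels


def cluster_adjacent(nodes: list[Node], size: int = 3) -> list[list[Node]]:
    # STAND-IN (assumption): groups neighbours by position. RAPTOR instead
    # soft-clusters embeddings, so one node may land in several groups.
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def summarize_stub(texts: list[str]) -> str:
    # STAND-IN (assumption): a real system would call an LLM here.
    return "summary of: " + "; ".join(texts)


levels = build_tree([f"chunk {i}" for i in range(7)], cluster_adjacent, summarize_stub)
print([len(layer) for layer in levels])  # [7, 3, 1]
```

Because `build_tree` only needs a function that returns groups of nodes, a soft clusterer that places one node in two groups would work without changes; each parent keeps its own child list.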

The retrieval trick is what makes the tree pay off. In RAPTOR's "collapsed tree" mode, every node at every level (leaves and summaries alike) goes into one flat pool, and a query does ordinary top-k over all of them. A precise question ("what was the exact figure in Q3?") lands on a leaf; a thematic question lands on a high-level summary. The query finds its own altitude instead of being forced to one granularity. On the QuALITY reading-comprehension benchmark paired with GPT-4, RAPTOR reported roughly a 20% absolute accuracy gain.
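A toy version of collapsed-tree retrieval makes the "query finds its own altitude" behaviour concrete. The sketch below assumes each node is a `(level, text, vector)` tuple and uses made-up two-dimensional vectors in place of real embeddings; `cosine` and `collapsed_tree_search` are hypothetical helper names, and the pool contents are invented purely for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collapsed_tree_search(query_vec, nodes, k=1):
    """Rank leaves and summaries together in one flat pool and return the top k."""
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n[2]), reverse=True)
    return ranked[:k]


# Made-up pool: two leaves (level 0) and one summary (level 1).
# ASSUMPTION: the vectors are hand-picked toys, not real embeddings.
pool = [
    (0, "leaf: the exact Q3 figure", [1.0, 0.0]),
    (0, "leaf: a neighbouring Q3 detail", [0.9, 0.3]),
    (1, "summary: overall themes of the year", [0.1, 1.0]),
]

precise_hit = collapsed_tree_search([1.0, 0.1], pool)[0]
thematic_hit = collapsed_tree_search([0.0, 1.0], pool)[0]
print(precise_hit[0], thematic_hit[0])  # 0 1
```

Nothing in the search knows about levels. The precise query lands on a leaf and the thematic query on a summary only because of where their vectors sit, which is the whole appeal of the flat pool.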

The honest cost is summarisation noise. RAPTOR's nodes are LLM-written prose, not extracted entities, which makes the tree cheap to reason over but means every upper node is a lossy compression that can drop a name or a number. On mixed benchmarks that include simple factual questions, summary-heavy indexes can lose to plain retrieval: the HippoRAG 2 evaluation put RAPTOR at a 48.8 average F1, below a graph-based approach. Summaries help global questions and can hurt local ones. (See knowledge graphs for the structured alternative that trades summarisation for entity extraction.)

#Tiered memory: hot, warm, and cold (MemGPT / Letta)

RAPTOR organises a static corpus. MemGPT organises an agent's running memory by temperature, borrowing the operating-system idea of paging data between fast and slow storage. There are three tiers. Core memory is always in the prompt: a small block of persona and key user facts the agent can read and rewrite directly. Recall memory is the full conversation history, kept outside the window and searched on demand. Archival memory is a long-term store the agent queries through tool calls when it needs something cold. Core is RAM, archival is disk, and the model itself decides what to page in.

In modern Letta this is expressed as memory blocks. A block is a reserved section of context with a text value and a character limit; individual blocks like the human and persona blocks default to a 20,000-character budget, and the overall core unit can run up to 100,000 characters. The teachable detail is that the model is shown chars_current and chars_limit for each block on every turn. It can see its own budget and knows when a block is filling up, rather than discovering the ceiling by being silently truncated.
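To make the visible-budget idea concrete, here is a small sketch of a block that reports its own usage whenever it is rendered into the prompt. It is not Letta's API: the `MemoryBlock` class, its `append` and `render` methods, and the header layout are assumptions for illustration. Only the `chars_current` and `chars_limit` names come from the text above.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    label: str
    value: str
    chars_limit: int

    @property
    def chars_current(self) -> int:
        return len(self.value)

    def append(self, text: str) -> None:
        """Add a line, refusing loudly instead of truncating silently."""
        new_value = f"{self.value}\n{text}" if self.value else text
        if len(new_value) > self.chars_limit:
            raise ValueError(
                f"{self.label}: {len(new_value)} chars would exceed limit {self.chars_limit}"
            )
        self.value = new_value

    def render(self) -> str:
        # ASSUMPTION: hypothetical prompt layout; the real template may differ.
        header = (
            f"[{self.label}] chars_current={self.chars_current} "
            f"chars_limit={self.chars_limit}"
        )
        return f"{header}\n{self.value}"


block = MemoryBlock(label="human", value="Name: Sam", chars_limit=20)
print(block.render())
```

The design choice worth copying is the error path: an over-budget write fails with the numbers attached, so the model (or the developer) can decide what to trim rather than losing text without notice.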

Eviction follows the same OS metaphor. When the running history reaches about 70% of the window, the system injects a warning that tells the agent to save anything important into core or archival memory before messages are dropped. At 100% it flushes roughly half the window and folds what it evicted into a recursive summary, so a fixed window can simulate an unbounded one. The pattern: make the budget visible and give the model tools to manage it, instead of trimming behind its back. (More on decay and eviction in forgetting.)
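The warn-then-flush policy can be sketched as a small queue. This is an illustration under assumptions rather than MemGPT's implementation: `MessageQueue` and `count_tokens` are hypothetical names, tokens are approximated by whitespace-separated words, the summariser is a caller-supplied function standing in for an LLM, and the summary itself is not counted against the window. The 70% warning, the flush at 100%, and the roughly-half eviction come from the description above.

```python
def count_tokens(text: str) -> int:
    # ASSUMPTION: word count stands in for a real tokenizer.
    return len(text.split())


class MessageQueue:
    def __init__(self, window_tokens, summarize, warn_pct=70, evict_pct=50):
        self.window_tokens = window_tokens
        self.summarize = summarize  # (previous_summary, evicted_messages) -> new summary
        self.warn_pct = warn_pct
        self.evict_pct = evict_pct
        self.messages = []
        self.summary = ""
        self.warned = False

    def used(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, message: str) -> str:
        """Append a message and report what the memory manager did."""
        self.messages.append(message)
        used = self.used()
        if used >= self.window_tokens:
            self._flush()
            return "flushed"
        if used * 100 >= self.warn_pct * self.window_tokens and not self.warned:
            self.warned = True  # a real agent would be told to save key facts now
            return "warning"
        return "ok"

    def _flush(self) -> None:
        evicted, freed = [], 0
        # Drop the oldest messages until about evict_pct of the window is freed.
        while self.messages and freed * 100 < self.evict_pct * self.window_tokens:
            oldest = self.messages.pop(0)
            evicted.append(oldest)
            freed += count_tokens(oldest)
        # Fold evicted messages into the running summary (recursive summarisation).
        self.summary = self.summarize(self.summary, evicted)
        self.warned = False


queue = MessageQueue(
    window_tokens=10,
    summarize=lambda prev, evicted: f"{prev} [{len(evicted)} msgs]".strip(),
)
events = [queue.add("two words") for _ in range(5)]
print(events)  # ['ok', 'ok', 'ok', 'warning', 'flushed']
```

Passing the previous summary back into `summarize` is what makes the summary recursive: each flush compresses the old summary together with the newly evicted messages.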

#A production middle ground: category index plus load-on-demand

Between a static document tree and a self-editing agent sits the common consumer case: a person who has accumulated thousands of memories and connects them to ChatGPT, Claude, or Gemini. Injecting all of them is expensive and triggers Lost in the Middle. Worse (as covered in memory hygiene), a bloated, low-quality memory list can even give a safety-trained model a reason to ignore the whole protocol.

MemoryPlugin's "Smart Memory" handles this per bucket in three steps. First, identify four to six high-level categories for the bucket (or use the user's own presets), chosen to separate stable long-term traits from actively evolving projects and to stay broad enough that they keep working as more memories arrive. Second, assign each memory to exactly one category. Third, generate per category a dense summary under 200 words plus a separate block under 150 words describing what else lives in that category and when an AI should load it. At use time the system injects every category's summary and its "what else is here" block, plus roughly the 30 most recent memories, and instructs the model to load a category's full memories on demand (via a tool call) when the conversation turns to that topic. A bucket that would inject around 5,000 tokens of raw memory injects around 500 instead, about a 90% cut, and the prompt does not grow as the bucket does.

Figure 1. Summaries-first: cheap category overviews stay resident, and the model pulls one category's full memories only when the conversation needs them.
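The injection step described above might look like the following. This is a hedged sketch, not MemoryPlugin's implementation: `Category`, `build_context`, and `load_category` are hypothetical names, the prompt layout is invented, and the tool call is modelled as an ordinary function. What it shows is the split between what stays resident (summaries, load hints, recent memories) and what is fetched on demand (a category's full memories).

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    summary: str    # dense overview (under 200 words in the scheme above)
    load_hint: str  # what else lives here and when to load it (under 150 words)
    memories: list[str] = field(default_factory=list)


def build_context(categories: list[Category], recent: list[str], recent_limit: int = 30) -> str:
    """Assemble the resident prompt section: overviews plus recent memories only."""
    lines = ["Memory index. Call load_category(name) for a category's full memories."]
    for category in categories:
        lines.append(f"Category: {category.name}")
        lines.append(f"Summary: {category.summary}")
        lines.append(f"Load when: {category.load_hint}")
    lines.append("Recent memories:")
    lines.extend(f"- {m}" for m in recent[-recent_limit:])
    return "\n".join(lines)


def load_category(categories: list[Category], name: str) -> list[str]:
    """The on-demand fetch the model would trigger via a tool call."""
    for category in categories:
        if category.name == name:
            return category.memories
    raise KeyError(name)


# Invented example data, for illustration only.
categories = [
    Category(
        name="Work",
        summary="Backend engineer on a billing project.",
        load_hint="Load for project details and deadlines.",
        memories=["Deadline moved to the 14th", "Prefers PostgreSQL"],
    ),
]
recent = [f"recent {i}" for i in range(40)]
context = build_context(categories, recent)
```

The resident section grows with the number of categories and the recent-memory limit, not with the size of the bucket, which is why the prompt stays flat as memories accumulate.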

Two design choices keep it stable. There is a gate: buckets with fewer than about 30 memories are skipped entirely, because below that size the index costs more than it saves and you may as well inject everything. And categories are permanent once created; existing ones are reused rather than regenerated, so a user's mental map of their memory does not reshuffle on every run. A single pass is capped at 2,000 memories or 600,000 tokens, and the compression ratio is logged on every run.
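The guard rails in this paragraph reduce to a few small checks. The sketch below is an assumption-laden illustration: the function names are hypothetical, tokens are approximated by word count, and only the thresholds (a 30-memory gate, a cap of 2,000 memories or 600,000 tokens per pass, reuse of existing categories, a logged compression ratio) come from the text.

```python
def word_count(text: str) -> int:
    # ASSUMPTION: word count stands in for a real tokenizer.
    return len(text.split())


def should_index(memory_count: int, gate: int = 30) -> bool:
    """Below the gate, skip the index and inject everything instead."""
    return memory_count >= gate


def select_pass(memories, max_memories=2000, max_tokens=600_000, count_tokens=word_count):
    """Take memories in order until either per-pass cap would be exceeded."""
    batch, used = [], 0
    for memory in memories:
        tokens = count_tokens(memory)
        if len(batch) >= max_memories or used + tokens > max_tokens:
            break
        batch.append(memory)
        used += tokens
    return batch


def categories_for_run(existing, propose):
    """Reuse existing categories; only propose new ones when none exist yet."""
    return existing if existing else propose()


def compression_ratio(raw_tokens: int, injected_tokens: int) -> float:
    """Fraction of tokens saved, e.g. 5,000 raw vs 500 injected is about 0.9."""
    if raw_tokens <= 0:
        return 0.0
    return 1 - injected_tokens / raw_tokens


print(should_index(29), should_index(30))  # False True
```

Keeping `categories_for_run` this blunt is the point: once a category set exists, a later run never gets the chance to rename or reshuffle it.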

#The shared tradeoff

The risk is built into the idea. A summary is a lossy compression, and every level up is another place to lose a date, a quantity, or a proper noun. If a question needs the exact figure and the model never drills down, it answers from the overview and is confidently, subtly wrong. So the drill-down has to actually fire. RAPTOR keeps the leaves in the same searchable pool as the summaries, so precise queries still reach raw text directly; the category approach adds an explicit "when to load this" hint so the model knows the trigger.

Hierarchy earns its keep when the corpus dwarfs the context window and questions vary in altitude. When everything fits, it is pure overhead: under roughly 200K tokens, the honest move is to skip the machinery and put it all in the prompt (context vs memory sets that boundary).

#References

Sarthi et al., "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval," ICLR 2024, arxiv.org/abs/2401.18059 (UMAP + GMM soft clustering with BIC, recursive summary tree, collapsed-tree retrieval, ~20% gain on QuALITY with GPT-4). Code: github.com/parthsarthi03/raptor.
Packer et al., "MemGPT: Towards LLMs as Operating Systems," 2023, arxiv.org/abs/2310.08560 (virtual context management; core, recall, and archival tiers; 70% warning and 100% flush with recursive summarisation).
Letta (the production successor to MemGPT), github.com/letta-ai/letta (memory blocks with visible chars_current / chars_limit budgets; default character limits of 20,000 for the persona and human blocks, and up to 100,000 for the core unit).
Gutiérrez et al., "From RAG to Memory: Non-Parametric Continual Learning for LLMs" (HippoRAG 2), 2025, arxiv.org/abs/2502.14802 (the mixed-benchmark comparison placing RAPTOR at a 48.8 average F1; summarisation noise hurting simple factual QA).
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024, arxiv.org/abs/2307.03172 (why stuffing more context can hurt, motivating summaries-first designs).
MemoryPlugin "Smart Memory" (one production system): per-bucket four-to-six category index, summary under 200 words plus a load-hint block, summaries plus ~30 recent memories injected with load-on-demand, ~90% token reduction, and a 30-memory gate.