Why a Bigger Context Window Is Not Memory
A larger context window gives a model more room to work, not a memory; this page explains why inclusion is not usage, where the RAM-versus-disk line falls, and when you actually need a store.
"Won't a bigger context window just solve this?" is the first thing people say when you tell them you are building memory. The window keeps growing: 8K, then 128K, then a million tokens and counting. If the model can just read everything, why bother with a store, retrieval, and all the curation around it?
Because a context window is where a model thinks, not where it remembers. Everything in the window is reconstructed from scratch on every call and discarded when the session ends. Nothing carries to the next conversation unless something outside the model wrote it down. A bigger window gives the model more room to work in a single turn. It does not give it a past.
#Working memory and the store
The cleanest mental model comes from MemGPT, which framed the context window as RAM and an external store as disk, with the model paging data between the two (Packer et al., 2023). The community reaches for the same picture independently. One widely shared post put it as "treating context as memory is like treating RAM as a hard drive: it is volatile, expensive, and gets slower the more you fill it up."
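The RAM-and-disk picture can be made concrete with a toy pager. This is only a sketch of the analogy, not MemGPT's implementation: the class name `PagedContext`, the in-memory list standing in for a durable store, and the substring-based `recall` are all hypothetical choices made for illustration.

```python
from collections import deque


class PagedContext:
    """Toy working set: a bounded window that evicts old items to a store."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.window: deque[tuple[str, int]] = deque()  # (text, token count)
        self.store: list[str] = []  # stand-in for durable storage ("disk")
        self.used = 0

    def add(self, text: str, tokens: int) -> None:
        """Append to the window, paging the oldest items out when over budget."""
        self.window.append((text, tokens))
        self.used += tokens
        while self.used > self.max_tokens and len(self.window) > 1:
            old_text, old_tokens = self.window.popleft()
            self.store.append(old_text)  # written down outside the window
            self.used -= old_tokens

    def recall(self, needle: str) -> list[str]:
        """Look up evicted items by substring (a real system would rank them)."""
        return [text for text in self.store if needle in text]


ctx = PagedContext(max_tokens=10)
ctx.add("user name is Ada", 4)
ctx.add("user likes tea", 4)
ctx.add("meeting moved to noon", 4)  # pushes the oldest item out to the store
print([text for text, _ in ctx.window])
print(ctx.recall("Ada"))
```

The point of the sketch is the split of responsibilities: the window holds what is being worked on now, and anything evicted survives only because it was written somewhere else.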
The analogy holds where it counts. Context is volatile: close the tab and it is gone. It is expensive: every token in the window is re-read and paid for again on every call, so a conversation dragging a 200K-token history along is billed for all 200K each turn (prompt caching can discount a repeated prefix, but it does not make it free). And it is the working set, not the archive. Memory is the disk: a durable, editable store that survives sessions and loads only the relevant slice into the window when it is needed.
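The cost argument is simple arithmetic, sketched below. The function names and the example figures (100 turns, 2,000 new tokens per turn, a 4,000-token retrieved slice) are assumptions chosen for illustration, and the sketch counts input tokens only, ignoring output tokens and any caching discount.

```python
def full_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the entire history so far."""
    # Turn t carries t * tokens_per_turn tokens, so the total grows quadratically.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def sliced_tokens(turns: int, tokens_per_turn: int, slice_tokens: int) -> int:
    """Total input tokens when each turn sends the new message plus a fixed slice."""
    # Each turn costs the same, so the total grows linearly.
    return turns * (tokens_per_turn + slice_tokens)


print(full_history_tokens(100, 2_000))   # 10100000
print(sliced_tokens(100, 2_000, 4_000))  # 600000
```

Under these assumed numbers, carrying the full history processes roughly 17 times as many input tokens as loading a fixed slice, and the gap widens with every additional turn.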
The analogy breaks in one important way, and the break is the whole point of this page. RAM gives you uniform access: the millionth byte is as fast to read as the first. A context window does not. The model does not use all of its context equally.
#Lost in the middle
Liu et al. (2023) measured this directly. They moved the answer-bearing document around inside a long context and watched accuracy trace a U-shaped curve: models use information best when it sits at the very start or the very end of the window, and markedly worse when it sits in the middle. The dip shows up even in models built and marketed for long contexts, and when the needle is buried mid-context, performance can fall below the closed-book baseline: the model does worse with the document than with no document at all.
So inclusion is not usage. Putting a fact in the window does not guarantee the model will use it, and adding more marginally relevant material actively hurts by pushing the good stuff into the dead zone. This is the empirical reason "just give it everything" fails, and the reason retrieval, ranking, and careful placement keep mattering no matter how large the window gets.
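One way to act on the U-shaped curve is to order retrieved items so the strongest sit at the edges of the window and the weakest fall in the middle. The sketch below is a heuristic motivated by the finding, not a method prescribed by Liu et al.; the function name `place_at_edges` is hypothetical, and whether the reordering helps depends on the model.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder items ranked best-first so the best land at the start and end."""
    front: list[str] = []
    back: list[str] = []
    for rank, item in enumerate(ranked):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


docs = ["best", "second", "third", "fourth", "weakest"]
print(place_at_edges(docs))  # ['best', 'third', 'weakest', 'fourth', 'second']
```

Reordering only mitigates the middle dip for what is already included; it does not replace trimming marginal material before it reaches the window.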
#The real bottleneck is salience
Window size is not the scarce resource. Salience is: the hard question is never "can I fit it all" but "which few things should be in front of the model right now." One sharp community observation is that relevance shifts even within a single answer: part one of a response may need one fact and part three another, and a system that dumps a static blob into context cannot adapt to that. The job of a memory layer is selection: surface the handful of items that matter for this turn, in a position the model will actually read, and leave the rest on disk.
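Selection for a single turn can be sketched as "rank, then fill a small budget". Everything here is an illustrative assumption: the `MemoryItem` type, the word-overlap score (a real system would likely use embeddings or a learned ranker), and the pre-computed token counts.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    tokens: int  # assumed to be counted ahead of time


def score(item: MemoryItem, query: str) -> int:
    """Toy relevance: how many query words also appear in the item."""
    return len(set(query.lower().split()) & set(item.text.lower().split()))


def select_for_turn(
    store: list[MemoryItem], query: str, budget_tokens: int, top_k: int = 3
) -> list[MemoryItem]:
    """Pick at most top_k relevant items that fit inside the token budget."""
    ranked = sorted(store, key=lambda m: score(m, query), reverse=True)
    chosen: list[MemoryItem] = []
    used = 0
    for item in ranked:
        if len(chosen) == top_k:
            break
        if score(item, query) == 0:
            continue  # irrelevant items stay on disk
        if used + item.tokens > budget_tokens:
            continue  # does not fit; try a smaller item
        chosen.append(item)
        used += item.tokens
    return chosen


store = [
    MemoryItem("user prefers metric units", 4),
    MemoryItem("project deadline is friday", 4),
    MemoryItem("user lives in berlin", 4),
]
picked = select_for_turn(store, "what units does the user prefer", budget_tokens=8)
print([m.text for m in picked])
```

Because the query is an argument, the same store can be queried again mid-answer with a different question, which is what a static blob in the prompt cannot do.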
This is also where memory and retrieval stop being the same thing. Choosing what to keep, reconciling what changed, and deciding what to surface is a different problem from fitting text into a prompt. See retrieval-augmented generation for the read path, and RAG vs memory for why the right-looking chunk is not the same as the right remembered state.
#When you do not need memory yet
Memory earns its complexity at scale. Anthropic's guidance marks a useful, honest boundary: for a knowledge base under roughly 200K tokens (about 500 pages), skip the retrieval layer and put the whole thing in the prompt. Prompt caching makes that cheap, and you sidestep every retrieval failure mode by never retrieving. One practitioner's rule of thumb lands in the same place, treating ~200K as the comfortable working ceiling before answer quality starts to slide. Above that, a memory or retrieval layer starts paying for itself.
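The boundary above can be written as a small decision rule. This is a sketch under stated assumptions: the 200K figure is the rough threshold quoted above rather than a hard limit, the four-characters-per-token estimate is a crude rule of thumb for English text (use the model's own tokenizer for a real count), and the names `estimate_tokens` and `choose_strategy` are hypothetical.

```python
FULL_PROMPT_CEILING = 200_000  # tokens; rough guidance, not a hard limit


def estimate_tokens(text: str) -> int:
    """Crude estimate assuming about 4 characters per token."""
    return len(text) // 4


def choose_strategy(documents: list[str], ceiling: int = FULL_PROMPT_CEILING) -> str:
    """Return 'full_prompt' when the whole corpus fits under the ceiling."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return "full_prompt" if total < ceiling else "retrieval"


small_corpus = ["a short handbook " * 100]
large_corpus = ["x" * 1_000_000]  # about 250K tokens under the 4-char estimate
print(choose_strategy(small_corpus))  # full_prompt
print(choose_strategy(large_corpus))  # retrieval
```

The rule only covers the size question; a corpus that fits today but grows every session will cross the ceiling eventually, which is the case the rest of this page is about.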
#The honest opposing bet
There is a real counter-position, and it is not a strawman. One camp argues memory is a temporary problem that cheaper, longer context will erode: 10M-plus windows, linear-attention architectures, and better KV-cache quantisation could push the boundary up year over year. If windows get long and cheap enough, and if lost-in-the-middle gets solved, the threshold where you need a store moves up with them and more workloads simply fit. A builder even reported that for working and execution state, dedicated memory systems were far more expensive (one write-up claimed 14 to 77 times) and less accurate than just passing the full history, because running an extraction model on every message is its own cost. Treat that as a community claim rather than a settled number, but the direction is fair: do not reach for a memory pipeline before the window forces you to.
Two things survive even the optimistic case. You still re-process every token you keep in context, every turn, so cost scales with what you carry. And real histories grow without bound while any window is finite. As long as both hold, selection, deciding what to keep and what to surface, is the durable problem. The window is where the model thinks. Memory is what it gets to keep.
#References
- Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. https://arxiv.org/abs/2307.03172
- Anthropic (2024). Introducing Contextual Retrieval (the under-200K-token guidance and prompt-caching economics). https://www.anthropic.com/news/contextual-retrieval
- Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez (2023). MemGPT: Towards LLMs as Operating Systems (context-as-RAM, store-as-disk framing). arXiv:2310.08560. https://arxiv.org/abs/2310.08560
- r/LocalLLaMA discussion, "The 'Infinite Context' Trap: Why 1M tokens won't solve Agentic Amnesia" (community framing of RAM-vs-disk, salience, and the opposing bet). https://www.reddit.com/r/LocalLLaMA/comments/1qkrhec/