How MemoryPlugin Works

A teach-first case study of one production memory system, showing how MemoryPlugin applies explicit facts, chat-history retrieval, curation, and cross-tool injection, with the honest tradeoff behind each.

The rest of this guide argues in the abstract: extraction, hybrid retrieval, conflict resolution, hierarchical summaries, portability. This page picks one system that had to commit to real answers and shows what those choices look like in production, including where they bite. The system is MemoryPlugin, which this wiki's authors build. Read it as a worked example, not a sales pitch. The interesting parts are the tradeoffs, and there is one behind every feature.

The bet is narrow, and worth saying out loud. Every big assistant now ships its own memory, and every one is a walled garden: ChatGPT remembers you inside ChatGPT, Claude inside Claude. MemoryPlugin's wager is that the memory should be yours, not the app's, and that it should travel between them. Two layers carry the bet. The first is discrete, editable facts, short statements you can actually read and fix, built like the old ChatGPT memory rather than the newer summary blob nobody can see into. The second is stranger and more useful: your chat history itself becomes the memory, synced and searchable, on the theory that you can never guess in advance which conversation will matter six weeks from now. One rule holds across both: the AI suggests and retrieves, it never silently writes. Nothing lands in your store without you, and you can see, edit, version, export, and delete all of it.

#The memory layer: facts you can see and edit

#Explicit memories

Start with the oldest complaint in the field: the assistant forgets you the second the tab closes, so you introduce yourself again, and again. A memory here is a plain short fact ("works at MemoryPlugin", "prefers TypeScript"), kept in Postgres and embedded into a Zilliz vector collection. The embedding choice follows a lesson the embeddings page keeps hammering: Voyage's voyage-3.5-lite, truncated to 512 dimensions Matryoshka-style to halve storage and cost, indexed with HNSW under cosine similarity, with a BM25 sparse vector riding alongside for the exact strings embeddings blur. You add a fact by telling the assistant to remember it, or by accepting one it offers, and the write goes through a tool call, not magic text. Delete or merge one and it leaves a soft-delete trail pointing at what it became, so nothing simply disappears.
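To make the Matryoshka step concrete, here is a minimal sketch of truncating an embedding to its leading dimensions and re-normalising it so cosine similarity still behaves. The function names are hypothetical, and doing the truncation client-side is an assumption for illustration; the source does not say where in the pipeline it happens.

```python
import math


def truncate_embedding(vec: list[float], dims: int = 512) -> list[float]:
    """Keep the first `dims` components and rescale to unit length.

    Matryoshka-trained models pack the most useful signal into the leading
    dimensions, which is what makes a plain prefix cut workable.
    """
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Halving the dimension count halves the bytes stored per vector, which is where the storage and cost saving comes from.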

The catch is the deliberate flip side of "no silent writes": capture is not automatic. Somebody has to decide a fact is worth keeping, even if that somebody is the model when you prompt it (rag vs memory and smart extraction pull that apart). And write-time dedup is blunt on purpose: it catches only byte-identical text in the same bucket. The near-duplicates that say the same thing in different words are the curator's problem, two sections down.
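A rough sketch of what "byte-identical in the same bucket" means in practice. The class and method names are hypothetical, and hashing with SHA-256 is an assumption; the real store could just as well use a uniqueness constraint in Postgres.

```python
import hashlib


class MemoryStore:
    """Toy in-memory store that rejects exact duplicates per bucket."""

    def __init__(self):
        self._seen = set()   # (bucket_id, sha256 of the text bytes)
        self.memories = []   # accepted memories, in insertion order

    def add(self, bucket_id: str, text: str) -> bool:
        """Store the memory; return False if it is an exact duplicate."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (bucket_id, digest)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.memories.append({"bucket_id": bucket_id, "text": text})
        return True
```

Note what slips through: different casing, a trailing space, or a reworded sentence all hash differently, so they are stored and left for the curator.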

#Buckets

One global pile of facts curdles fast. Work leaks into personal, project A smears into project B, and shoving all of it into every prompt burns tokens for nothing. Buckets are folders that keep memories apart. Every account gets an undeletable General bucket, and the bucket is the unit everything advanced runs on: suggestions, Smart Memory, and the knowledge graph all work per bucket, never across your whole store at once. The price: the tidying is on you, and a memory lives in exactly one bucket.

#Suggestions: the curator

Ask anyone who has run AI memory for a month what goes wrong and the answer is always the same. It rots. Duplicates breed, facts start contradicting each other, junk creeps in and drags down recall. The fix is an offline curator that runs per bucket. For each memory it pulls the nearest neighbours into a small cluster, hands them to an LLM (a Gemini Flash model), and asks for one of three moves: bin the junk, fold near-duplicates together (keeping the oldest as the anchor), or reconcile a fact that has changed ("works at Google" becoming "works at MemoryPlugin", carried as a "was X, now Y" progression instead of a quiet overwrite).

Two guardrails make it safe to point a model at your memory. Nothing is ever auto-applied: a suggestion is inert until you accept it, and you can edit the text first. And every ID the model hands back is checked against the input set, then re-checked for ownership, before a single write, because models invent IDs with total confidence. The full design is in memory suggestions, the deeper contradiction question in updates and conflicts. The honest limit: this is a batch you review, not live cleanup, and it leans toward keeping things, so some near-duplicates survive on purpose.
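The ID guardrail can be sketched in a few lines. Everything here is illustrative: the `Suggestion` shape, the action names, and modelling ownership as a set of IDs are assumptions, not the production schema.

```python
from dataclasses import dataclass

# Hypothetical labels for the three moves: bin, fold together, reconcile.
ALLOWED_ACTIONS = {"delete", "merge", "update"}


@dataclass(frozen=True)
class Suggestion:
    action: str
    memory_ids: tuple


def validate_suggestions(suggestions, cluster_ids, owned_ids):
    """Keep only suggestions whose IDs were in the input cluster and
    belong to the requesting user. Anything else is dropped outright."""
    valid = []
    for s in suggestions:
        if s.action not in ALLOWED_ACTIONS:
            continue  # the model invented a move
        if not s.memory_ids:
            continue  # nothing to act on
        if any(mid not in cluster_ids for mid in s.memory_ids):
            continue  # the model invented an ID
        if any(mid not in owned_ids for mid in s.memory_ids):
            continue  # a real ID, but not this user's
        valid.append(s)
    return valid
```

Even a suggestion that passes both checks is only queued for review here; nothing in this sketch writes to the store.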

#Smart Memory: summaries first, load on demand

Even a spotless bucket has a ceiling. Inject all of it on every message and you bloat the context window, worse as the bucket grows. Smart Memory is the hierarchical dodge (hierarchical memory). Per bucket, an LLM sorts the memories into a handful of categories, writes a dense summary for each, and notes what detail is parked behind it and when fetching it is worth the trouble. At use time you get the summaries plus the most recent memories, and the assistant pulls a full category only when the conversation actually turns that way. The effect: roughly 5,000 tokens of raw memory are represented by about 500 tokens of summary. The cost: it needs enough memories to earn the overhead (the gate sits around 30), it skips very large buckets, the categorising is a manual action, and the categories are set once and stay, so reorganising means starting over.
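A minimal sketch of the use-time side, assuming the categories and summaries already exist. The names, the `recent_count` default, and the exact gate value are assumptions for illustration; only "around 30" comes from the text above.

```python
from dataclasses import dataclass, field

MIN_MEMORIES = 30  # the gate sits "around 30"; the exact value is an assumption


@dataclass
class Category:
    name: str
    summary: str
    memories: list = field(default_factory=list)


def build_context(memories, categories, recent_count=5):
    """Return the lines to inject. `memories` is ordered oldest to newest."""
    if len(memories) < MIN_MEMORIES or not categories:
        return list(memories)  # small bucket: raw facts are cheap enough
    lines = [f"[{c.name}] {c.summary}" for c in categories]
    if recent_count > 0:
        lines.extend(memories[-recent_count:])
    return lines


def load_category(categories, name):
    """What the assistant fetches when the conversation turns to `name`."""
    for c in categories:
        if c.name == name:
            return list(c.memories)
    return []
```

The saving comes from `build_context` returning a fixed handful of summary lines however large the bucket is, with `load_category` paying the full cost only on demand.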

#The knowledge graph

Past a few hundred memories you lose the plot of your own bucket. The per-bucket knowledge graph draws the map: an LLM pipeline extracts entities, dedupes them deterministically by name, reviews and merges the variants, then pulls out directed relationships, with a deliberately stronger model (a Claude Sonnet) on the final review because nothing downstream cleans up after it. That last choice is the whole philosophy in miniature. Use a fast, cheap model where a later stage will catch its mistakes; pay for the good one only where its output is the final word. The honest limit, which the knowledge graphs page develops: the graph is something you look at, not something recall runs through, so it does not directly sharpen answers, and it is a second store to keep in sync, not a cure for facts that change over time.
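The deterministic dedupe step is the cheap one, and a sketch shows why a model review still has to follow it. The normalisation rules below (Unicode folding, case folding, whitespace collapsing) are assumptions; the source says only that entities are deduped deterministically by name.

```python
import unicodedata


def normalise_name(name: str) -> str:
    """Fold Unicode forms and case, then collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def dedupe_entities(names):
    """Group raw entity names under their normalised key."""
    groups = {}
    for name in names:
        groups.setdefault(normalise_name(name), []).append(name)
    return groups
```

With these rules "MemoryPlugin" and "memoryplugin " collapse into one entity, but "Memory Plugin" stays separate, which is exactly the kind of variant left for the later merge review.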

#The chat-history layer: past conversations as memory

This is the actual wedge, the part that makes MemoryPlugin different. You cannot predict which conversation will matter later, so do not try. Rather than distil facts up front, keep the raw history and search it on demand. (Across tools makes the portability case in full.)

Sync. Two ways in: upload an export for a fast first import, or let the browser extension trickle conversations in from supported platforms as you go. Each one is parsed and chunked (around 256 tokens, light overlap), with a single chunking rule applied on purpose: role and timestamp prefixes live as structured fields and are stripped from the embedded text, because baking them in just adds noise. It is strictly opt-in. And the scale forces the design: a real export can run to 20 million tokens, so "just paste it into the context window" was never on the table. Chunk, index, retrieve, or nothing.
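The chunking rule might look something like this. It is a sketch under loud assumptions: whitespace-separated words stand in for real tokens, the overlap of 32 is a guess at "light", and the `Chunk` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    conversation_id: str
    role: str        # structured field, never embedded
    timestamp: str   # structured field, never embedded
    text: str        # the only field that gets embedded


def chunk_message(conversation_id, role, timestamp, text, size=256, overlap=32):
    """Split one message into overlapping windows of roughly `size` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = text.split()  # crude stand-in for a real tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(Chunk(conversation_id, role, timestamp, " ".join(window)))
        if start + size >= len(tokens):
            break  # this window already reached the end of the message
    return chunks
```

Role and timestamp ride along on every chunk for filtering and display, but `text` carries no "user:" or date prefix to dilute the embedding.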

Recall. Indexing thousands of chats is pointless unless the right fragments surface on their own. So the pipeline expands your query into variants that match what you probably said back then, not search-engine queries for the answer, re-adds the verbatim query so rare names do not get washed out, runs hybrid dense-plus-BM25 search fused by Reciprocal Rank Fusion, reranks with Voyage rerank-2.5-lite, widens the window around each hit, and lets an LLM judge relevance into a token budget. Why fuse by rank instead of summing raw scores is the whole point of hybrid retrieval. The tradeoffs are real: every recall adds a few seconds, and quality leans on how well the host model wields tools. The hard one is that flat chunks do not model how facts evolve. Struggle with a topic in March and master it by June, and recall may not know which version is the current you (temporal memory).
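The fusion step is small enough to show whole. This is a minimal sketch of Reciprocal Rank Fusion over ranked lists of chunk IDs; `k = 60` is the constant commonly used with RRF, and whether MemoryPlugin uses that value is an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of IDs into one.

    Each list contributes 1 / (k + rank) per item, so only positions
    matter and the incomparable raw scores are never touched.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a dense ranking `["a", "b", "c"]` with a BM25 ranking `["c", "a", "d"]` puts `a` first and `c` second, because both appear in both lists, ahead of `b` and `d`, which each appear in only one.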

Life Context. Raw history is too big to read, so a background job periodically renders it into a structured profile (work, personal, what is top of mind). The clever bit is what it refuses to believe: a topic has to turn up across several distinct conversations to count, and anything about other people, or material you were only helping with or critiquing, gets thrown out. It is the closest thing here to extracting facts from chat (smart extraction), done as a gated profile rather than silent per-message writes. The tradeoff: it is a snapshot, so it lags reality, and a genuinely new but rarely-mentioned fact is treated, by design, as probable noise.
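The "several distinct conversations" gate reduces to counting conversations rather than mentions. A sketch, with the threshold of 3 and the input shape as assumptions:

```python
from collections import defaultdict


def recurring_topics(mentions, min_conversations=3):
    """Return topics seen in at least `min_conversations` distinct chats.

    `mentions` is an iterable of (conversation_id, topic) pairs. Repeats
    inside a single conversation count once.
    """
    seen = defaultdict(set)
    for conversation_id, topic in mentions:
        seen[topic].add(conversation_id)
    return sorted(
        topic for topic, convs in seen.items() if len(convs) >= min_conversations
    )
```

A topic hammered five times in one long chat fails the gate, while one that surfaces briefly in three separate chats passes. That is also the stated blind spot: a genuinely new fact mentioned once stays out of the profile.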

#The connective layer: MCP and injection

All of it is worthless trapped in one place, so it reaches out through a couple of channels. Where the assistant can call a tool, it does: an MCP server (Claude, ChatGPT's connectors, Cursor and other MCP clients), local or hosted-and-remote over OAuth, plus a Custom GPT with Actions on ChatGPT. But every one of those connectors is off until the user adds and enables it, and a default chat has none wired up, so a browser extension injects memory and history straight into the page, inside ChatGPT, Claude, Gemini and other web apps. That second path is harder than it sounds and earns its own page: see memory without tool calling for why injecting your own memory can read to a model like an attack, and what to do about it. The tradeoffs in brief: the tool route only fires if the model decides to call it (a nudge in your instructions helps), and the extension rides each platform's page structure, so a vendor redesign can break it until it is patched.

#Images, files, and Ask

Three smaller pieces round it out, all premium and all sitting on top of the two core layers rather than inventing a third idea. Image memories let a text query surface a stored screenshot through a multimodal embedding (approximate, threshold-gated, in its own vector space). File buckets make uploaded PDFs, Word, and Markdown queryable with page-level citations (no scanned-image PDFs, 10 MB a file). And Ask is the one dashboard spot to query memories, history, or files and get cited answers back.

#Where it fits, and where it doesn't

It fits when you use several AI tools and want one memory across all of them, when you would rather see and correct what is stored than trust a black box, and when your best context is buried in chats you will never sit down and hand-curate. It fits badly if you need cryptographic privacy. The system is not end-to-end encrypted, and that is a deliberate call, not an oversight: the value comes from server-side search, summarisation, and synthesis, and real end-to-end encryption would kill all three. What you get instead is transparency and control (export, delete, audit, revoke), which is a genuine tradeoff to weigh with your eyes open rather than a guarantee. And it will not, today, reliably tell you which of two contradicting facts from your history is the current one. Nobody has solved that yet. It is the field's open problem, not a checkbox.

If you want to poke at the live version, it is at memoryplugin.com.