Smart Extraction

How a raw conversation becomes a small set of facts worth keeping, and the design choices that decide whether you remember the right ones.

A single chat turn often carries one fact worth keeping wrapped in forty words of packaging. "Yeah sounds good, but actually I moved to Berlin in March so ship it to the new address." The job of extraction is to pull "User moved to Berlin in March 2026" out of that, with the year filled in from the date of the conversation, and let the rest evaporate. Lean too far one way and your store fills with "User said yeah sounds good." Lean too far the other and you drop the address change you will need next week. Extraction is the model deciding, on your behalf, what is worth remembering, and almost every hard call in a memory system traces back to how that decision is made.

#The extraction call

Extraction is a single LLM call. You hand the model a turn, or a short window of turns, and it returns structured output: a list of candidate facts, usually as JSON. The first thing a good extraction prompt does is teach the model to refuse. mem0's fact-retrieval prompt does this with its smallest example. The input "Hi." returns an empty list. "Hi, my name is John. I am a software engineer." returns two facts, "Name is John" and "Is a software engineer." Greetings produce nothing; substance produces atoms.
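A minimal sketch of that call, assuming the model is asked to reply with JSON of the form {"facts": [...]}. The prompt text, the `call_llm` parameter, and the `fake_llm` stand-in are hypothetical and are not mem0's actual prompt or API. The point is that an empty list is a valid, expected answer, and that malformed output should produce no candidates rather than a guess.

```python
import json
from typing import Callable

# Hypothetical prompt; a production extraction prompt is far longer and carries few-shot examples.
EXTRACTION_PROMPT = (
    "Extract facts worth remembering from the conversation below. "
    'Reply with JSON of the form {"facts": ["..."]}. '
    "Return an empty list for greetings and small talk.\n\n"
)

def extract_facts(turn: str, call_llm: Callable[[str], str]) -> list[str]:
    """Run one extraction call and parse the structured output defensively."""
    raw = call_llm(EXTRACTION_PROMPT + turn)
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output: propose nothing
    if not isinstance(payload, dict):
        return []
    facts = payload.get("facts", [])
    if not isinstance(facts, list):
        return []
    # Keep only non-empty strings.
    return [f.strip() for f in facts if isinstance(f, str) and f.strip()]

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if "my name is John" in prompt:
        return '{"facts": ["Name is John", "Is a software engineer"]}'
    return '{"facts": []}'
```

Calling `extract_facts("Hi.", fake_llm)` yields an empty list, while the John example yields the two facts from the text.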

That call sits inside a short pipeline. The model proposes candidates, the candidates are checked against what you already have, and only what survives gating gets written. Everything downstream depends on the quality of the proposal step, because a fact that is never extracted can never be retrieved.

[Figure: extraction pipeline, per turn. Conversation turn → LLM extract → candidate facts ("likes oat milk", "lives in Berlin", "Hi!") → dedup + gate (frequency + provenance) → stored memories ("likes oat milk", "lives in Berlin"); "Hi!" is discarded.]
Figure 1. From conversation to stored fact: an LLM extraction call proposes candidate facts, which pass through dedup and gating before anything is written.
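The propose, check, gate, write sequence can be sketched in a few lines. This is an illustration under simplifying assumptions: deduplication here is a normalized exact match (real systems often compare embeddings), and `has_substance` is a hypothetical gate invented for the example.

```python
from typing import Callable, Iterable

def normalize(fact: str) -> str:
    """Case- and whitespace-insensitive key used for duplicate detection."""
    return " ".join(fact.lower().split())

def write_memories(
    candidates: Iterable[str],
    store: list[str],
    gate: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Return (written, discarded); only novel candidates that pass the gate reach the store."""
    seen = {normalize(m) for m in store}
    written: list[str] = []
    discarded: list[str] = []
    for fact in candidates:
        key = normalize(fact)
        if key in seen or not gate(fact):
            discarded.append(fact)
            continue
        seen.add(key)
        store.append(fact)
        written.append(fact)
    return written, discarded

def has_substance(fact: str) -> bool:
    """Hypothetical gate: drop candidates shorter than three words."""
    return len(fact.split()) >= 3

store = ["likes oat milk"]
written, discarded = write_memories(
    ["Likes oat milk", "lives in Berlin", "Hi!"], store, has_substance
)
```

Here the duplicate and the greeting are discarded and only "lives in Berlin" is written, which mirrors the figure.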

#Atomic facts, or rich ones?

For a long time the doctrine was atomicity: break everything into the smallest standalone statements. "User has a dog." Clean, easy to deduplicate, easy to reason about. mem0's V3 rewrite reversed that. Its current extractor is told to prefer contextual richness: "User has a dog named Poppy and their morning walks together are the highlight of their day" beats the bare version, because the bare version strips the connective tissue retrieval later needs to surface it for the right query. Transitions matter too. "User switched from almond milk to oat milk lattes after developing an almond sensitivity" captures the old state, the new state, and the reason in one record.

The guardrails make this safe rather than rambling. mem0 sets a budget of roughly 15 to 80 words per memory, up to 100 when the detail warrants it, and forbids dropping a proper noun, title, date, or number to hit a word count. "Osteria Francescana," not "a new restaurant." The honest cost: richer facts are harder to deduplicate by exact match, so near-duplicates accumulate and the burden of picking the current one shifts to the read path. mem0 accepts that trade deliberately; how the read path copes with it is the subject of updates and conflicts.
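Those guardrails are easy to check mechanically after extraction. The sketch below assumes the caller already knows which specifics (names, dates, numbers) appeared in the source turn; detecting them automatically is out of scope. The function names and status labels are hypothetical, and the thresholds simply restate the budget described above.

```python
SOFT_MIN, SOFT_MAX, HARD_MAX = 15, 80, 100  # word budget from the text

def budget_status(memory: str) -> str:
    """Classify a memory against the word budget."""
    n_words = len(memory.split())
    if n_words > HARD_MAX:
        return "over"      # exceeds even the extended limit
    if n_words > SOFT_MAX:
        return "extended"  # allowed only when the detail warrants it
    if n_words < SOFT_MIN:
        return "short"     # may have lost useful context
    return "ok"

def dropped_specifics(memory: str, specifics: list[str]) -> list[str]:
    """Return the source specifics that no longer appear in the memory."""
    lowered = memory.lower()
    return [s for s in specifics if s.lower() not in lowered]
```

A memory flagged "short" is not necessarily wrong, since the budget is approximate, but a non-empty result from `dropped_specifics` is the case the rule forbids outright.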

#Anchor every date to when it was said

"User went to Paris last week" is a useful memory for about seven days and a liability after that. The fix is temporal grounding: resolve every relative date against the date the conversation happened, not the date you are processing it. mem0's extractor is explicit that "last week" must be converted using the observation date, so the memory becomes "User went to Paris the week of May 15, 2023" and stays meaningful forever.
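As an illustration of the arithmetic, here is a small sketch that resolves "last week" against an observation date. It assumes weeks start on Monday and that the observation date is stored with the transcript; the function names and the example date of May 25, 2023 are assumptions made for the example. In practice the model does this resolution inside the extraction call, so the code only shows the intended result.

```python
from datetime import date, timedelta

def resolve_last_week(observed_on: date) -> date:
    """Return the Monday of the week before the observation date."""
    this_monday = observed_on - timedelta(days=observed_on.weekday())
    return this_monday - timedelta(days=7)

def ground_memory(text: str, observed_on: date) -> str:
    """Replace the relative phrase 'last week' with an absolute week reference."""
    monday = resolve_last_week(observed_on)
    stamp = f"the week of {monday.strftime('%B')} {monday.day}, {monday.year}"
    return text.replace("last week", stamp)

# The conversation happened on Thursday, May 25, 2023.
grounded = ground_memory("User went to Paris last week", date(2023, 5, 25))
# grounded == "User went to Paris the week of May 15, 2023"
```

Passing the processing date instead of the observation date would silently produce a different week, which is exactly the error the rule exists to prevent.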

#When in doubt, extract

mem0's extractor carries a blunt instruction: when uncertain, extract. A redundant memory is far cheaper than a missing one, because deduplication can catch a true duplicate later but nothing can recover a fact you never wrote down. This guards against a failure mode worth knowing by name: first-topic dominance. On a multi-topic turn, the model tends to mine the first subject thoroughly and then treat everything after it as filler, silently dropping the later facts. mem0 fights it with an explicit checklist. For a conversation of ten or more messages you should typically end up with five to fifteen memories, and if you have fewer than three, re-read before finishing.
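The same checklist can be run as a cheap post-check outside the prompt. This sketch is an assumption-heavy illustration: it supposes each extracted memory is tagged with the index of the message it came from, and the "front_loaded" heuristic (every memory drawn from the first half of the conversation) is invented here as one possible signal of first-topic dominance.

```python
def coverage_flags(n_messages: int, memory_sources: list[int]) -> dict[str, bool]:
    """Flag extraction runs that look too thin or too concentrated at the start.

    memory_sources holds, per extracted memory, the 0-based index of its source message.
    """
    # Checklist from the text: ten or more messages but fewer than three memories.
    too_few = n_messages >= 10 and len(memory_sources) < 3
    # Hypothetical heuristic: nothing was extracted from the second half.
    midpoint = n_messages // 2
    front_loaded = bool(memory_sources) and all(i < midpoint for i in memory_sources)
    return {"too_few": too_few, "front_loaded": front_loaded}
```

Either flag would be a reason to re-run extraction over the window before writing anything.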

#Who actually said it

A transcript has multiple voices, and attributing a fact to the wrong one poisons the store. "User was recommended Osteria Francescana" is a different memory from "User stated they love Osteria Francescana." mem0 ships role-specialised prompts for this, and hardens the user-only variant with a deliberately stern line: "YOU WILL BE PENALIZED IF YOU INCLUDE INFORMATION FROM ASSISTANT OR SYSTEM MESSAGES." The all-caps penalty framing is a real tactic, not decoration; attribution drift is common enough to warrant the threat. The same concern covers group chats, where the "assistant" turn may be a named human ("Maria: I just got a cat named Bailey") whose facts get attributed by name, and pasted documents, where the move is to extract the content ("Bajimaya v Reward Homes, construction began 2014") rather than the act ("User shared a case summary").
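Prompt instructions can be backed by how the transcript is assembled before the call. The sketch below shows two hypothetical ways to build the extraction window: one keeps only user turns, the other keeps every turn but labels the speaker. The `Message` type and both function names are assumptions for illustration and do not reflect how mem0 builds its input.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    role: str       # "user", "assistant", or "system"
    content: str
    name: str = ""  # optional speaker name, useful in group chats

def user_only_window(messages: list[Message]) -> str:
    """Keep only user turns so assistant or system text cannot be mined as user facts."""
    return "\n".join(f"user: {m.content}" for m in messages if m.role == "user")

def labelled_window(messages: list[Message]) -> str:
    """Keep every turn, prefixed with the speaker, so facts can be attributed by name."""
    return "\n".join(f"{m.name or m.role}: {m.content}" for m in messages)
```

Filtering is the stricter option but removes context that short user replies may depend on, so labelling plus a firm prompt instruction is the gentler trade.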

#How much does this matter?

Not every true fact deserves equal weight, and you can score that at write time. Generative Agents, the Stanford simulation that first made memory ranking concrete, asks the LLM to rate each memory's importance on a poignancy scale of 1 to 10: making the bed scores a 1, a breakup or a college acceptance scores a 10. That score is stored with the memory and later combined with recency (an exponential decay) and relevance (embedding similarity) when ranking what to surface. Scoring salience as you write keeps trivia from drowning out the events that matter. The ranking and decay machinery itself is the subject of forgetting.
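A rough sketch of how the three signals might be combined at read time. The hourly decay factor of 0.995, the equal weighting, and dividing importance by 10 are assumptions for illustration rather than a faithful reproduction of the paper's normalization; the vectors stand in for real embeddings.

```python
import math

def recency(hours_since_access: float, decay: float = 0.995) -> float:
    """Exponential decay: 1.0 for a memory touched just now, approaching 0 over time."""
    return decay ** hours_since_access

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieval_score(
    importance: int,            # LLM-rated poignancy, 1 to 10, stored at write time
    hours_since_access: float,
    query_vec: list[float],
    memory_vec: list[float],
) -> float:
    """Sum of importance, recency, and relevance, each scaled to roughly [0, 1]."""
    return importance / 10 + recency(hours_since_access) + cosine(query_vec, memory_vec)
```

With recency and relevance held equal, a memory rated 10 always outranks one rated 1, which is the effect write-time scoring is meant to have.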

#The other stance: propose, do not write

Everything above assumes the system should mine memories from chat automatically. MemoryPlugin takes the opposite position by design: no silent writes. The AI curates and retrieves, but it does not quietly create memory rows from your conversations; creation stays explicit, driven by a person or an intentional tool call. The reasoning is earned, not theoretical.

The closest thing to extract-from-chat in that system is deliberately not a memory writer. It is Life Context, a structured profile synthesised from chat history through retrieval, producing a profile document rather than memory rows. The gating is what makes it trustworthy. A topic has to appear in three or more distinct conversations to be included; something seen in only one or two is treated as likely noise and excluded. On top of that sits a provenance guard: exclude information about other people, content the user is merely helping someone else with (a friend's resume pasted in for review), or posts the user is only critiquing. Frequency answers "is this persistent and actually about the user," provenance answers "is this even theirs."
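The two gates compose naturally: filter by provenance first, then count distinct conversations. The sketch below is an illustration, not MemoryPlugin's implementation. The `Mention` type is hypothetical, and the `about_user` flag is assumed to come from an upstream judgment (for example an LLM classification) that is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    topic: str
    conversation_id: str
    about_user: bool  # False for other people, material the user is helping with, or posts being critiqued

def profile_topics(mentions: list[Mention], min_conversations: int = 3) -> list[str]:
    """Return topics that are about the user and appear in enough distinct conversations."""
    conversations: dict[str, set[str]] = {}
    for m in mentions:
        if not m.about_user:
            continue  # provenance guard
        conversations.setdefault(m.topic, set()).add(m.conversation_id)
    # Frequency gate: count distinct conversations, not raw mentions.
    return sorted(t for t, ids in conversations.items() if len(ids) >= min_conversations)
```

Counting conversation IDs in a set matters: a topic repeated five times in one long chat still counts once.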

When the model proposes rather than writes, a human-in-the-loop curator catches what extraction gets wrong before it lands; that workflow is the subject of memory suggestions. Rich, well-attributed, time-grounded facts are worth this much care because retrieving a similar-looking chunk is not the same as remembering the right state, which is the distinction at the heart of RAG versus memory.
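The propose-rather-than-write stance amounts to putting a queue between extraction and the store. A minimal sketch, with a hypothetical `SuggestionQueue` class that is not drawn from any particular product: candidates sit in `pending` until a person or an intentional tool call approves them.

```python
from dataclasses import dataclass, field

@dataclass
class SuggestionQueue:
    pending: list[str] = field(default_factory=list)
    store: list[str] = field(default_factory=list)

    def propose(self, fact: str) -> None:
        """Extraction lands here; nothing is written to the store yet."""
        self.pending.append(fact)

    def approve(self, index: int) -> str:
        """An explicit decision moves a suggestion into the store."""
        fact = self.pending.pop(index)
        self.store.append(fact)
        return fact

    def reject(self, index: int) -> str:
        """A rejected suggestion is dropped without touching the store."""
        return self.pending.pop(index)

queue = SuggestionQueue()
queue.propose("User moved to Berlin in March 2026")
queue.propose("User said yeah sounds good")
queue.reject(1)
queue.approve(0)
```

After these calls the store holds only the address change and the pending list is empty.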

#References

  • mem0, open-source memory layer for AI agents: github.com/mem0ai/mem0. Extraction prompts (fact-retrieval, role-specialised attribution, V3 additive extraction with contextual richness, temporal grounding, and the when-in-doubt and first-topic-dominance instructions) are in the repository, with the V2 to V3 redesign narrated in the project's oss-v2-to-v3 migration documentation.
  • Park et al., 2023, "Generative Agents: Interactive Simulacra of Human Behavior": arxiv.org/abs/2304.03442. Source of the memory stream and the LLM-rated importance (poignancy 1 to 10) used to score salience at write time.
  • MemoryPlugin (one production system): the no-silent-writes stance and the frequency-plus-provenance gating of synthesised profiles (Life Context) are described as build details, framed here as illustrative rather than a blueprint.