Knowledge Graphs

How graph-structured memory answers the multi-hop and whole-corpus questions flat vector search cannot, from GraphRAG and HippoRAG to a production per-bucket pipeline, and how to tell whether the graph is earning its build cost.

Flat similarity search is good at one thing: finding the chunk that looks most like your query. Two kinds of question slip straight through it. The first is multi-hop, where the answer lives in the connection between facts that sit in different places: "Which of my projects use a library that the person who reviewed my last PR also maintains?" No single stored memory contains that sentence, so no single embedding is close to it. The second is global, the whole-corpus kind: "What are the recurring themes across everything I have saved?" Again, the answer is a synthesis that no individual chunk holds. Knowledge graphs are the field's answer to both. They trade a heavier write path for structure that you can actually traverse, and the honest version of the story is that the trade is not always worth it.

#The architecture spectrum

There is no single "memory" architecture, only a spectrum ordered by how much structure you impose at write time. At the zero-structure end is flat vector RAG: embed chunks, store them, retrieve the top-K by cosine similarity. One step up is a recursive summary tree like RAPTOR, which clusters and summarises chunks bottom-up so a query can land on raw detail or a high-level theme (see hierarchical memory). Next is the entity graph (GraphRAG, LightRAG, HippoRAG), where extraction pulls entities and relations out of the text. Beyond it sits the temporal graph (Graphiti), which adds validity intervals to every fact so contradictions invalidate rather than overwrite (see temporal memory). Further still are self-organising note networks (A-MEM), where a new note can rewrite its neighbours, and OS-style tiered buffers (Letta), where the agent pages its own memory in and out. Each step toward more structure buys better multi-hop and global reasoning, and each costs more to build and maintain.

Figure 1. More structure (flat vectors to summary tree to entity graph to temporal graph to self-organising notes to tiered buffers) buys better multi-hop and global reasoning at a higher index-time cost.

#Entity-relationship graphs

The core idea is small. Nodes are entities (people, products, projects, concepts) and edges are facts, usually a directed triple of source, relation, target. "Alice works at Stripe" becomes an Alice node, a Stripe node, and a works_at edge between them. For personal memory this produces a hub: a central user node, with entities radiating out and inter-entity edges threading them together. A graph's richness is mostly in those inter-entity edges, the ones that do not touch the user node, because they are what let traversal answer a question that a star of disconnected facts cannot.

Figure 2. A small entity-relationship graph: a central user node, entities as nodes, and labelled directed edges carrying the facts that connect them.
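
As a minimal sketch of this data model (not any particular library's API), the triples can be held as plain records. The `Edge` class, the `USER` id, and every example fact other than "Alice works at Stripe" are illustrative assumptions.

```python
from dataclasses import dataclass

USER = "user"  # assumed id for the central user node


@dataclass(frozen=True)
class Edge:
    """One fact: a directed (source, relation, target) triple."""
    source: str
    relation: str
    target: str


# Illustrative facts; only "Alice works_at Stripe" comes from the text above.
edges = [
    Edge(USER, "knows", "Alice"),
    Edge(USER, "works_on", "Project Atlas"),
    Edge("Alice", "works_at", "Stripe"),
    Edge("Project Atlas", "uses", "Stripe"),
]


def inter_entity_edges(edges):
    """Edges that do not touch the user node."""
    return [e for e in edges if USER not in (e.source, e.target)]


def two_hop_targets(edges, start):
    """Nodes reachable from `start` in exactly two directed hops."""
    first = {e.target for e in edges if e.source == start}
    return {e.target for e in edges if e.source in first}


print(inter_entity_edges(edges))
print(two_hop_targets(edges, USER))
```

Remove the two inter-entity edges and the two-hop set is empty: a star of facts around the user gives traversal nothing to follow.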

#GraphRAG: communities and global questions

Microsoft's GraphRAG targets the global question directly. It inserts three stages into the normal index loop. An LLM extracts entities and relationships from each chunk (default chunk size 1200 tokens, with a "gleanings" pass that re-prompts the model to catch entities it missed). The Leiden algorithm then partitions the graph into a hierarchy of communities, from small specific clusters up to broad themes. Finally, an LLM writes a natural-language summary report for each community. At query time, Global Search runs map-reduce over those community summaries: each summary produces a partial answer scored for helpfulness, and the partials are reduced into a final answer. There is also Local Search (entity-centric, "tell me about X"), DRIFT (local plus global with follow-up refinement), and Basic (plain vector RAG as a baseline).
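
The map-reduce shape of Global Search can be sketched in a few lines. This is not the GraphRAG library's API: `global_search`, `toy_map`, and `toy_reduce` are hypothetical names, the two callables stand in for LLM calls, and dropping zero-scored partials and the `top_n` cut-off are assumptions made for this sketch.

```python
from typing import Callable, List, Tuple

# Both callables stand in for LLM calls; their signatures are assumptions.
MapFn = Callable[[str, str], Tuple[str, int]]   # -> (partial answer, helpfulness score)
ReduceFn = Callable[[str, List[str]], str]      # -> final answer


def global_search(query, community_summaries, map_fn, reduce_fn, top_n=3):
    # Map: one partial answer plus a helpfulness score per community summary.
    partials = [map_fn(query, summary) for summary in community_summaries]
    # Keep only partials judged helpful, most helpful first.
    useful = sorted((p for p in partials if p[1] > 0), key=lambda p: p[1], reverse=True)
    # Reduce: combine the best partials into one answer.
    return reduce_fn(query, [answer for answer, _ in useful[:top_n]])


def toy_map(query, summary):
    """Toy scorer: counts summary words that also occur in the query."""
    query_words = set(query.lower().split())
    score = sum(1 for word in summary.lower().split() if word in query_words)
    return summary, score


def toy_reduce(query, partial_answers):
    """Toy reducer: joins the partial answers."""
    return " | ".join(partial_answers)


summaries = [
    "billing themes recur",
    "cooking notes",
    "themes about billing and themes",
]
print(global_search("recurring themes billing", summaries, toy_map, toy_reduce))
```

Every community summary is touched once per query in the map step, which is why this mode suits whole-corpus questions and is overkill for a lookup about one entity.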

The tradeoff is measured, not a vibe. Graph construction takes at least one LLM call per chunk to extract (more when gleanings are enabled) plus one per community to summarise, so the index is expensive and static: indexing one multi-hop QA corpus reportedly took on the order of 115 million input tokens. Worse, the LLM-written summaries inject generated text that can hurt simple factual recall. The HippoRAG 2 paper reports GraphRAG averaging 49.6 F1 against its own 59.8 across a mix of tasks. Summarisation is double-edged: it unlocks sensemaking and dilutes precision at once.

#HippoRAG: one graph walk instead of many LLM calls

HippoRAG attacks multi-hop from a different angle, borrowing the hippocampal indexing theory of human memory. The LLM is the neocortex that extracts a graph of triples once, offline. At query time it does no iterative LLM-in-the-loop reasoning at all. It links the query's entities to graph nodes, seeds them, and runs Personalized PageRank over the graph. A single walk follows the multi-hop associations that iterative retrievers need many round-trips for. Seeding is weighted by node specificity, an IDF-like signal where a node that appears in few passages counts for more, so rare, specific concepts pull harder.
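
A minimal sketch of the query-time walk, assuming an undirected graph and plain power iteration. This is not HippoRAG's code: the function names, the example graph, the passage counts, and the damping value are all illustrative assumptions. It also assumes every seed node exists in the graph.

```python
def build_undirected(edge_pairs):
    """Adjacency sets for an undirected graph given (a, b) pairs."""
    adj = {}
    for a, b in edge_pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def personalized_pagerank(adj, seeds, damping=0.5, iters=50):
    """Power iteration; `seeds` maps node -> non-negative restart weight."""
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in adj}
    rank = dict(restart)
    for _ in range(iters):
        # With probability (1 - damping) the walk jumps back to a seed node.
        nxt = {n: (1 - damping) * restart[n] for n in adj}
        for n, neighbours in adj.items():
            # Otherwise it moves to a uniformly chosen neighbour.
            share = damping * rank[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        rank = nxt
    return rank


graph = build_undirected([
    ("alice", "stripe"),
    ("stripe", "payments-lib"),
    ("payments-lib", "project-x"),
    ("bob", "project-y"),  # separate component, unrelated to the query
])
# Assumed passage counts for the entities linked from the query.
passages_containing = {"alice": 1, "stripe": 2}
# IDF-like specificity: nodes seen in fewer passages get a larger seed weight.
seeds = {node: 1.0 / count for node, count in passages_containing.items()}

rank = personalized_pagerank(graph, seeds)
for node in sorted(rank, key=rank.get, reverse=True):
    print(f"{node:14s} {rank[node]:.3f}")
```

Nodes two and three hops from the seeds receive score from a single walk, while the unrelated component receives none. Passages would then be ranked by the scores of the nodes they contain.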

The numbers are the selling point. HippoRAG reported gains of up to 20% on multi-hop QA, with single-step retrieval matching or beating iterative methods while being 10 to 30 times cheaper and 6 to 13 times faster. Its index is far lighter than GraphRAG's too: roughly 9 million input tokens on the same corpus that cost GraphRAG around 115 million, about a thirteenth as many. The lesson worth taking: a graph plus a cheap graph algorithm can deliver associative, single-shot retrieval that vanilla vector search structurally cannot, without paying for per-community summarisation.

#Building one for real: a per-bucket pipeline

In MemoryPlugin the knowledge graph is generated per bucket (one directed concept graph for a given user and bucket), and the pipeline alternates cheap extraction with quality review. A fast, lean model (a Gemini Flash-Lite tier) extracts entities first, in parallel batches, with every self-reference ("I", "me", "my") collapsed into one central user node. A deterministic pass merges entities by a normalised slug, so "Node.js" and "NodeJS" become one node for free. A second LLM pass reviews the entities, folding in spelling and abbreviation variants and stripping noise (abstract values, feelings, jargon), with the user node protected. Only then does relationship extraction run over the cleaned list, under a hard rule: an honest graph with fewer edges beats one full of forced connections, so a generic "related to" is banned and "part of" must mean literal containment. A stronger model (a Claude Sonnet tier) does the final review, where judgment matters most: dropping fabricated edges, fixing wrong verbs or directions, adding high-confidence missing ones. A last deterministic prune removes any entity that touches only the user node and appeared in a single memory.
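
The two deterministic passes are simple enough to sketch. This is an illustration, not MemoryPlugin's code: the slug rule (lowercase, ASCII letters and digits only), the data shapes, and the reading of the prune rule as "no inter-entity edge and seen in one memory" are assumptions.

```python
import re


def slug(name):
    """Lowercase and drop everything that is not an ASCII letter or digit."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_by_slug(mentions):
    """mentions: (entity_name, memory_id) pairs -> {slug: {"name", "memories"}}."""
    merged = {}
    for name, memory_id in mentions:
        node = merged.setdefault(slug(name), {"name": name, "memories": set()})
        node["memories"].add(memory_id)
    return merged


def prune(nodes, edges, user="user"):
    """Drop entities that only touch the user node and appear in one memory."""
    connected = set()
    for source, _relation, target in edges:
        if user not in (source, target):
            connected.update((source, target))  # has an inter-entity edge
    return {
        key: node
        for key, node in nodes.items()
        if key == user or key in connected or len(node["memories"]) > 1
    }


mentions = [
    ("user", "m1"), ("Node.js", "m1"), ("NodeJS", "m2"),
    ("Redis", "m3"), ("Kayaking", "m4"),
]
nodes = merge_by_slug(mentions)
edges = [
    ("user", "uses", "nodejs"),
    ("nodejs", "connects_to", "redis"),
    ("user", "enjoys", "kayaking"),
]
kept = prune(nodes, edges)
print(sorted(nodes))
print(sorted(kept))
```

Here "Node.js" and "NodeJS" collapse to the single slug `nodejs`, and `kayaking` is pruned because it appears once and connects to nothing but the user.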

The principle underneath is model routing by task shape. Extraction fans out across many parallel calls and has two review stages behind it, so a cheap model is fine because the reviews clean up after it. The final review has nothing downstream to catch its mistakes, so it gets the stronger model. (The cost and latency page develops this asymmetry.)

#Ontology grounding, so the edges mean something

Left to its own devices, an LLM will happily invent edges. Cognee's answer is to validate extracted triples against a domain ontology during its Cognify stage, so a relation that does not fit the schema is rejected before it pollutes the graph. Graphiti reaches the same goal through typed entity and edge definitions (Pydantic models), with a map of which relation types are even allowed between which entity types. You can prescribe the ontology up front or let structure emerge from the data, but the grounding is what keeps a graph from filling with plausible-sounding garbage.
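
The grounding step reduces to a lookup, sketched here with plain dictionaries. This is neither Cognee's nor Graphiti's API: `ENTITY_TYPES`, `ALLOWED`, and `validate` are hypothetical names, and the types and relations are invented for illustration.

```python
# Assumed entity typing, as it might come out of extraction.
ENTITY_TYPES = {
    "Alice": "Person",
    "Stripe": "Company",
    "Project Atlas": "Project",
}

# Which relations are allowed between which (source type, target type) pairs.
ALLOWED = {
    ("Person", "Company"): {"works_at", "founded"},
    ("Project", "Company"): {"uses_service_of"},
}


def validate(triples, entity_types, allowed):
    """Split triples into those that fit the schema and those that do not."""
    accepted, rejected = [], []
    for source, relation, target in triples:
        type_pair = (entity_types.get(source), entity_types.get(target))
        if relation in allowed.get(type_pair, set()):
            accepted.append((source, relation, target))
        else:
            rejected.append((source, relation, target))
    return accepted, rejected


extracted = [
    ("Alice", "works_at", "Stripe"),      # fits Person -> Company
    ("Stripe", "works_at", "Alice"),      # wrong direction
    ("Alice", "related_to", "Stripe"),    # relation not in the schema
    ("Alice", "works_at", "Initech"),     # unknown entity type
]
accepted, rejected = validate(extracted, ENTITY_TYPES, ALLOWED)
print(accepted)
print(rejected)
```

A prescribed schema like this rejects wrong directions and vague relations cheaply, before any LLM review has to look at them.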

#Is the graph actually helping?

Graph traversal adds latency, and at scale that latency spikes; one builder described a voice agent with a dense graph producing pauses long enough to make callers ask for a human. A blunter community refrain: if you cannot make the case for a graph yourself, you probably do not need GraphRAG, and dense technical documents often benefit more from better chunking that preserves tables and cross-references. A lighter alternative covers many needs without a second store to keep in sync: typed structured records (entity, attribute, value, timestamp) make updates into actual updates, with relationships as reference fields in the same store, and you reach for true traversal only when you need it (temporal memory covers this fork).
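
A minimal sketch of the typed-record alternative, assuming an in-memory store keyed by (entity, attribute). The `Record` and `RecordStore` names and the example data are hypothetical; a real system would keep these rows in whatever store it already runs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Record:
    entity: str
    attribute: str
    value: str  # a literal, or another entity's id when used as a reference
    timestamp: datetime


class RecordStore:
    """Keeps one current record per (entity, attribute) pair."""

    def __init__(self):
        self._rows = {}

    def upsert(self, record):
        key = (record.entity, record.attribute)
        current = self._rows.get(key)
        # A newer value replaces the old one instead of piling up beside it.
        if current is None or record.timestamp >= current.timestamp:
            self._rows[key] = record

    def get(self, entity, attribute):
        row = self._rows.get((entity, attribute))
        return row.value if row else None

    def follow(self, entity, reference_attr, attribute):
        """One hop: read `attribute` on the entity `reference_attr` points to."""
        target = self.get(entity, reference_attr)
        return self.get(target, attribute) if target else None


store = RecordStore()
store.upsert(Record("alice", "employer", "stripe", datetime(2023, 1, 1)))
store.upsert(Record("alice", "employer", "acme", datetime(2024, 6, 1)))
store.upsert(Record("acme", "city", "Berlin", datetime(2024, 6, 1)))
print(store.get("alice", "employer"))
print(store.follow("alice", "employer", "city"))
```

The reference field gives a single hop without a second store. Once questions need several hops or path search, that is the signal to consider true traversal.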

So diagnose the failure you are seeing before building a graph. Wrong-looking chunk returned? That is a chunking or hybrid-retrieval problem, and a graph will not fix it. Plausible chunks returned but no way to connect facts across them, or no way to answer a whole-corpus question? That is where a graph earns its considerable cost. The landscape page maps which tool fits which case.

#References

Edge et al., "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (2024). https://arxiv.org/abs/2404.16130 (code https://github.com/microsoft/graphrag, blog https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/).
Gutiérrez et al., "HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models" (NeurIPS 2024). https://arxiv.org/abs/2405.14831 (repo https://github.com/OSU-NLP-Group/HippoRAG). Follow-up "From RAG to Memory" (HippoRAG 2), https://arxiv.org/abs/2502.14802 (the 49.6 vs 59.8 F1 and index-cost comparisons).
Cognee, ontology-grounded ECL knowledge-graph pipeline. https://github.com/topoteretes/cognee
Graphiti (Zep), typed temporal knowledge graph with custom entity/edge schemas. https://github.com/getzep/graphiti
r/Rag, "Cognee vs Graphiti vs Mem0" (graph latency at scale, "is the graph even helping vs better chunking," reported as practitioner experience).