AI Memory
I — Foundations · 8 min

Embeddings and Vector Search

How text becomes geometry, why cosine is usually a dot product, and the dimension, truncation, and input-type gotchas that quietly decide retrieval quality.

You have fifty thousand stored memories and a new question arrives. You cannot read all fifty thousand to find the handful that matter, and a plain keyword match misses "Puma" when the memory says "my new running shoes." Embeddings solve both problems by turning text into geometry: similar meaning becomes nearby points, and "find the relevant memories" becomes "find the nearest points to this one."

#What an embedding actually is

An embedding model maps a string to a fixed-length list of floating-point numbers, a vector. "I switched to Puma" and "my new running shoes are great" land close together; "I deployed the API to prod" lands far away, in a different region. No single number in the vector means anything on its own. Only relative position carries signal, so the only useful operation is comparing one vector to another.

The models you will actually meet are a short list. OpenAI's text-embedding-3-small and -large, Voyage's voyage-3.5 family, and open-weight options like the BGE series cover most production systems. The two sister products behind this guide pick differently for their domains: AskLibrary (a chat-with-your-books RAG product) embeds book chunks with text-embedding-3-small at its default 1536 dimensions, while MemoryPlugin embeds short user memories with voyage-3.5-lite.

[Figure 1 diagram: a 2-D projection of the embedding space showing the query vector, its nearest neighbours (the retrieved memories), and the other stored memories. Similar meaning = nearby vectors.]
Figure 1. Embeddings place similar text near each other. An ANN index finds the query's neighbours with a few graph hops instead of comparing against every stored vector.

#Cosine, dot product, and the normalisation trick

There are two common ways to score how close two vectors are. Cosine similarity measures the angle between them. Dot product measures that same angle but weighted by how long each vector is. For text you almost always want cosine, because a vector's length is an artifact of the model, not a feature of the meaning, and you do not want a longer passage to score higher just for being long.
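A minimal sketch of the difference, using made-up 2-D vectors rather than real embeddings. `long_passage` points in exactly the same direction as `short_passage` but is three times as long, so the dot product rewards it while cosine treats the two as identical.

```python
import math


def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Dot product divided by both lengths, leaving only the angle.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustrative vectors only; real embeddings have hundreds of dimensions.
query = [2.0, 1.0]
short_passage = [1.0, 2.0]
long_passage = [3.0, 6.0]  # same direction as short_passage, three times the length

dot_short = dot(query, short_passage)      # 4.0
dot_long = dot(query, long_passage)        # 12.0
cos_short = cosine(query, short_passage)   # about 0.8
cos_long = cosine(query, long_passage)     # about 0.8
```

The dot product triples with the vector's length; the cosine score does not move.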

Here is the trick most production systems use. If you normalise every vector to length 1 before you store it, cosine similarity and dot product become the identical number, so you can drop the magnitude division and compute a plain dot product on the hot path. Supermemory does exactly this: it normalises embeddings to unit vectors once, then treats every comparison as a dot product, and clamps negative scores to zero. The speed win is real but small. The bigger reason to pre-normalise is correctness, because it strips magnitude out of the comparison entirely so you are always measuring direction.
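The pattern can be sketched in a few lines. The function names here are hypothetical and the vectors are toy values; the clamp mirrors the behaviour described above rather than reproducing any particular codebase.

```python
import math


def normalise(vec):
    # Scale a vector to length 1 so only its direction remains.
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / length for x in vec]


def score(unit_a, unit_b):
    # For unit vectors the dot product is the cosine similarity.
    # Negative scores are clamped to zero.
    return max(0.0, sum(x * y for x, y in zip(unit_a, unit_b)))


# Normalise once at write time, then every comparison is a plain dot product.
stored = normalise([3.0, 4.0])       # [0.6, 0.8]
query = normalise([4.0, 3.0])        # [0.8, 0.6]
opposite = normalise([-3.0, -4.0])   # points the other way

similar_score = score(stored, query)      # about 0.96
clamped_score = score(opposite, query)    # 0.0 after clamping
```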

#You don't scan every vector: ANN and HNSW

Comparing a query against every stored vector is a brute-force scan, linear in the number of memories times the dimension. That is fine for a thousand vectors and hopeless for ten million. Approximate Nearest Neighbour (ANN) indexes buy back the speed by accepting "almost the closest" instead of "provably the closest."
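For reference, the brute-force baseline looks roughly like this. It is a sketch that assumes the vectors are already unit length, so the dot product serves as the similarity score; `brute_force_top_k` is a hypothetical name.

```python
import heapq


def brute_force_top_k(query, vectors, k):
    # Score every stored vector against the query: one pass over all of them,
    # with work proportional to (number of vectors) x (dimension).
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # Keep the k highest-scoring (score, index) pairs, best first.
    return heapq.nlargest(k, scored)


# Toy unit vectors standing in for stored memories.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [1.0, 0.0]

top_two = brute_force_top_k(query, vectors, k=2)
top_two_ids = [idx for _, idx in top_two]  # [0, 2]
```

This is exact, and it is the yardstick an ANN index's recall is measured against.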

The dominant ANN structure is HNSW (Hierarchical Navigable Small World): a layered proximity graph where search starts at an entry node and greedily hops to closer and closer neighbours, descending layers until it lands in the query's neighbourhood. It touches a tiny fraction of the vectors. MemoryPlugin builds its 512-dimension memory index with HNSW (M=16, efConstruction=200) under a cosine metric. The search-time candidate width (ef) is the knob you turn to trade latency for recall: widen it and you find more true neighbours but spend more time.
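To make the role of ef concrete, here is a deliberately simplified sketch: a single-layer graph search in the style HNSW runs within one layer, over a toy graph built by brute force. It is not full HNSW (no layers, no incremental construction), and every name in it is hypothetical. The vectors are 200 unit vectors spaced evenly on a circle so the result is easy to check by hand.

```python
import heapq
import math


def cosine_distance(a, b):
    # For unit vectors, cosine distance is 1 minus the dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def build_toy_graph(vectors, m):
    # Toy construction: link every node to its m nearest neighbours by brute force.
    # Real HNSW builds the graph incrementally and adds sparse upper layers.
    graph = {}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine_distance(v, vectors[j]))
        graph[i] = others[:m]
    return graph


def graph_search(graph, vectors, query, entry, ef, k):
    d0 = cosine_distance(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    best = [(-d0, entry)]       # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # the closest candidate cannot improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = cosine_distance(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # evict the current worst
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _, node in ranked[:k]], len(visited)


# 200 unit vectors evenly spaced on a circle; the query sits just past vector 50.
n = 200
vectors = [
    [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
    for i in range(n)
]
angle = 2 * math.pi * 50.3 / n
query = [math.cos(angle), math.sin(angle)]

graph = build_toy_graph(vectors, m=4)
narrow_ids, narrow_visited = graph_search(graph, vectors, query, entry=0, ef=1, k=1)
wide_ids, wide_visited = graph_search(graph, vectors, query, entry=0, ef=8, k=3)
```

Both searches touch only a fraction of the 200 vectors, and the wider ef visits a few more nodes in exchange for a larger result set. The walk here is still long because this toy graph has only short links; long-range shortcuts are what HNSW's upper layers contribute. In production you would normally use an existing HNSW library or your vector database's index rather than hand-rolling one.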

#Dimensions and Matryoshka truncation

More dimensions give the model more room to encode distinctions, but every extra dimension is more bytes on disk and more arithmetic per comparison. You do not always need the model's full width.

Matryoshka Representation Learning trains a model so that the first k numbers of a vector are themselves a usable embedding, nested like Russian dolls. You ask for fewer dimensions and keep most of the quality. MemoryPlugin uses this deliberately: it forces Voyage to emit 512 dimensions instead of the model's native 1024, halving storage and speeding every comparison for a small, measured quality cost. Graphiti (the temporal-graph engine behind Zep) applies the same idea to a different model, truncating text-embedding-3-small from its native 1536 dimensions down to 1024.

Two rules come with truncation. Only truncate a model that was trained for it; lopping the tail off a non-Matryoshka vector destroys the meaning rather than compressing it. And re-normalise after you truncate, because a prefix of a unit vector is not itself unit length, which quietly breaks the cosine-equals-dot-product assumption from the section above.
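The second rule is easy to verify by hand. The sketch below uses a toy 4-dimensional unit vector (not a real Matryoshka embedding) and a hypothetical helper, `truncate_and_renormalise`, to show that the raw prefix falls short of unit length until it is rescaled.

```python
import math


def length(vec):
    return math.sqrt(sum(x * x for x in vec))


def truncate_and_renormalise(unit_vec, k):
    # Keep the first k dimensions, then rescale back to unit length.
    prefix = unit_vec[:k]
    prefix_length = length(prefix)
    if prefix_length == 0.0:
        raise ValueError("prefix has zero length; cannot renormalise")
    return [x / prefix_length for x in prefix]


full = [0.5, 0.5, 0.5, 0.5]    # unit length: 4 * 0.25 = 1
raw_prefix = full[:2]          # length sqrt(0.5), about 0.707
fixed = truncate_and_renormalise(full, 2)

full_length = length(full)
raw_prefix_length = length(raw_prefix)
fixed_length = length(fixed)
```

If you skip the rescale, dot products between truncated vectors come out smaller than the true cosine, and by a different amount for every pair.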

#The asymmetric input gotcha

This one is easy to miss and quietly leaves recall on the table. Several modern embedding models accept an input type: query for the short thing a user typed, document for the longer text you stored. The model embeds the two slightly differently on purpose, because a question and the passage that answers it are not phrased alike. A question is short and interrogative; the answer is declarative and detailed. Telling the model which side it is looking at improves the match between them.

If you embed both your stored text and your search queries with the same input type, you give up that improvement for nothing. It is a common mistake, and an honest one to own: in MemoryPlugin's current build, search queries are embedded with the document input type, the same as stored memories, a quality lever we have not pulled yet. When your model supports asymmetric inputs, embed stored text as document and the user's query as query.
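One way to make the mistake hard to repeat is to give the two sides separate entry points. In this sketch `embed` is a stand-in for whatever client call your provider offers, assumed to accept an `input_type` keyword (Voyage's Python client takes an argument of this kind; check your provider's documentation for the exact name and values). `fake_embed` exists only so the example runs without a network call.

```python
def embed_memories(embed, texts):
    # Stored text is always embedded as a document.
    return embed(texts, input_type="document")


def embed_query(embed, question):
    # The user's search text is always embedded as a query.
    return embed([question], input_type="query")[0]


# Hypothetical stand-in for a real embedding client; it records which
# input type it was asked for and returns dummy vectors.
calls = []


def fake_embed(texts, input_type):
    calls.append(input_type)
    return [[float(len(text)), 1.0] for text in texts]


doc_vectors = embed_memories(fake_embed, ["I switched to Puma"])
query_vector = embed_query(fake_embed, "what shoes do I wear?")
```

With the input type fixed inside each helper, call sites cannot accidentally embed a query as a document.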

#Embeddings are the cheap part of the system

Keep the cost in perspective so you optimise the right thing. Embedding text is close to free. voyage-3.5-lite runs around two cents per million tokens, roughly $0.00002 per thousand. The LLM calls that surround the vectors (extracting facts, deduplicating, writing summaries, building knowledge graphs) cost orders of magnitude more per token. So embed generously, re-embed when you need to, and spend your cost-control attention on the model calls, not the vectors.
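A quick back-of-the-envelope check, using the price quoted above and the fifty thousand memories from the opening example. The 40-tokens-per-memory figure is an assumption chosen for illustration, not a measurement.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # dollars, the approximate figure quoted above


def embedding_cost(n_items, tokens_per_item, price_per_million=PRICE_PER_MILLION_TOKENS):
    # Total tokens times the per-token price.
    total_tokens = n_items * tokens_per_item
    return total_tokens * price_per_million / 1_000_000


# 50,000 memories at an assumed 40 tokens each = 2,000,000 tokens.
full_reembed_cost = embedding_cost(50_000, 40)  # about $0.04
```

Under these assumptions, re-embedding the entire store costs a few cents.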

One corollary: if you ever change embedding models or dimensions, every stored vector has to be re-embedded with the new model, because vectors from two different models are not comparable. Supermemory plans for this by keeping two embedding columns per memory (an old one and a new one), so it can dual-write and cut search over to the new model without downtime.
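The dual-column approach can be sketched as follows. This illustrates the general pattern, not Supermemory's actual schema: the field names, helper functions, and fake embedders are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryRow:
    text: str
    embedding_old: List[float]
    embedding_new: Optional[List[float]] = None  # empty until migrated


def write_memory(text, embed_old, embed_new):
    # Dual-write: new memories get both vectors from day one.
    return MemoryRow(text, embed_old(text), embed_new(text))


def backfill(rows, embed_new):
    # Re-embed older rows that only have the old vector.
    filled = 0
    for row in rows:
        if row.embedding_new is None:
            row.embedding_new = embed_new(row.text)
            filled += 1
    return filled


def can_cut_over(rows):
    # Search may switch to the new column once every row has a new vector.
    return all(row.embedding_new is not None for row in rows)


# Fake embedders with different dimensions, standing in for two models.
def old_model(text):
    return [float(len(text)), 0.0]


def new_model(text):
    return [0.0, float(len(text)), 1.0]


rows = [
    MemoryRow("legacy memory", old_model("legacy memory")),
    write_memory("fresh memory", old_model, new_model),
]
ready_before = can_cut_over(rows)
backfilled = backfill(rows, new_model)
ready_after = can_cut_over(rows)
```

Search keeps reading the old column until `can_cut_over` is true, so queries never compare vectors from two different models.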
