
Information Retrieval is, at its core, about finding things similar to what you are looking for. From library card catalogs to Google Search to modern RAG pipelines, the premise is that similarity implies relevance. This mirrors how the human mind works: intuition emerges from similar experiences stored in memory.
But what does “similar” mean? Two documents can be similar in topic, writing style, length, author, or sentiment. A medical paper and a patient forum post might discuss the same disease but share almost nothing else. Similarity is not a single thing. It is a lens: a choice of what to pay attention to and what to ignore. The angle from which you look for similarity determines what you find, and that angle in turn shapes how relevance is evaluated.
There are many ways to measure similarity. Graph networks capture relationships. Knowledge bases encode explicit links. Sparse methods like BM25 count term overlap. But the dominant primitive in LLM-era retrieval is vector similarity: represent everything as points in a high-dimensional latent space, then find nearby points. I will not dive into the academic debates over how much RAG is needed versus what the base model can handle natively.1 Instead, I want to focus on practical implementation: how do we determine relevance, and how do we make it work reliably in mission-critical AI applications?
Encoding Relatedness from Primitives
Two questions guide this section: how is similarity defined at its core, and how do we use it to calculate relevance?
Closeness is the natural proxy for similarity. Start with the simplest case: three dots on a number line at positions 2, 3, and 9. Which pair is most alike? Subtract and take absolute values. The distance from 2 to 3 is 1; from 2 to 9 it is 7. So 2 and 3 are closer and thus more “similar.” This is the entire idea, the primitive; everything else generalizes from it.
Now add a second dimension. We will skip the pairwise comparison here and focus on defining distance itself. Visualize two points: Point A sits at coordinates (1, 2); Point B sits at (4, 6). You could walk there Manhattan-style: 3 steps east, then 4 steps north, for a total of 7 steps. Or you could walk directly, cutting diagonally across: the Pythagorean theorem gives $\sqrt{3^2 + 4^2} = 5$. Same two points, different answers. The Manhattan path sums coordinate differences. The Euclidean path computes the hypotenuse. Neither is wrong; they encode different philosophies about what distance means.
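The two walking strategies can be computed directly. A minimal plain-Python sketch:

```python
import math

def manhattan(a, b):
    # Sum of absolute coordinate differences: the "walk the grid" path.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line (Pythagorean) distance: the diagonal shortcut.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A, B = (1, 2), (4, 6)
print(manhattan(A, B))  # 7 steps along the grid
print(euclidean(A, B))  # 5.0 straight across
```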
Add a third dimension to reach the reality we live in, and the pattern continues. Go further still, into spaces where intuition fails, such as 768 dimensions, and the same logic applies. You can’t visualize it, but the math is the same. This is where text embeddings reside.
But why do we have to deal with such high dimensionality with text? Let me explain. For machines to find similar things, they need a representation that makes similarity computable. Text is variable-length, symbolic, and riddled with ambiguity (not to mention the different languages across the world). Machines need something more structured. The modern answer is embeddings: fixed-length lists of floating-point numbers, typically 768 or 1536 dimensions. Pass any text through an embedding model, and you get a vector. These models are trained so that semantically related inputs produce vectors that are close together in the latent space. “Dog” lands closer to “canine” than to “cat”; questions and their answers cluster. The geometry becomes a proxy for meaning. Relatedness collapses into the same question we already answered: which points are closer?
This question of closeness is precisely what modern retrieval systems answer at scale. Vector databases index billions of embeddings and return nearest neighbors in milliseconds. The entire RAG paradigm depends on this: embed the query, find nearby document embeddings, feed those documents to the LLM as context. When the embedding model captures the correct notion of similarity, this works remarkably well.
But here is the critical point: geometry is learned, not given. Once the dimensions and the similarity metric are fixed, where entities land in the latent space is not determined in advance. Embedding models optimize a loss function, typically a contrastive objective2 that pushes related pairs together and unrelated pairs apart. The positions are artifacts of training, not inherent properties of concepts. Different models place the same concepts in different locations.
Furthermore, embedding models are typically separate systems from the LLMs they serve: encoders trained for similarity, not decoders trained for generation. The geometry that determines what gets retrieved may not align perfectly with how the LLM understands meaning. The geometry is contingent on multiple levels, which makes the precise definition of ‘closer’ all the more consequential.
Before going further, two mathematical definitions are worth stating clearly. A norm is a function that measures the “size” or “length” of a vector. If you know absolute value, you already understand the idea: absolute value measures the size of a single number (its distance from zero), while a norm generalizes this to vectors with many components. It takes a vector and returns a single non-negative number, written as $\|v\|$. A unit vector is simply a vector whose norm equals 1. Any vector can be converted to a unit vector by dividing it by its norm, a process called normalization. Unit vectors preserve direction while discarding magnitude.
The Lp norms are a family of norms indexed by a single parameter:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

The distance between two vectors is then the norm of their difference: $d_p(x, y) = \|x - y\|_p$. When p equals 1, you get the L1 norm and Manhattan distance. When p equals 2, you get the L2 norm (also called the Euclidean norm) and the familiar straight-line distance. L2 dominates in practice: it is the default in most vector databases and the basis for cosine similarity. As p approaches infinity, only the largest single component matters. Each choice encodes a different philosophy: L1 says many minor disagreements accumulate; L2 balances aggregate and individual deviations; L-infinity says one bad mismatch ruins everything.
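The whole family fits in a few lines. A sketch, with the L-infinity limit written out separately since it is a max rather than a sum:

```python
def lp_norm(v, p):
    # General Lp norm; p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x) ** p for x in v) ** (1 / p)

def lp_distance(a, b, p):
    # Distance is the norm of the difference vector.
    return lp_norm([x - y for x, y in zip(a, b)], p)

def linf_distance(a, b):
    # The limit as p -> infinity: only the largest mismatch matters.
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [1, 2], [4, 6]
print(lp_distance(a, b, 1))   # 7.0  (L1 / Manhattan)
print(lp_distance(a, b, 2))   # 5.0  (L2 / Euclidean)
print(linf_distance(a, b))    # 4    (L-infinity)
```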
But there is another way to think about closeness: direction rather than position. Earlier, I noted that the angle you choose shapes how you evaluate similarity. Here, angle becomes literal. Two vectors might be far apart in space but still point in the same direction. Cosine similarity measures exactly this:

$$\cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
The result ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction). By normalizing away magnitudes, cosine similarity ignores how “long” the vectors are and focuses purely on where they point.
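A minimal implementation makes the range of values concrete; the input vectors are arbitrary examples:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine_similarity(a, b):
    # Angle-based similarity: magnitude divides out, only direction remains.
    return dot(a, b) / (norm(a) * norm(b))

print(cosine_similarity([1, 0], [0, 1]))    # 0.0: perpendicular
print(cosine_similarity([1, 2], [2, 4]))    # ~1.0: same direction, different length
print(cosine_similarity([1, 2], [-1, -2]))  # ~-1.0: opposite directions
```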
The relationship between distance and angle is worth understanding precisely. For any two vectors, the squared L2 distance expands to:

$$\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,(x \cdot y)$$
That last term, the dot product, is what cosine similarity normalizes. For unit vectors, where both lengths equal 1, the formula simplifies:

$$\|x - y\|^2 = 2 - 2\,(x \cdot y) = 2\,(1 - \cos\theta)$$
This is the key equivalence: minimizing distance is mathematically equivalent to maximizing cosine similarity for unit vectors. They produce the same ranking. But for vectors of different lengths, they can disagree substantially. Two short vectors near the origin might be close in distance but point in entirely different directions. Two vectors pointing the same way might be far apart if one is much longer than the other. Distance cares about both where you end up and how far you traveled. Angle only cares about which direction you are pointing.
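The identity is easy to verify numerically. The input vectors below are arbitrary made-up values, projected onto the unit sphere first:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Divide by the L2 norm to get a unit vector.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

x = normalize([3.0, 1.0, 2.0])
y = normalize([1.0, 4.0, 0.5])

cos_sim = dot(x, y)              # for unit vectors, cosine = dot product
dist_sq = l2_distance(x, y) ** 2

# ||x - y||^2 and 2(1 - cos) agree up to floating-point error.
print(dist_sq, 2 * (1 - cos_sim))
```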
Consider ‘animal,’ ‘dog,’ and ‘canine.’ All three point in a similar direction: they inhabit the same semantic territory, and cosine similarity rates them as related. But ‘dog’ and ‘canine’ are synonyms; they should be nearly identical, not just directionally aligned. L2 distance captures this distinction: ‘dog’ and ‘canine’ sit close together in space, while ‘animal’ sits farther away despite pointing in the same direction. Cosine tells you they are related. L2 tells you how closely.
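A toy illustration with hand-made 2-D vectors (not real embeddings; the numbers are invented purely to make the geometry visible):

```python
import math

def cosine(a, b):
    d = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return d / (na * nb)

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# All three point the same way, but 'animal' sits much farther out.
dog    = [1.0, 1.0]
canine = [1.1, 1.1]
animal = [3.0, 3.0]

print(cosine(dog, canine), cosine(dog, animal))  # both ~1.0: same direction
print(l2(dog, canine), l2(dog, animal))          # ~0.14 vs ~2.83
```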
This is why most systems normalize embeddings to unit length before indexing. Normalization projects everything onto a hypersphere, where L2 distance and cosine similarity reduce to a single measure: nearby points have the same direction. The choice of metric becomes a matter of computational convenience rather than semantic significance. However, normalization removes magnitude; if the embedding model encodes meaningful information in vector length, that information is lost.
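A quick numerical check of that equivalence, using made-up vectors: after normalization, ranking by cosine similarity and ranking by L2 distance produce the same order.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.2, 0.9, 0.1]
docs = {"d1": [0.3, 0.8, 0.2], "d2": [0.9, 0.1, 0.4], "d3": [0.1, 1.0, 0.0]}

# Rank once by cosine (descending), once by L2 on normalized vectors (ascending).
by_cos  = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
by_dist = sorted(docs, key=lambda d: l2(normalize(query), normalize(docs[d])))

print(by_cos == by_dist)  # True: on the unit sphere the rankings agree
```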
The primitive question remains: which points are closer? The answer depends on how you measure, what you normalize, and what the embedding model learned. The geometry is contingent at every level. With these foundations in place, we can examine how production systems navigate these tradeoffs.
The Pragmatics of Proximity
Modern retrieval systems use a two-stage architecture that effectively leverages these metrics. The first stage runs a fast approximate search, scanning billions of embeddings and returning hundreds of candidates in milliseconds. Speed matters here; precision is secondary. Cosine similarity or inner product works because the goal is coverage, not perfection: do not miss the relevant documents. The second stage applies a more sophisticated model, a cross-encoder or late-interaction model like ColBERT3, that jointly processes the query and document. Here, the model makes nuanced judgments that simple vector comparison cannot.
The insight is that first-stage retrieval does not need to be perfect. It needs not to miss. The reranker fixes ordering; the retriever sets the search space. This explains why cosine similarity dominates despite its limitations: for candidate generation at scale, “good enough” in milliseconds beats “perfect” in seconds.
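The two-stage pattern can be sketched in a few lines. The `embed` and `rerank_score` functions below are hypothetical stand-ins (a character-count vector and word overlap) for a real embedding model and cross-encoder; only the control flow mirrors production systems:

```python
import math

def embed(text):
    # Stand-in "embedding": normalized bag-of-characters vector.
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord('a')] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def rerank_score(query, doc):
    # Stand-in "cross-encoder": fraction of query words the doc shares.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def two_stage_search(query, corpus, k_candidates=3, k_final=1):
    # Stage 1: fast, recall-oriented vector scan over the whole corpus.
    qv = embed(query)
    candidates = sorted(corpus, key=lambda d: cosine(qv, embed(d)),
                        reverse=True)[:k_candidates]
    # Stage 2: slower, precision-oriented rerank of the shortlist only.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

corpus = ["dogs are loyal pets", "cats sleep all day", "stock markets fell today"]
print(two_stage_search("are dogs loyal", corpus))
```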
A few practical considerations follow.

- **Match your inference metric to your training objective.** If your embedding model was trained with dot-product loss, use dot products at inference. Mismatches introduce subtle errors.
- **Normalize consistently.** Either normalize during training or at inference, but not haphazardly. Post-hoc normalization on embeddings not trained for it can distort the geometry.
- **Do not trust absolute scores.** A cosine similarity of 0.85 means nothing in isolation. Scores are model-dependent and dataset-dependent. Focus on ranking metrics like recall.
- **Use hybrid retrieval.** Combine dense vectors with a lexical search, such as BM25. They fail differently: semantic search misses when query phrasing diverges from the training distribution; lexical search misses when the meaning does not share vocabulary. Together, they cover more ground.
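One common way to combine dense and lexical results is reciprocal rank fusion (RRF), which needs only the two ranked lists, not comparable scores. A minimal sketch with made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranked list contributes 1 / (k + rank) per document;
    # documents ranked well by either retriever float to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking   = ["d3", "d1", "d2"]   # from vector search
lexical_ranking = ["d1", "d4", "d3"]   # from BM25
print(reciprocal_rank_fusion([dense_ranking, lexical_ranking]))
# -> ['d1', 'd3', 'd4', 'd2']
```

The constant k dampens the influence of top ranks so that a single retriever's first place does not dominate the fused list.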
Beyond the Primitive
Vector similarity is the primitive, but it is not the ceiling. Graph-based approaches like Microsoft’s GraphRAG4 layer relational structure on top of embeddings: entities connect to other entities, and similarity propagates through those connections. Knowledge graphs encode explicit relationships that pure vector proximity cannot capture. A document mentioning “Apple” near “iPhone” lives in a different part of the graph than one saying “Apple” near “orchard,” even if their embeddings overlap. These approaches do not replace vector similarity; they augment it. Vector search finds candidates fast; graph structure adds context that resolves ambiguity. Hybrid systems combining dense retrieval, sparse retrieval, and graph traversal are becoming the new baseline for production RAG.
Recent research has surfaced deeper issues with the primitive itself. Work from Netflix researchers5 shows that cosine similarity on embeddings trained with specific regularization can yield arbitrary results. The learned representations have degrees of freedom that the training objective does not constrain, and cosine similarity is sensitive to them in ways that dot products are not. Subsequent analysis confirms these findings: when models are trained with dot-product objectives, the scaling of latent dimensions becomes arbitrary, making cosine similarity measurements non-unique or meaningless. The implication is clear: the metric used for inference should align with what the model was optimized for. Blindly using cosine similarity because it is conventional may not be safe.
As RAG becomes infrastructure, these foundational questions grow more pressing. Similarity is not a natural fact we discover. It is a choice we make, encoded in training objectives, embedding architectures, and distance metrics. The difference between a retrieval system that works and one that silently fails often comes down to whether that choice was made with intention or inherited by default.
This is why studying the fundamentals matters. Complex systems like RAG pipelines emerge from simple primitives, such as vector similarity. Understanding the primitive, its assumptions and limitations, is how we gain leverage over the larger system. The math is not hard. The insight is knowing that it matters.
Footnotes
1. The debate intensified after Google Gemini 1.5 Pro introduced a 1M token context window in 2024, with some declaring “RAG is dead.” See RAG vs. Long-Context Models for an overview of the ongoing discussion.
2. Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP 2021. This work demonstrates how contrastive objectives produce effective sentence embeddings.
3. Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
4. Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research.
5. Steck, H., Ekanadham, C., & Kallus, N. (2024). Is Cosine-Similarity of Embeddings Really About Similarity? arXiv preprint.