RAG and Vector Search
Grounding in external knowledge
The Knowledge Problem
Every language model has a fundamental limitation: its knowledge is frozen at training time.
Ask GPT-4 about events from yesterday, and it draws a blank. Ask about your company's internal documentation, and it has no idea. Ask about a recent scientific paper, and it might hallucinate something that sounds plausible but is completely fabricated.
Training happens once; the world keeps changing. The model's weights encode everything it learned during training, but those weights don't update as new information emerges. It's like an expert who read everything up to a certain date, then went into a cave and stopped learning.
The problem compounds with hallucination. When a model doesn't know something, it doesn't say "I don't know." It generates confident-sounding text that may be entirely false. The same capability that lets it write fluently also lets it fabricate fluently. There's no bright line in the model's behavior between "retrieving a fact" and "generating plausible fiction."
Here's the key idea: what if models could look things up?
Instead of relying solely on parametric knowledge (facts encoded in weights), we give the model access to external documents at query time. The model becomes a reasoning engine that can cite sources, rather than a closed system making things up.
This is Retrieval-Augmented Generation—RAG for short. It's become the dominant pattern for building knowledge-intensive AI systems.
Embeddings for Search
To find relevant documents, we need a way to measure meaning. Traditional keyword search looks for exact matches: does this document contain the word "transformer"? But meaning goes deeper than words.
"The attention mechanism allows tokens to interact" and "Tokens communicate through attention" mean the same thing, yet share few words. A user asking about "neural network building blocks" should find documents about "layers and components," even without exact term overlap.
Embedding models solve this by mapping text to vectors. Each document becomes a point in high-dimensional space—typically 768 or 1536 dimensions. The key property: similar meanings map to nearby points.
The famous word-analogy example captures the geometric nature of meaning: vector("king") − vector("man") + vector("woman") lands near vector("queen"). Relationships between concepts become directions in space. "King is to man as queen is to woman" becomes a parallel movement along the same direction.
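The analogy can be sketched with toy two-dimensional vectors. The coordinates below are made up so the arithmetic works out; real embedding models learn hundreds of dimensions from data:

```python
# Toy 2D "embeddings" hand-picked for illustration; real models learn
# these coordinates from massive text corpora.
vectors = {
    "man":   (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king":  (3.0, 0.0),
    "queen": (3.0, 1.0),
}

def analogy(a: str, b: str, c: str) -> str:
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    ax, ay = vectors[a]
    bx, by = vectors[b]
    cx, cy = vectors[c]
    target = (ax - bx + cx, ay - by + cy)
    # Exclude the three input words so the answer is a fourth word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(
        candidates,
        key=lambda w: (vectors[w][0] - target[0]) ** 2
                    + (vectors[w][1] - target[1]) ** 2,
    )

print(analogy("king", "man", "woman"))  # → queen
```

The "royalty" direction (+2 on the first axis) and the "gender" direction (+1 on the second axis) are independent, which is exactly what makes the analogy a parallel movement.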
For search, this is transformative. Instead of matching keywords, you embed the query into the same space as your documents, then find the nearest neighbors. The query "how do transformers handle long sequences?" will find documents about "attention mechanisms and context length," even if those exact words don't appear in the query.
Interactive: Vector Search in Embedding Space
Drag the query point to search different regions. Documents cluster by topic; semantically similar content lives nearby in embedding space.
Distance in embedding space corresponds to semantic similarity. Cosine similarity—the angle between vectors—is the standard metric:
Two vectors pointing in the same direction have similarity 1. Orthogonal vectors have similarity 0. Opposite directions give -1. For normalized vectors (unit length), this is equivalent to the dot product.
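A minimal implementation makes the definition and the dot-product shortcut concrete. This is plain Python for illustration; production systems use vectorized libraries:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # → 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # → 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # → -1.0 (opposite)
```

For unit-length vectors both norms are 1, so the whole expression collapses to the dot product — which is why embedding APIs often return pre-normalized vectors.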
Modern embedding models are themselves transformers, trained on massive datasets of similar and dissimilar text pairs. They've learned remarkably nuanced representations—distinguishing "bank" (financial) from "bank" (river), understanding that "car" is more similar to "automobile" than to "vehicle," capturing the subtle differences between "happy," "joyful," and "ecstatic."
Vector Databases
Embedding a million documents gives you a million 768-dimensional vectors. To find the 10 nearest neighbors for a query, do you compare against all million?
For small collections, yes—brute force works fine. But at scale (billions of vectors), exhaustive search becomes impossibly slow. Vector databases solve this with specialized index structures.
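Brute force is just "score every vector, keep the top k". A sketch, assuming vectors are already normalized so the dot product equals cosine similarity:

```python
def top_k(query, vectors, k=3):
    """Exhaustive nearest-neighbor search: O(n * d) work per query."""
    scored = [
        (doc_id, sum(q * v for q, v in zip(query, vec)))
        for doc_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" for illustration.
docs = {
    "attention": (0.9, 0.1, 0.0),
    "cooking":   (0.0, 0.1, 0.9),
    "layers":    (0.8, 0.3, 0.1),
}
print(top_k((1.0, 0.0, 0.0), docs, k=2))
```

The linear scan is exact and perfectly adequate for thousands of vectors; the index structures below exist to avoid it at millions or billions.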
HNSW (Hierarchical Navigable Small World) is the most popular algorithm. It builds a layered graph where nearby vectors are connected. To search, you enter at the sparse top layer and greedily walk toward the query, hopping to whichever neighbor is closest, then descend layer by layer; the dense bottom layer refines the result. The multiple layers with different granularities are what make this efficient.
IVF (Inverted File Index) clusters vectors into groups. At query time, you only search within the nearest clusters, dramatically reducing comparisons. It's less accurate than HNSW but uses less memory.
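The IVF idea fits in a few lines: assign each vector to its nearest centroid offline, then probe only the closest cluster at query time. The centroids here are hand-picked for illustration; real systems learn them with k-means:

```python
def nearest(point, candidates):
    """Index of the candidate closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cand)) for cand in candidates]
    return dists.index(min(dists))

centroids = [(1.0, 0.0), (0.0, 1.0)]  # hand-picked; normally learned via k-means
vectors = {"a": (0.9, 0.1), "b": (0.8, 0.2), "c": (0.1, 0.9)}

# Offline: build the inverted lists (cluster id -> member doc ids).
inverted = {i: [] for i in range(len(centroids))}
for doc_id, vec in vectors.items():
    inverted[nearest(vec, centroids)].append(doc_id)

# Online: search only within the query's nearest cluster.
query = (1.0, 0.05)
cluster = nearest(query, centroids)
candidates = inverted[cluster]  # "c" lives in the other cluster, never scored
best = min(candidates,
           key=lambda d: sum((q - v) ** 2 for q, v in zip(query, vectors[d])))
print(best)  # → a
```

The approximation error comes from exactly this pruning: a true neighbor sitting just across a cluster boundary is missed unless you probe more than one cluster.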
The key trade-off: approximate nearest neighbors. These algorithms don't guarantee finding the true closest vectors. They find vectors that are probably close, with tunable accuracy. In practice, 95-99% recall is typical—meaning you find 95-99% of the vectors you would have found with exhaustive search, but 100x faster.
Popular vector databases:
- Pinecone: Fully managed, serverless scaling
- Weaviate: Open source, rich query language
- Chroma: Simple, Python-native, great for prototyping
- Milvus: High performance, distributed
- pgvector: PostgreSQL extension for existing infrastructure
The choice depends on scale, infrastructure preferences, and whether you need additional features like metadata filtering or hybrid search (combining semantic with keyword matching).
The RAG Pipeline
With embeddings and vector storage understood, here's how RAG comes together:
Indexing (offline, once per corpus):
- Chunk: Split documents into passages (typically 200-500 tokens). Too short loses context; too long dilutes relevance.
- Embed: Pass each chunk through the embedding model to get vectors
- Store: Insert vectors and metadata into the vector database
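A word-level chunker with overlap is a common starting point for the chunking step. This sketch counts words rather than tokens; real pipelines use the embedding model's tokenizer, and it assumes overlap is smaller than chunk_size:

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks (assumes overlap < chunk_size)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks

parts = chunk_words("one two three four five six seven eight",
                    chunk_size=4, overlap=2)
print(parts)  # → ['one two three four', 'three four five six', 'five six seven eight']
```

The overlap is deliberate: a fact that straddles a chunk boundary appears whole in at least one chunk, at the cost of some duplicated storage.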
Retrieval (online, per query):
- Embed query: Same embedding model transforms the user's question into a vector
- Search: Find the k nearest chunks (typically k=5-20)
- Rank: Optionally re-rank results with a more expensive cross-encoder model
Generation (online, per query):
- Construct prompt: Inject retrieved chunks as context
- Generate: LLM answers using the grounded information
- Cite: Include source references in the response
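Glued together, the online half of the pipeline is short. In this sketch embed_fn and llm_fn are hypothetical placeholders; in practice they would wrap a real embedding model and a chat-completion API:

```python
def rag_answer(question, index, embed_fn, llm_fn, k=3):
    """Retrieve the k most similar chunks, then generate a grounded answer.

    index: list of {"text": str, "vector": tuple} entries.
    embed_fn / llm_fn: hypothetical stand-ins for real model calls.
    """
    q_vec = embed_fn(question)
    # Score every stored chunk by dot product. Fine at small scale; a vector
    # database with an ANN index replaces this loop in production.
    scored = sorted(
        index,
        key=lambda item: sum(a * b for a, b in zip(q_vec, item["vector"])),
        reverse=True,
    )
    context = "\n\n".join(item["text"] for item in scored[:k])
    prompt = (f"Context:\n{context}\n\n"
              f"Question: {question}\nAnswer based on the context above:")
    return llm_fn(prompt)
```

Swapping the document set means rebuilding only `index`; the two model functions never change, which is the practical payoff of separating indexing from retrieval.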
Interactive: The Complete RAG Pipeline
RAG separates indexing (done once) from retrieval (done per query). This lets knowledge update without retraining the model.
The prompt typically looks like:
Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]
Question: {user query}
Answer based on the context above:

The LLM now has two sources of knowledge: its parametric knowledge (from training) and the retrieved context (from your documents). When they conflict, you want it to prefer the context. Instruction tuning and prompt engineering encourage this, though it's not guaranteed.
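A prompt builder along these lines makes the grounding and citation instructions explicit. The exact wording is prompt engineering, not a fixed API, so treat it as one reasonable variant:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt with numbered sources for citation.

    chunks: list of {"source": str, "text": str} retrieved passages.
    """
    context = "\n".join(
        f"[{i + 1}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("What is the return policy?",
                   [{"source": "policy.md",
                     "text": "Returns accepted within 30 days."}]))
```

Numbering the chunks is what makes citations checkable: the model's "[1]" maps back to a specific source document, giving you the traceability that pure parametric generation lacks.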
RAG elegantly solves the knowledge problem. Your documents can update daily, and the next query automatically uses the fresh versions. No retraining required. The model's cutoff date becomes irrelevant—it reads current documents.
RAG vs Fine-Tuning
When should you use RAG, and when should you fine-tune?
RAG excels when:
- Knowledge changes frequently (news, prices, inventory)
- You need citations and traceability
- You have large document collections
- Different users need different knowledge
- You want to update without retraining
Fine-tuning excels when:
- You need to change behavior, not just knowledge
- The task requires consistent style or format
- Latency is critical (RAG adds retrieval time)
- Knowledge is stable and well-defined
- You need the model to "just know" without context
Consider a customer support bot. RAG works great for "What's the return policy?"—retrieve the policy document, answer from it. But if you need the bot to maintain a specific tone, handle escalations gracefully, or respond in a particular format, that's behavioral—fine-tuning territory.
The best systems often combine both. Fine-tune for behavior (how to respond), RAG for knowledge (what to say). The fine-tuned model learns to effectively use retrieved context, while RAG provides the actual information.
One subtle point: fine-tuning can actually improve RAG. A model fine-tuned on your domain better understands your vocabulary, generates more relevant queries, and synthesizes retrieved information more coherently. The two approaches enhance each other.
Key Takeaways
- LLM knowledge freezes at training time; hallucination fills gaps with confident fabrications
- RAG lets models "look things up" by retrieving relevant documents at query time
- Embedding models map text to vectors where similar meanings are close together
- Vector databases enable fast approximate nearest-neighbor search at scale (HNSW, IVF)
- The RAG pipeline: chunk documents, embed, store; then at query time: embed query, retrieve, generate with context
- RAG is ideal for dynamic knowledge; fine-tuning is ideal for behavioral changes—often you want both
- RAG provides citations and traceability—you can see exactly which documents informed the answer