Quantization

Product quantization and how to compress vectors 32x with minimal accuracy loss

A 768-dimensional float32 embedding uses 3 KB of memory. At a billion vectors, that is 3 TB. Memory is expensive, and cache efficiency matters for speed. Quantization compresses vectors, trading precision for space and often improving performance.

Scalar Quantization

The simplest approach: reduce precision per dimension.

[Interactive demo: scalar quantization. Shows an original and quantized value at a chosen bit width, along with the per-dimension error, memory, and compression ratio.]

float32 → float16: 2x compression, minimal accuracy loss. Widely supported in GPUs and modern CPUs.

float32 → int8: 4x compression. Quantize each dimension to 8 bits using learned scale and zero-point. Accuracy loss is small for well-trained embeddings.

float32 → binary: 32x compression. Each dimension becomes 1 bit (sign only). Significant accuracy loss, but enables extremely fast Hamming distance computations.
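The int8 variant can be sketched in a few lines of NumPy. For simplicity this uses per-vector min/max calibration to set the scale and zero-point; production systems often fit these on training data ("learned" calibration), but the arithmetic is the same:

```python
import numpy as np

def int8_quantize(x):
    """Map a float32 vector to uint8 using min/max calibration."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0             # guard against constant vectors
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo                          # zero_point = lo

def int8_dequantize(q, scale, zero_point):
    return q.astype(np.float32) * scale + zero_point

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)
q, scale, zp = int8_quantize(v)
v_hat = int8_dequantize(q, scale, zp)
print(v.nbytes, "->", q.nbytes)                  # 3072 -> 768 (4x compression)
```

The round-trip error per dimension is bounded by half a quantization step, which is why accuracy loss stays small for well-behaved embeddings.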

Scalar quantization is orthogonal to index structure—you can combine it with HNSW or any other algorithm.

Product Quantization

Product quantization (PQ) achieves higher compression by exploiting structure in the embedding space.

[Interactive demo: product quantization. Shows a 768-dimensional vector split into 8 subspaces of 96 dimensions each, compressing 3,072 bytes to 8 one-byte codes (384x).]

How it works: Each subspace gets a codebook of 256 centroids. Each subvector is replaced by its nearest centroid's index (8 bits).

The idea:

  1. Split the 768-dimensional vector into M subvectors (e.g., 96 subvectors of 8 dimensions each)
  2. For each subvector, learn a codebook of K centroids (e.g., K=256)
  3. Replace each subvector with the index of its nearest centroid

With M=96 subspaces and K=256 centroids:

  • Original size: 768 × 4 = 3072 bytes
  • Compressed size: 96 × 1 = 96 bytes (each index fits in 1 byte)
  • Compression ratio: 32x
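The three steps above can be sketched as follows. Toy sizes (64 dimensions, M=8 subspaces, K=64 centroids) keep it fast, and a tiny Lloyd's k-means stands in for a production trainer such as faiss's; the same code works with the article's 768/96/256 configuration:

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Tiny Lloyd's k-means; a real system would use a tuned implementation."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        nearest = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            members = data[nearest == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

def pq_train(X, M, K):
    """Step 2: learn one K-centroid codebook per subspace. X: (n, d), d % M == 0."""
    sub = X.shape[1] // M
    return np.stack([kmeans(X[:, m * sub:(m + 1) * sub], K) for m in range(M)])

def pq_encode(X, codebooks):
    """Steps 1 and 3: split into subvectors, store each nearest-centroid index."""
    M, K, sub = codebooks.shape
    codes = np.empty((len(X), M), dtype=np.uint8)
    for m in range(M):
        seg = X[:, m * sub:(m + 1) * sub]
        codes[:, m] = ((seg[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1).argmin(1)
    return codes

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64)).astype(np.float32)
codebooks = pq_train(X, M=8, K=64)
codes = pq_encode(X, codebooks)
print(X[0].nbytes, "->", codes[0].nbytes)        # 256 -> 8 bytes per vector (32x)
```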

At query time, distances are computed approximately using precomputed distance tables.

How PQ Distance Computation Works

For each query:

  1. Compute distances from the query's subvectors to all centroids in each codebook
  2. Store these in a lookup table (M × K values)
  3. For each database vector, sum up the precomputed distances for its centroid indices

This is surprisingly fast:

  • The lookup table fits in L1 cache
  • Distance computation becomes M table lookups and additions
  • SIMD can parallelize across database vectors
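A sketch of this asymmetric distance computation, with random codebooks and codes standing in for a trained index:

```python
import numpy as np

def adc_distances(query, codebooks, codes):
    """Approximate squared L2 distances to all encoded vectors.

    codebooks: (M, K, sub) centroids; codes: (n, M) uint8 centroid indices.
    """
    M, K, sub = codebooks.shape
    # Steps 1-2: distance from each query subvector to every centroid -> (M, K) table
    table = np.stack([
        ((query[m * sub:(m + 1) * sub][None, :] - codebooks[m]) ** 2).sum(-1)
        for m in range(M)
    ])
    # Step 3: per database vector, M table lookups summed (vectorized here)
    return table[np.arange(M)[None, :], codes].sum(axis=1)

rng = np.random.default_rng(0)
M, K, sub = 8, 256, 4
codebooks = rng.standard_normal((M, K, sub)).astype(np.float32)
codes = rng.integers(0, K, size=(1000, M)).astype(np.uint8)
query = rng.standard_normal(M * sub).astype(np.float32)
approx = adc_distances(query, codebooks, codes)

# Sanity check: against the reconstructed vectors themselves, ADC is exact.
recon = codebooks[np.arange(M)[None, :], codes].reshape(len(codes), -1)
assert np.allclose(approx, ((query[None, :] - recon) ** 2).sum(-1), atol=1e-3)
```

The only approximation error relative to the true distance comes from quantization itself; the table-lookup step is exact for the reconstructed vectors.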

IVF-PQ: Combining Clustering with Quantization

Most vector databases use IVF-PQ:

  1. Coarse quantization (IVF): Cluster vectors into buckets. At query time, search only nearby buckets.

  2. Fine quantization (PQ): Within each bucket, store PQ-compressed residuals (the difference between each vector and its bucket centroid).

This provides both fast candidate filtering (IVF) and memory efficiency (PQ).
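The two stages can be sketched as follows. For brevity, sampled points stand in for trained coarse centroids and the residuals are kept in full precision where a real index would PQ-encode them; the bucket-then-score flow is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64)).astype(np.float32)

# Coarse quantizer: sampled points stand in for trained k-means centroids.
ncells = 16
coarse = X[rng.choice(len(X), ncells, replace=False)]

# Index build: assign each vector to its nearest cell and store the residual.
# (A real index would PQ-encode the residuals instead of keeping floats.)
assign = ((X[:, None, :] - coarse[None, :, :]) ** 2).sum(-1).argmin(1)
residuals = X - coarse[assign]
cells = {c: np.flatnonzero(assign == c) for c in range(ncells)}

def ivf_search(q, nprobe=4, k=5):
    # Stage 1 (IVF): probe only the nprobe cells closest to the query.
    probe = np.argsort(((q[None, :] - coarse) ** 2).sum(-1))[:nprobe]
    ids = np.concatenate([cells[c] for c in probe])
    # Stage 2 (PQ in a real index): score candidates via centroid + residual.
    recon = coarse[assign[ids]] + residuals[ids]
    dist = ((q[None, :] - recon) ** 2).sum(-1)
    return ids[np.argsort(dist)[:k]]

query = rng.standard_normal(64).astype(np.float32)
print(ivf_search(query))   # ids of 5 approximate nearest neighbors
```

Raising `nprobe` trades speed for recall; probing every cell degenerates to exact search.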

Optimized Product Quantization (OPQ)

Standard PQ quantizes each subspace independently, which works best when information is spread evenly and dimensions are uncorrelated across subspaces. In practice, embedding dimensions are correlated. OPQ learns a rotation matrix that decorrelates dimensions and balances variance across subspaces before quantization.

This improves accuracy at no additional storage cost: database vectors are rotated once at index time, and only the query needs rotating at query time.

Trade-offs

[Interactive demo: quantization trade-offs. Shows Recall@10, memory at 1B vectors, and search speedup for a chosen point on the recall/memory spectrum.]
| Method | Compression | Recall Loss | Use Case |
| --- | --- | --- | --- |
| float32 | 1x | 0% | Baseline, small datasets |
| float16 | 2x | <1% | GPU inference |
| int8 | 4x | 1-3% | CPU inference, moderate scale |
| PQ (M=96) | 32x | 5-10% | Billion-scale datasets |
| Binary | 32x | 15-25% | Extreme scale, first-pass filtering |

The choice depends on your scale and accuracy requirements.

Quantization-Aware Retrieval

Modern approaches train embedding models with quantization in mind:

Straight-through estimators: Backpropagate through quantization during training.

Matryoshka embeddings: Train embeddings where truncated versions (first 256 dims, first 128, etc.) remain effective. Use full embeddings for reranking.

Binary embedding models: Train specifically for binary quantization, optimizing for Hamming distance.
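Using a Matryoshka embedding at query time is just truncation plus re-normalization (a sketch, assuming the model was trained Matryoshka-style so the leading dimensions carry most of the signal):

```python
import numpy as np

def matryoshka_truncate(x, dims):
    """Keep the first `dims` coordinates and re-normalize so cosine
    similarity still behaves on the truncated embedding."""
    t = np.asarray(x, dtype=np.float32)[:dims]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(0)
full = rng.standard_normal(768).astype(np.float32)
short = matryoshka_truncate(full, 256)
print(full.nbytes, "->", short.nbytes)           # 3072 -> 1024 (3x compression)
```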

Compression Comparison

| Method | Compression | Memory (1B vecs) | Recall Loss |
| --- | --- | --- | --- |
| float32 | 1x | 3.0 TB | ~0% |
| float16 | 2x | 1.5 TB | ~0.5% |
| int8 | 4x | 750 GB | ~3% |
| PQ-96 | 32x | 94 GB | ~8% |
| Binary | 32x | 94 GB | ~20% |

Pro tip: Use PQ for first-pass retrieval, then rerank top-100 against full-precision vectors. This recovers most of the recall loss while keeping memory efficient.

For 1 billion 768-dimensional vectors:

| Method | Memory | Recall@10 |
| --- | --- | --- |
| float32 | 3 TB | 100% |
| float16 | 1.5 TB | 99.5% |
| int8 | 750 GB | 97% |
| PQ (32x) | 94 GB | 92% |
| PQ + rerank | 94 GB + 300 GB | 98% |

The PQ + rerank pattern: store PQ-compressed vectors for fast first-pass search, then rerank top candidates against full-precision vectors stored on disk.
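The pattern can be sketched as follows. Sign-bit codes stand in for PQ codes in the first pass (with PQ you would rank candidates by ADC distance rather than Hamming distance), but the candidate-then-rerank flow is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

# First pass on compressed codes: rank everything by Hamming distance
# between sign patterns (cheap, approximate).
Xb = X > 0
qb = query > 0
hamming = (Xb != qb[None, :]).sum(1)
candidates = np.argsort(hamming)[:100]           # approximate top-100

# Rerank candidates against full-precision vectors (on disk in practice).
exact = ((X[candidates] - query[None, :]) ** 2).sum(1)
top10 = candidates[np.argsort(exact)[:10]]
```

Only 100 full-precision vectors are touched per query, so the expensive store can live on slower storage.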

Practical Recommendations

  1. Start with int8: 4x compression with minimal accuracy loss. Widely supported.

  2. Use PQ for billions of vectors: When memory is the constraint, PQ enables billion-scale search on commodity hardware.

  3. Combine with reranking: Retrieve 100 candidates with PQ, rerank with full-precision vectors for the final top 10.

  4. Benchmark on your data: Compression effects vary by embedding model and domain.

  5. Consider Matryoshka embeddings: If your model supports it, truncation provides smooth compression/accuracy trade-off.

Key Takeaways

  • Scalar quantization reduces precision per dimension: float32 → int8 gives 4x compression
  • Product quantization splits vectors into subvectors, replaces each with a codebook index: up to 32x compression
  • IVF-PQ combines clustering (fast candidate filtering) with PQ (memory efficiency)
  • Trade-off: higher compression → lower recall → faster search
  • PQ + reranking recovers accuracy: fast first-pass with compressed vectors, precise rerank with full vectors
  • Modern embeddings are trained with quantization in mind (Matryoshka, binary-aware)