Filtering and Metadata

Pre-filtering, post-filtering, and hybrid approaches

Real-world search is rarely pure similarity. You want documents from the last 30 days, in English, from a specific tenant. These metadata filters interact with vector search in non-obvious ways. The approach matters for both correctness and performance.

The Filtering Problem

Consider: "Find the 10 most similar documents, but only from tenant_id = 'acme'."

The naive approach: retrieve the top 10 by similarity, then filter. But what if the top 100 results are all from other tenants? You return fewer than 10 results—or none.

Interactive: Filtering approaches

Filter selectivity: 30% match0.3

Vectors searched

Results returned

10/10

Wasted computation

Post-filter: With 30% selectivity, most results pass. Works well here.

Vector databases must balance:

Correctness: Return exactly k results matching the filter
Efficiency: Don't scan the entire database
Quality: Return the most similar results

Pre-filtering vs Post-filtering

Pre-filtering: Apply metadata filters first, then search only within matching vectors.

Pros:

Guarantees k results (if enough match)
Semantically correct

Cons:

Must build filter-specific indices
Performance degrades with selective filters

Post-filtering: Search all vectors, then filter results.

Pros:

Simple to implement
Uses standard vector index

Cons:

May return fewer than k results
Wastes computation on filtered-out results

Filter Selectivity

The key factor is selectivity: what fraction of vectors pass the filter?

Interactive: Filter selectivity effects

Category filter

5 categories, ~20% each

Selectivity

20.00%

Post-filter fetch needed

50 vectors

Feasible

Rule of thumb: If selectivity is below 1%, post-filtering becomes impractical. Pre-filtering or partitioning is required.

High selectivity (>10% pass): Post-filtering works well. Retrieve 10k candidates, filter, keep top k.

Low selectivity (<1% pass): Pre-filtering is essential. Post-filtering would require retrieving far too many candidates.

Very low selectivity (<0.01%): Specialized indices needed (e.g., per-tenant partitions).

Hybrid Filtering

Modern vector databases use hybrid approaches:

Hybrid filtering strategies

Adaptive

Choose pre/post based on estimated selectivity

Pros

+ Best of both worlds
+ Automatic optimization

Cons

- Requires selectivity estimation
- More complex implementation

Adaptive strategy: Choose pre- vs post-filtering based on estimated selectivity.

Filtered HNSW: Modify graph traversal to skip non-matching nodes. Requires filter-aware index.

Partitioned indices: Build separate vector indices per partition (e.g., per tenant). Query only relevant partitions.

Bitmap intersection: Maintain bitmaps for metadata values. Intersect with vector search candidates.

Metadata Index Design

Metadata indexing strategies

Field	Index Type	Cardinality	Operations
tenant_id	Hash Index	Low	= only
category	B-tree	Low-Medium	=, IN
timestamp	Range Tree	High	<, >, BETWEEN
tags	Inverted Index	Medium	CONTAINS
price	B-tree	High	<, >, =

Index selection guide

Equality filters: Hash or B-tree index
Range filters: B-tree or specialized range tree
Set membership: Inverted index for tags/keywords
High cardinality + equality: Consider partitioning instead

Efficient filtering requires indexing metadata:

Scalar fields (tenant_id, category):

B-tree or hash indices for equality
Fast lookup, supports compound filters

Numeric ranges (timestamp, price):

Range tree or sorted index
Efficient for "greater than" / "less than"

Text fields (tags, keywords):

Inverted index
Supports "contains" and keyword matching

High-cardinality fields:

Consider partitioning instead of filtering
Bitmap indices for low cardinality

Common Patterns

Multi-tenant Search

Each tenant's data should be isolated. Options:

Separate indices per tenant: Strongest isolation, highest overhead
Partition key in index: Single index, query filtered by tenant
Post-filter with tenant check: Simple but wastes computation

Choose based on tenant count and size distribution.

Time-windowed Search

"Documents from the last 7 days" is common. Options:

Timestamp filter: Works if selectivity is reasonable
Rolling partitions: Daily/weekly indices, query recent partitions
TTL with index rebuild: Automatically expire old vectors

Category Filtering

"Only category = 'electronics'" with dozens of categories. Options:

Pre-filter with category bitmap: Fast for any selectivity
Per-category indices: If categories are large and stable
Hybrid: Use vector search, filter, oversample if needed

Performance Considerations

Filter pushdown: Apply filters as early as possible in the query plan.

Index intersection: Combine vector search candidates with metadata filter results.

Caching: Cache frequent filter results (tenant bitmaps, category lists).

Partition pruning: Skip entire index partitions that cannot match.

Over-fetching: Retrieve more candidates than needed to ensure k results after filtering.

When to Partition vs Filter

Factor	Partition	Filter
Cardinality	Low (<100)	Any
Data size per partition	Large	Any
Query patterns	Usually single partition	Mixed
Update frequency	Stable	Frequent changes
Isolation requirements	Strict (multi-tenant)	Relaxed

Partitioning adds complexity but enables:

Independent scaling per partition
Strict data isolation
Simplified filtering (no runtime check)

Key Takeaways

Post-filtering is simple but fails with selective filters (returns fewer than k results)
Pre-filtering guarantees k results but requires filter-aware indices
Filter selectivity determines which approach works: high selectivity → post-filter, low → pre-filter
Hybrid approaches adapt based on estimated selectivity
Index metadata fields appropriately: B-trees for scalars, range trees for numbers, inverted indices for text
Consider partitioning for multi-tenant or time-series data with clear access patterns
Over-fetch to compensate for post-filtering; push filters down early for performance