Filtering and Metadata
Pre-filtering, post-filtering, and hybrid approaches
Real-world search is rarely pure similarity. You want documents from the last 30 days, in English, from a specific tenant. These metadata filters interact with vector search in non-obvious ways. The approach matters for both correctness and performance.
The Filtering Problem
Consider: "Find the 10 most similar documents, but only from tenant_id = 'acme'."
The naive approach: retrieve the top 10 by similarity, then filter. But what if the top 100 results are all from other tenants? You return fewer than 10 results—or none.
Interactive: Filtering approaches
Post-filter: With 30% selectivity, most results pass. Works well here.
Vector databases must balance:
- Correctness: Return exactly k results matching the filter
- Efficiency: Don't scan the entire database
- Quality: Return the most similar results
Pre-filtering vs Post-filtering
Pre-filtering: Apply metadata filters first, then search only within matching vectors.
Pros:
- Guarantees k results (if enough match)
- Semantically correct
Cons:
- Must build filter-specific indices
- Performance degrades with selective filters
Post-filtering: Search all vectors, then filter results.
Pros:
- Simple to implement
- Uses standard vector index
Cons:
- May return fewer than k results
- Wastes computation on filtered-out results
Filter Selectivity
The key factor is selectivity: what fraction of vectors pass the filter?
Interactive: Filter selectivity effects
Category filter
5 categories, ~20% each
Rule of thumb: If selectivity is below 1%, post-filtering becomes impractical. Pre-filtering or partitioning is required.
High selectivity (>10% pass): Post-filtering works well. Retrieve 10k candidates, filter, keep top k.
Low selectivity (<1% pass): Pre-filtering is essential. Post-filtering would require retrieving far too many candidates.
Very low selectivity (<0.01%): Specialized indices needed (e.g., per-tenant partitions).
Hybrid Filtering
Modern vector databases use hybrid approaches:
Hybrid filtering strategies
Adaptive
Choose pre/post based on estimated selectivity
- + Best of both worlds
- + Automatic optimization
- - Requires selectivity estimation
- - More complex implementation
Adaptive strategy: Choose pre- vs post-filtering based on estimated selectivity.
Filtered HNSW: Modify graph traversal to skip non-matching nodes. Requires filter-aware index.
Partitioned indices: Build separate vector indices per partition (e.g., per tenant). Query only relevant partitions.
Bitmap intersection: Maintain bitmaps for metadata values. Intersect with vector search candidates.
Metadata Index Design
Metadata indexing strategies
| Field | Index Type | Cardinality | Operations |
|---|---|---|---|
| tenant_id | Hash Index | Low | = only |
| category | B-tree | Low-Medium | =, IN |
| timestamp | Range Tree | High | <, >, BETWEEN |
| tags | Inverted Index | Medium | CONTAINS |
| price | B-tree | High | <, >, = |
Index selection guide
- Equality filters: Hash or B-tree index
- Range filters: B-tree or specialized range tree
- Set membership: Inverted index for tags/keywords
- High cardinality + equality: Consider partitioning instead
Efficient filtering requires indexing metadata:
Scalar fields (tenant_id, category):
- B-tree or hash indices for equality
- Fast lookup, supports compound filters
Numeric ranges (timestamp, price):
- Range tree or sorted index
- Efficient for "greater than" / "less than"
Text fields (tags, keywords):
- Inverted index
- Supports "contains" and keyword matching
High-cardinality fields:
- Consider partitioning instead of filtering
- Bitmap indices for low cardinality
Common Patterns
Multi-tenant Search
Each tenant's data should be isolated. Options:
- Separate indices per tenant: Strongest isolation, highest overhead
- Partition key in index: Single index, query filtered by tenant
- Post-filter with tenant check: Simple but wastes computation
Choose based on tenant count and size distribution.
Time-windowed Search
"Documents from the last 7 days" is common. Options:
- Timestamp filter: Works if selectivity is reasonable
- Rolling partitions: Daily/weekly indices, query recent partitions
- TTL with index rebuild: Automatically expire old vectors
Category Filtering
"Only category = 'electronics'" with dozens of categories. Options:
- Pre-filter with category bitmap: Fast for any selectivity
- Per-category indices: If categories are large and stable
- Hybrid: Use vector search, filter, oversample if needed
Performance Considerations
Filter pushdown: Apply filters as early as possible in the query plan.
Index intersection: Combine vector search candidates with metadata filter results.
Caching: Cache frequent filter results (tenant bitmaps, category lists).
Partition pruning: Skip entire index partitions that cannot match.
Over-fetching: Retrieve more candidates than needed to ensure k results after filtering.
When to Partition vs Filter
| Factor | Partition | Filter |
|---|---|---|
| Cardinality | Low (<100) | Any |
| Data size per partition | Large | Any |
| Query patterns | Usually single partition | Mixed |
| Update frequency | Stable | Frequent changes |
| Isolation requirements | Strict (multi-tenant) | Relaxed |
Partitioning adds complexity but enables:
- Independent scaling per partition
- Strict data isolation
- Simplified filtering (no runtime check)
Key Takeaways
- Post-filtering is simple but fails with selective filters (returns fewer than k results)
- Pre-filtering guarantees k results but requires filter-aware indices
- Filter selectivity determines which approach works: high selectivity → post-filter, low → pre-filter
- Hybrid approaches adapt based on estimated selectivity
- Index metadata fields appropriately: B-trees for scalars, range trees for numbers, inverted indices for text
- Consider partitioning for multi-tenant or time-series data with clear access patterns
- Over-fetch to compensate for post-filtering; push filters down early for performance