HOW-TO · RAG
How to Implement Metadata Filtering in ChromaDB
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
ChromaDB with documents containing metadata
What this does
ChromaDB supports filtering retrieved documents by metadata key-value pairs at query time. This guide explains how to structure metadata when adding documents and how to apply filters such as equality, range comparisons, and compound conditions during similarity search.
Steps
Add documents with structured metadata. Use flat key-value pairs with supported value types (string, int, float, bool).

import chromadb

# Persistent on-disk client; the directory is created if it does not exist
client = chromadb.PersistentClient(path="./filtered_db")
col = client.get_or_create_collection(name="filtered_docs")

col.add(
    ids=["p1", "p2", "p3", "p4"],
    documents=[
        "Deploying ChromaDB on Docker improves scalability.",
        "Metadata filtering speeds up retrieval in large corpora.",
        "Ollama supports quantized models for edge devices.",
        "RAG pipelines reduce hallucinations in LLM outputs.",
    ],
    # One metadata dict per document, in the same order as ids
    metadatas=[
        {"category": "infra", "year": 2024, "rating": 4.5},
        {"category": "rag", "year": 2024, "rating": 4.8},
        {"category": "ai", "year": 2023, "rating": 4.2},
        {"category": "rag", "year": 2025, "rating": 4.9},
    ],
)
print("Indexed:", col.count())

Filter by exact string match.

results = col.query(
    query_texts=["retrieval performance"],
    where={"category": "rag"},  # exact match filter
    n_results=2,
)
# Results are nested per query text; index 0 is the first (only) query
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[{meta['category']}] {doc}")

Filter by numeric range.

results = col.query(
    query_texts=["reliability and quality"],
    where={"rating": {"$gte": 4.5}},  # rating >= 4.5
    n_results=3,
)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[rating: {meta['rating']}] {doc}")

Combine multiple filter conditions. A where dict accepts exactly one top-level key, so wrap multiple conditions in $and (or $or) rather than listing several fields side by side.

results = col.query(
    query_texts=["deployment"],
    where={
        "$and": [
            {"category": "infra"},
            {"rating": {"$gte": 4.0}},
        ]
    },
    n_results=5,
)
print("Filtered results:", len(results["documents"][0]))
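When filters are assembled dynamically, it can help to build the where dict in one place. The sketch below is a hypothetical helper (build_where is not part of ChromaDB) that assumes ChromaDB rejects a where dict with more than one top-level key and, as far as I know, expects $and to receive at least two conditions. It only produces a plain dict and does not touch the database.

```python
from typing import Any, Dict, Optional


def build_where(*conditions: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Combine single-key conditions into one where dict (hypothetical helper)."""
    conds = [c for c in conditions if c]  # drop empty conditions
    for c in conds:
        if len(c) != 1:
            raise ValueError(f"each condition must have exactly one key: {c!r}")
    if not conds:
        return None        # no filter at all
    if len(conds) == 1:
        return conds[0]    # a single condition needs no $and wrapper
    return {"$and": conds}


where = build_where({"category": "infra"}, {"rating": {"$gte": 4.0}})
print(where)
# {'$and': [{'category': 'infra'}, {'rating': {'$gte': 4.0}}]}
```

The result can be passed straight to col.query(..., where=where); when it is None, omit the where argument instead.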
Verification
python3 -c "
import chromadb
c = chromadb.PersistentClient(path='/tmp/chroma_filter_test')
col = c.get_or_create_collection('filt')
col.add(ids=['a','b'], documents=['doc a','doc b'], metadatas=[{'tag':'x'},{'tag':'y'}])
r = col.query(query_texts=['doc'], where={'tag':'x'}, n_results=1)
print('Filter result:', r['documents'][0])
c.delete_collection('filt')
"
# Expected: Filter result: ['doc a']
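To know what a filter should return before comparing against the database, the filter semantics can be emulated in plain Python. This is a rough sketch, not ChromaDB's implementation: matches and expected_ids are hypothetical names, only $eq, $gt, $gte, $lt, $lte, $and, and $or are covered, and a record missing the filtered key is assumed never to match. The records mirror the metadata from the Steps section.

```python
import operator

# Comparison operators covered by this sketch
_COMPARATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(meta, where):
    """Return True if one metadata dict satisfies the where clause."""
    (key, cond), = where.items()  # raises ValueError unless exactly one key
    if key == "$and":
        return all(matches(meta, sub) for sub in cond)
    if key == "$or":
        return any(matches(meta, sub) for sub in cond)
    if key not in meta:
        return False              # assumption: a missing key never matches
    if not isinstance(cond, dict):
        return meta[key] == cond  # shorthand {"field": value} means equality
    (op, operand), = cond.items()
    try:
        return _COMPARATORS[op](meta[key], operand)
    except TypeError:
        return False              # e.g. comparing a string with a number


records = {
    "p1": {"category": "infra", "year": 2024, "rating": 4.5},
    "p2": {"category": "rag", "year": 2024, "rating": 4.8},
    "p3": {"category": "ai", "year": 2023, "rating": 4.2},
    "p4": {"category": "rag", "year": 2025, "rating": 4.9},
}


def expected_ids(where):
    """IDs whose metadata passes the filter, in insertion order."""
    return [rid for rid, meta in records.items() if matches(meta, where)]


print(expected_ids({"category": "rag"}))        # ['p2', 'p4']
print(expected_ids({"rating": {"$gte": 4.5}}))  # ['p1', 'p2', 'p4']
print(expected_ids({"$and": [{"category": "infra"},
                             {"rating": {"$gte": 4.0}}]}))  # ['p1']
```

A query should only ever return IDs from the matching expected set; which of them appear, and in what order, still depends on vector distance and n_results.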
Common failures
- Filter key not in metadata. Applying where={"nonexistent_field": "value"} returns zero results silently rather than raising an error. Check the metadata schema when debugging empty results.
- Wrong filter operator syntax. $gte, $gt, $lte, and $lt must sit inside a dict as the value for a field, not at the top level. Writing {"rating": 4.5} instead of {"rating": {"$gte": 4.5}} performs an equality match, not a range comparison.
- Non-string keys in metadata. Metadata values can be strings, numbers, or bools, but keys must be strings. Passing a non-string key such as {2024: "value"} causes a type error.
- Unscoped queries with no filter. Omitting where on a large collection returns the top n_results matches by vector distance alone, regardless of metadata. Include an explicit filter whenever a query should be scoped.
- Metadata updated without re-embedding. Modifying metadata via col.update(ids=..., metadatas=...) does not change the vector. The document keeps the same embedding, and filters apply to the updated metadata.
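Several of these failures can be caught before anything reaches the database. The sketch below uses two hypothetical helpers (not ChromaDB APIs): validate_metadata checks one metadata dict against the key and value types listed in this guide, and metadata_keys collects the keys actually present so a filter on a nonexistent field can be spotted. It assumes the metadatas list has the shape returned by col.get(include=["metadatas"])["metadatas"], where entries may be None.

```python
# Value types this guide lists as supported
ALLOWED_VALUE_TYPES = (str, int, float, bool)


def validate_metadata(meta):
    """Return a list of problems in one metadata dict (empty list if OK)."""
    problems = []
    for key, value in meta.items():
        if not isinstance(key, str):
            problems.append(
                f"key {key!r} is {type(key).__name__}, expected str"
            )
        if not isinstance(value, ALLOWED_VALUE_TYPES):
            problems.append(
                f"value for {key!r} is {type(value).__name__}, "
                "expected str, int, float, or bool"
            )
    return problems


def metadata_keys(metadatas):
    """Collect every key used across a list of metadata dicts."""
    keys = set()
    for meta in metadatas:
        keys.update(meta or {})  # tolerate None for records without metadata
    return keys


print(validate_metadata({"category": "rag", "year": 2024}))  # []
print(validate_metadata({2024: "value"}))
# ["key 2024 is int, expected str"]

known = metadata_keys([{"category": "rag", "year": 2024}, None, {"rating": 4.8}])
print("nonexistent_field" in known)  # False
```

Running validate_metadata on each dict before col.add, and checking filter keys against metadata_keys when a query comes back empty, covers the first and third failures above.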