HOW-TO · RAG

How to Implement Metadata Filtering in ChromaDB

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

ChromaDB with documents containing metadata

What this does

ChromaDB supports filtering retrieved documents by metadata key-value pairs at query time. This guide explains how to structure metadata when adding documents and how to apply filters such as equality, range comparisons, and compound conditions during similarity search.

Steps

  1. Add documents with structured metadata. Use flat key-value pairs with supported types (string, int, float, bool).

    import chromadb
    
    client = chromadb.PersistentClient(path="./filtered_db")
    col = client.get_or_create_collection(name="filtered_docs")
    
    col.add(
        ids=["p1", "p2", "p3", "p4"],
        documents=[
            "Deploying ChromaDB on Docker improves scalability.",
            "Metadata filtering speeds up retrieval in large corpora.",
            "Ollama supports quantized models for edge devices.",
            "RAG pipelines reduce hallucinations in LLM outputs."
        ],
        metadatas=[
            {"category": "infra", "year": 2024, "rating": 4.5},
            {"category": "rag", "year": 2024, "rating": 4.8},
            {"category": "ai", "year": 2023, "rating": 4.2},
            {"category": "rag", "year": 2025, "rating": 4.9}
        ]
    )
    print("Indexed:", col.count())
    
  2. Filter by exact string match.

    results = col.query(
        query_texts=["retrieval performance"],
        where={"category": "rag"},  # exact match filter
        n_results=2
    )
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(f"[{meta['category']}] {doc}")
    
  3. Filter by numeric range.

    results = col.query(
        query_texts=["reliability and quality"],
        where={"rating": {"$gte": 4.5}},  # rating >= 4.5
        n_results=3
    )
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(f"[rating: {meta['rating']}] {doc}")
    
  4. Combine multiple filter conditions.

    results = col.query(
        query_texts=["deployment"],
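        # Chroma expects exactly one top-level key in `where`,
        # so multiple conditions are wrapped in $and.
        where={
            "$and": [
                {"category": "infra"},
                {"rating": {"$gte": 4.0}}
            ]
        },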
        n_results=5
    )
    print("Filtered results:", len(results["documents"][0]))
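
Compound filters in ChromaDB are written with the `$and` and `$or` operators, each taking a list of single-condition dicts. The helper below is a minimal sketch of how such a filter could be assembled. `combine_where` is a hypothetical name, not part of the ChromaDB API; it only builds a plain dict that can be passed as `where=`.

```python
from typing import Any, Dict, List

Where = Dict[str, Any]


def combine_where(conditions: List[Where], op: str = "$and") -> Where:
    """Join single-field conditions under one logical operator.

    Hypothetical helper: it only builds the dict, it does not call ChromaDB.
    """
    if op not in ("$and", "$or"):
        raise ValueError(f"unsupported logical operator: {op}")
    if not conditions:
        raise ValueError("at least one condition is required")
    # A single condition needs no wrapper; return a copy of it unchanged.
    if len(conditions) == 1:
        return dict(conditions[0])
    return {op: [dict(c) for c in conditions]}


# "rag" documents from 2024 onward with a rating of at least 4.5
where = combine_where([
    {"category": "rag"},
    {"year": {"$gte": 2024}},
    {"rating": {"$gte": 4.5}},
])
print(where)
```

The resulting dict can be passed straight to a query, for example `col.query(query_texts=["retrieval"], where=where, n_results=2)`. With the four documents from step 1, only p2 and p4 satisfy all three conditions.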
    

Verification

python3 -c "
import chromadb
c = chromadb.PersistentClient(path='/tmp/chroma_filter_test')
col = c.get_or_create_collection('filt')
col.add(ids=['a','b'], documents=['doc a','doc b'], metadatas=[{'tag':'x'},{'tag':'y'}])
r = col.query(query_texts=['doc'], where={'tag':'x'}, n_results=1)
print('Filter result:', r['documents'][0])
c.delete_collection('filt')
"
# Expected: Filter result: ['doc a']

Common failures

  • Filter key not in metadata. Applying where={"nonexistent_field": "value"} returns zero results silently, not an error. Always verify the metadata schema when debugging empty results.
  • Wrong filter operator syntax. Comparison operators ($gte, $gt, $lte, $lt) must be nested in a dict under the field name, not placed at the top level. Writing {"rating": 4.5} instead of {"rating": {"$gte": 4.5}} performs an equality match, not a range comparison.
  • Non-string keys in metadata. Metadata values can be strings, numbers, or bools, but keys must be strings. Passing a non-string key such as {2024: "value"} causes a type error; write the key as a string ({"2024": "value"}) instead.
  • Large result sets with no filter. Omitting where on a large collection returns the n_results nearest matches by vector distance from the entire collection, regardless of metadata. Include an explicit filter whenever a query should be scoped.
  • Metadata updated without re-embedding. Modifying metadata via col.update(ids=[...], metadatas=[...]) does not change the stored vector. The document keeps its existing embedding, and filtering still works correctly against the new metadata values.
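
Two of the failures above (non-string keys and filters on fields that do not exist) can be caught before a query is run. The following is a rough sketch in plain Python. `check_metadata` and `known_keys` are hypothetical helper names, and the accepted value types follow the list given in step 1 rather than any particular ChromaDB release.

```python
from typing import Any, Dict, List, Optional, Set

# Value types named in step 1 of this guide (assumption: no nested values).
ALLOWED_TYPES = (str, int, float, bool)


def check_metadata(meta: Dict[Any, Any]) -> List[str]:
    """Return a list of problems found in one metadata dict (empty if none)."""
    problems = []
    for key, value in meta.items():
        if not isinstance(key, str):
            problems.append(f"key {key!r} is {type(key).__name__}, expected str")
        if not isinstance(value, ALLOWED_TYPES):
            problems.append(f"value for {key!r} is {type(value).__name__}")
    return problems


def known_keys(metadatas: List[Optional[Dict[str, Any]]]) -> Set[str]:
    """Collect every metadata key that appears in a list of metadata dicts."""
    keys: Set[str] = set()
    for meta in metadatas:
        if meta:  # skip entries that have no metadata
            keys.update(meta)
    return keys


metadatas = [
    {"category": "infra", "year": 2024, "rating": 4.5},
    {"category": "rag", "year": 2025, "rating": 4.9},
]

print(check_metadata({2024: "value"}))       # one problem: non-string key
print(check_metadata({"tags": ["a", "b"]}))  # one problem: list value
print(sorted(known_keys(metadatas)))         # ['category', 'rating', 'year']
print("nonexistent_field" in known_keys(metadatas))  # False
```

For an existing collection, the list of metadata dicts could be read with `col.get(include=["metadatas"])["metadatas"]` and passed to `known_keys` to confirm that a filter field actually exists before debugging an empty result.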
