KEY INSIGHT
Pre-filter documents by metadata before similarity search to scope results to relevant subsets.
ChromaDB supports `where` filtering to restrict queries to documents matching specific metadata criteria. The filter runs before similarity search, narrowing the candidate set.
```python
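import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")
# ChromaDB expects an embedding-function wrapper, not a raw SentenceTransformer model
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_fn,
)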
# Add documents with various metadata
collection.add(
    documents=[
        "How to install Python 3.11 on Ubuntu",
        "Python installation guide for Windows",
        "Docker container setup tutorial",
        "Kubernetes deployment best practices",
        "React component lifecycle explained",
        "Building REST APIs with FastAPI"
    ],
    ids=["p1", "p2", "d1", "k1", "r1", "f1"],
    metadatas=[
        {"category": "python", "difficulty": "beginner", "rating": 4.5},
        {"category": "python", "difficulty": "beginner", "rating": 4.2},
        {"category": "devops", "difficulty": "intermediate", "rating": 4.8},
        {"category": "devops", "difficulty": "advanced", "rating": 4.6},
        {"category": "frontend", "difficulty": "intermediate", "rating": 4.3},
        {"category": "backend", "difficulty": "intermediate", "rating": 4.7}
    ]
)
# Filter by single metadata field
results = collection.query(
    query_texts=["containers and deployment"],
    n_results=3,
    where={"category": "devops"}  # Only search devops documents
)
print("DevOps results:")
# results['documents'] holds one list per query text; [0] is the first query
for doc in results['documents'][0]:
    print(f" - {doc}")
```
Output:
```
DevOps results:
- Docker container setup tutorial
- Kubernetes deployment best practices
```
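To make the order of operations concrete, here is a minimal pure-Python sketch of pre-filtering followed by similarity ranking. It is an illustration only, not ChromaDB's implementation; the toy 2-D vectors and the `filtered_search` helper are hypothetical.

```python
import math

# Toy records: (id, metadata, embedding). The vectors are made up for illustration.
RECORDS = [
    ("p1", {"category": "python"}, [0.9, 0.1]),
    ("d1", {"category": "devops"}, [0.2, 0.9]),
    ("k1", {"category": "devops"}, [0.4, 0.8]),
    ("r1", {"category": "frontend"}, [0.7, 0.6]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_search(query_vec, where, n_results):
    """Keep records whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [
        (rid, vec) for rid, meta, vec in RECORDS
        if all(meta.get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [rid for rid, _ in ranked[:n_results]]

print(filtered_search([0.1, 1.0], {"category": "devops"}, n_results=3))
# → ['d1', 'k1']
```

As in the ChromaDB example above, fewer than `n_results` items come back because only two records pass the filter.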
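Comparison operators `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte` apply to a single field. The ordering operators (`$gt`, `$gte`, `$lt`, `$lte`) accept numeric values only, so an ordinal string field such as `difficulty` needs `$in` with an explicit list of values instead. To combine conditions on several fields, wrap them in `$and` or `$or`:
```python
# Filter by category AND rating (multiple conditions must be wrapped in $and)
results = collection.query(
    query_texts=["programming tutorials"],
    n_results=3,
    where={
        "$and": [
            {"category": {"$eq": "python"}},
            {"rating": {"$gte": 4.5}}  # rating >= 4.5 (numeric comparison)
        ]
    }
)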
# Filter with OR logic using $or
results = collection.query(
    query_texts=["tutorials"],
    n_results=5,
    where={
        "$or": [
            {"category": {"$eq": "python"}},
            {"category": {"$eq": "frontend"}}
        ]
    }
)
```
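Recent ChromaDB releases reject a `where` dict with more than one top-level key, and `$and`/`$or` need at least two sub-conditions, so filters assembled by hand are easy to get wrong. A small helper can make the wrapping explicit. The `build_where` function below is a hypothetical convenience, not part of the ChromaDB API.

```python
def build_where(*conditions):
    """Combine condition dicts into one `where` filter.

    Zero conditions -> None (no filter), one -> returned as-is,
    two or more -> wrapped in a single top-level "$and".
    """
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": list(conditions)}

where = build_where(
    {"category": {"$eq": "python"}},
    {"rating": {"$gte": 4.5}},
)
print(where)
# → {'$and': [{'category': {'$eq': 'python'}}, {'rating': {'$gte': 4.5}}]}
```

The result can be passed directly as `where=...` to `collection.query`.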
Metadata filtering is effective but has limits. ChromaDB resolves the filter before the vector search, so a filter that matches a very large number of documents adds overhead to every query. For large-scale filtering (millions of documents), consider segmenting the data into separate collections per category.
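One way to segment by category is to route each query to a dedicated collection. The sketch below assumes one collection per category; the `kb_<category>` naming scheme and both helper functions are hypothetical. Only `get_or_create_collection` and `query` are ChromaDB calls.

```python
from typing import Any

def collection_name_for(category: str, prefix: str = "kb") -> str:
    """Derive a per-category collection name, e.g. 'kb_devops' (naming scheme is an assumption)."""
    return f"{prefix}_{category.strip().lower()}"

def query_category(client: Any, category: str, text: str,
                   n_results: int = 3, **collection_kwargs: Any):
    """Query only the collection that holds `category`, so no `where` filter is needed.

    Extra keyword arguments (for example embedding_function=...) are forwarded
    to client.get_or_create_collection.
    """
    collection = client.get_or_create_collection(
        name=collection_name_for(category), **collection_kwargs
    )
    return collection.query(query_texts=[text], n_results=n_results)
```

With a real client this might be called as `query_category(client, "devops", "containers and deployment")`. Documents must be added to the matching per-category collection at ingestion time, and queries that span categories then need to be issued against each collection and merged by the caller.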