HOW-TO · RAG
How to Apply Metadata Filters to Reduce Search Space
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Vector store with metadata support, Python 3.10+
What this does
Metadata filtering restricts a vector search to documents whose metadata fields (such as date, category, or source) match the given conditions, so the similarity search ranks only that subset. A smaller candidate pool can improve relevance and, depending on the store, reduce latency.
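To make the mechanism concrete, here is a minimal pure-Python sketch of filtering first and ranking second, independent of any vector store. The helpers `matches` and `filtered_search` and the toy 2-D vectors are hypothetical illustrations, not part of any library; real stores implement this step internally and may order filtering and search differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches(metadata, flt):
    """True when every key in the filter equals the stored metadata value."""
    return all(metadata.get(key) == value for key, value in flt.items())

def filtered_search(query_vec, records, flt, k=3):
    """Pre-filter on metadata, then rank only the surviving candidates."""
    candidates = [r for r in records if matches(r["metadata"], flt)]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]

# Toy records with hand-written 2-D vectors (illustrative values only).
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"source": "finance"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"source": "product"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"source": "finance"}},
]

hits = filtered_search([1.0, 0.0], records, {"source": "finance"}, k=2)
print([h["id"] for h in hits])
```

Record "b" is close to the query vector but never gets scored, because the filter removes it before ranking.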
Steps
- Index documents with metadata. Attach metadata when adding documents so filters have fields to match against.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.schema import Document
embeddings = OllamaEmbeddings(model="nomic-embed-text")
docs = [
    Document(page_content="Q4 earnings exceeded expectations.",
             metadata={"year": 2025, "quarter": "Q4", "source": "finance"}),
    Document(page_content="New product launch in March.",
             metadata={"year": 2025, "quarter": "Q1", "source": "product"}),
    Document(page_content="Engineering hiring plan for H2.",
             metadata={"year": 2025, "quarter": "H2", "source": "hr"}),
]
vectorstore = Chroma.from_documents(docs, embeddings)
- Apply metadata filter during retrieval. Pass a filter dict to similarity_search.
results = vectorstore.similarity_search(
    "What happened in Q4?",
    k=3,
    filter={"quarter": "Q4"},
)
for r in results:
    print(r.page_content, r.metadata)
- Use complex filters with operators. For stores that support it, combine multiple conditions.
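# ChromaDB supports $and, $or operators (v0.4+).
# A Chroma where clause takes a single top-level key, so wrap the conditions in $and.
metadata_filter = {
    "$and": [
        {"$or": [
            {"source": "finance"},
            {"source": "product"},
        ]},
        {"year": 2025},
    ]
}
results = vectorstore.similarity_search("revenue", k=3, filter=metadata_filter)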
- Use as retriever in a chain. Pass the filter when building the retriever object.
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 3, "filter": {"source": "finance"}}
)
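When filters are assembled from several conditions, a small builder can keep the operator nesting consistent. The sketch below assumes Chroma's convention that a where clause holds either a single field condition or a single top-level operator, and that $and/$or take at least two sub-conditions. The names `any_of` and `combine_filters` are hypothetical helpers, not part of LangChain or Chroma.

```python
def any_of(field, values):
    """Match documents whose `field` equals any of `values` (Chroma-style $or)."""
    values = list(values)
    if len(values) == 1:
        # A single value needs no operator; return a plain equality condition.
        return {field: values[0]}
    return {"$or": [{field: v} for v in values]}

def combine_filters(*conditions):
    """AND together condition dicts, skipping empty ones (Chroma-style $and)."""
    kept = [c for c in conditions if c]
    if not kept:
        return None  # no filtering at all
    if len(kept) == 1:
        return kept[0]
    return {"$and": kept}

metadata_filter = combine_filters(
    any_of("source", ["finance", "product"]),
    {"year": 2025},
)
print(metadata_filter)
```

The resulting dict can be passed as filter= to similarity_search or placed under "filter" in search_kwargs. Other stores use different operator names, so the builder would need adapting for them.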
Verification
python -c "
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model='nomic-embed-text')
vs = Chroma(embedding_function=embeddings)
r = vs.similarity_search('test', k=1, filter={'source': 'finance'})
print(f'Results: {len(r)}')
# Expected: Results: <N> (depends on indexed docs; a new in-memory store has none, so expect 0 unless you load a persisted index)
"
Common failures
- Metadata field mismatch. Filter references a field name that doesn't exist in the stored metadata. Always inspect a sample document first.
- Operator syntax varies by store. ChromaDB uses $and/$or, while Qdrant uses must/should. Check your vector store docs.
- Empty results from over-filtering. Combining too many filter conditions returns zero matches. Start with one filter and layer incrementally.
- Version mismatch. The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
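For the field-mismatch case above, a quick check of filter keys against one stored document's metadata can catch typos before a search silently returns nothing. This is a minimal sketch; `unknown_filter_fields` is a hypothetical helper, and it assumes filters use the $-prefixed logical operators shown in the steps, each holding a list of sub-conditions. With LangChain's Chroma wrapper, the sample metadata could plausibly be read from vectorstore.get(limit=1)["metadatas"], but verify that against your installed version.

```python
def unknown_filter_fields(flt, sample_metadata):
    """Return filter field names that are absent from a sample document's metadata."""
    missing = set()
    for key, value in flt.items():
        if key.startswith("$"):
            # Logical operator such as $and / $or: recurse into each sub-condition.
            for sub in value:
                missing |= unknown_filter_fields(sub, sample_metadata)
        elif key not in sample_metadata:
            missing.add(key)
    return missing

# Metadata copied from one indexed document (illustrative values).
sample = {"year": 2025, "quarter": "Q4", "source": "finance"}

typo_filter = {"$and": [{"source": "finance"}, {"yr": 2025}]}
print(unknown_filter_fields(typo_filter, sample))
print(unknown_filter_fields({"source": "finance"}, sample))
```

An empty set means every field in the filter exists on the sample document. It does not prove the values match, so an empty result can still come from over-filtering.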
Related guides
- How to Build RetrievalQA Chain with Sources
- How to Use Vector Store as Agent Memory