08. Hybrid Search (Dense + Sparse)
Dense retrieval captures semantic meaning but misses exact keyword matches. Sparse retrieval (BM25) excels at exact matching but ignores semantic relationships. Hybrid search combines both to leverage their complementary strengths.
The Complementary Strengths Problem
Dense embeddings are effective but imperfect. They struggle with:
- Exact terminology matches ("ICD-10 code" vs "medical billing code")
- Product names, proper nouns, and domain-specific jargon
- Numerical precision ("within 3 business days" vs "within 5 business days")
- Negation ("NOT covered" vs "covered")
Sparse retrieval (BM25) is based on term frequency statistics:
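score(D, Q) = Σ IDF(term) × (term_frequency_in_D × (k1 + 1)) /
              (term_frequency_in_D + k1 × (1 - b + b × |D|/avgdl))
The sum runs over the terms of query Q. k1 controls term frequency saturation, b controls document length normalization, |D| is the length of document D in tokens, and avgdl is the average document length in the corpus.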
Sparse retrieval on its own fails when queries use synonyms ("vehicle" vs "car" vs "automobile") or when semantic understanding is needed.
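To make the formula and the synonym failure concrete, here is a minimal pure-Python sketch of BM25 scoring. The function name `bm25_scores`, the defaults k1=1.5 and b=0.75, and the smoothed IDF variant are assumptions chosen for illustration; libraries differ in the exact IDF they use.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant: smoothed so it is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "the car dealership sells used cars".split(),
    "automobile repair and maintenance guide".split(),
    "vehicle registration renewal form".split(),
]
scores = bm25_scores(["car"], corpus)
print(scores)
```

Only the first document receives a non-zero score. The "automobile" and "vehicle" documents score exactly zero because BM25 only sees token overlap, which is the gap dense retrieval is meant to cover.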
Implementing Hybrid Search
The standard hybrid search architecture:
Query
├── Embed Query → Dense Retrieval → Dense Scores
└── BM25 Scoring → Sparse Scores
          ↓
Combine Scores (RRF or weighted)
          ↓
Final Ranking
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np


class HybridRetriever:
    def __init__(self, documents, dense_model='sentence-transformers/all-MiniLM-L6-v2'):
        self.documents = documents
        self.contents = [doc['content'] for doc in documents]
        # Setup sparse retrieval (BM25); lowercase so matching is case-insensitive
        tokenized_corpus = [doc.lower().split() for doc in self.contents]
        self.bm25 = BM25Okapi(tokenized_corpus)
        # Setup dense retrieval
        self.dense_model = SentenceTransformer(dense_model)
        self.dense_embeddings = self.dense_model.encode(self.contents)

    def retrieve(self, query, k=20, alpha=0.5):
        """
        Hybrid retrieval combining dense and sparse.

        Args:
            query: Query string
            k: Number of results to return
            alpha: Weight for dense vs sparse (0=all sparse, 1=all dense)
        """
        # Dense retrieval
        query_embedding = self.dense_model.encode([query])
        dense_scores = self._cosine_similarity(query_embedding, self.dense_embeddings)
        # Normalize so dense and sparse scores share the same [0, 1] range
        dense_scores = self._normalize(dense_scores)
        # Sparse retrieval (tokenize the query the same way as the corpus)
        tokenized_query = query.lower().split()
        sparse_scores = self.bm25.get_scores(tokenized_query)
        sparse_scores = self._normalize(sparse_scores)
        # Combine scores
        combined_scores = alpha * dense_scores + (1 - alpha) * sparse_scores
        # Get top-k results
        top_indices = np.argsort(combined_scores)[::-1][:k]
        return [
            {
                'document': self.documents[i]['content'],
                'score': float(combined_scores[i]),
                'dense_score': float(dense_scores[i]),
                'sparse_score': float(sparse_scores[i]),
                'metadata': self.documents[i].get('metadata', {})
            }
            for i in top_indices
        ]

    def _cosine_similarity(self, query_vec, doc_vecs):
        """Compute cosine similarity between query and documents."""
        similarities = np.dot(doc_vecs, query_vec.T).flatten()
        norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
        # Guard against division by zero for all-zero vectors
        return similarities / np.maximum(norms, 1e-12)

    def _normalize(self, scores):
        """Min-max normalize scores to [0, 1]."""
        if scores.max() == scores.min():
            return np.ones_like(scores) * 0.5
        return (scores - scores.min()) / (scores.max() - scores.min())
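The retrieve method above fuses scores with a weighted sum. The other option named in the architecture diagram, RRF (reciprocal rank fusion), combines ranks instead of scores, so it needs no score normalization. The following is a minimal sketch; the function name and the toy rankings are hypothetical, and the constant 60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several best-first lists of document ids into one ranking.

    Each document receives 1 / (rank_constant + rank) from every list
    it appears in; documents missing from a list get nothing from it.
    Returns (doc_id, fused_score) pairs sorted best-first.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of a dense and a sparse retriever.
dense_ranking = ["d3", "d1", "d2"]
sparse_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that rank well in both lists ("d1" and "d3" here) rise above documents that appear in only one, regardless of how the raw scores of the two retrievers are scaled.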
Weighting Strategies
Alpha controls dense vs. sparse contribution. The optimal value depends on your use case:
Alpha near 1.0 (0.7-0.9): Dense dominant. Best when queries use synonyms or paraphrases, when documents describe the same concept in varied wording, and when semantic understanding matters.
Alpha near 0.5 (0.4-0.6): Balanced. Good default starting point. Many applications perform well in this range.
Alpha near 0.0 (0.1-0.3): Sparse dominant. Best when queries contain exact terminology, product codes, or proper nouns, or when the downstream LLM can bridge semantic gaps from the retrieved context anyway.
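A small sketch of how alpha shifts the ranking may help here. The scores below are hypothetical, already-normalized values for three documents, and `blend` and `ranking` are illustrative helper names, not part of any library.

```python
def blend(dense, sparse, alpha):
    """Weighted fusion of two already-normalized score lists."""
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]


def ranking(scores):
    """Document indices sorted by score, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical normalized scores for documents 0, 1, 2.
dense = [0.9, 0.7, 0.1]   # document 0 is the strongest semantic match
sparse = [0.1, 0.6, 1.0]  # document 2 is the strongest keyword match

rankings_by_alpha = {a: ranking(blend(dense, sparse, a)) for a in (0.2, 0.5, 0.8)}
for alpha, order in rankings_by_alpha.items():
    print(alpha, order)
```

With these numbers the sparse-dominant setting puts document 2 first, the dense-dominant setting puts document 0 first, and the balanced setting favors document 1, which is decent on both signals.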
The correct approach is to tune alpha on your evaluation set:
def tune_alpha(documents, queries, relevant_labels, alpha_values=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0)):
    """Find optimal alpha through grid search. Returns {alpha: mean recall@50}."""
    retriever = HybridRetriever(documents)
    results = {}
    for alpha in alpha_values:
        aggregate_recall = 0.0
        for query, relevant_docs in zip(queries, relevant_labels):
            retrieved = retriever.retrieve(query, k=50, alpha=alpha)
            # Assumes each document's metadata carries a 'doc_id' that matches the labels
            retrieved_ids = [doc['metadata'].get('doc_id') for doc in retrieved]
            # Calculate recall for this query
            recall = len(set(retrieved_ids) & set(relevant_docs)) / len(relevant_docs)
            aggregate_recall += recall
        results[alpha] = aggregate_recall / len(queries)
    return results
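The grid search returns a dictionary mapping each alpha to its mean recall, so one more step is needed to pick a value. A minimal sketch follows; `pick_best_alpha` and its tie-breaking rule (prefer the candidate closest to 0.5) are assumptions for illustration, and the numbers are placeholders rather than measurements.

```python
def pick_best_alpha(results, prefer=0.5):
    """Return the alpha with the highest score; break ties toward `prefer`."""
    best_score = max(results.values())
    candidates = [alpha for alpha, score in results.items() if score == best_score]
    return min(candidates, key=lambda alpha: abs(alpha - prefer))


# Placeholder values only, shaped like the output of tune_alpha.
example_results = {0.0: 0.61, 0.3: 0.70, 0.5: 0.74, 0.7: 0.74, 1.0: 0.66}
best = pick_best_alpha(example_results)
print(best)
```

Breaking ties toward the balanced setting is one reasonable choice when two alphas perform equally well on a small evaluation set; other policies are equally defensible.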
Elasticsearch and Weaviate Implementations
Production vector databases often include native hybrid search:
# Elasticsearch with hybrid search
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])


def es_hybrid_search(query, index_name, k=20):
    """
    Elasticsearch native hybrid search using BM25 and kNN, fused with RRF.

    RRF is rank-based, so no sparse/dense weights are needed. Assumes
    Elasticsearch 8.8+ and an embed_query(query) helper, defined elsewhere,
    that returns the query vector as a list of floats.
    """
    response = es.search(
        index=index_name,
        query={
            "bool": {
                "should": [
                    {"match": {"content": query}}  # Sparse component
                ]
            }
        },
        knn={
            "field": "embedding",
            "query_vector": embed_query(query),
            "k": 50,
            "num_candidates": 100
        },
        # RRF is requested through `rank`; later 8.x releases rename
        # window_size to rank_window_size. The window must be >= size.
        rank={
            "rrf": {
                "window_size": max(k, 50),
                "rank_constant": 60
            }
        },
        size=k
    )
    return [
        {
            'document': hit['_source']['content'],
            'score': hit['_score'],
            'metadata': hit['_source'].get('metadata', {})
        }
        for hit in response['hits']['hits']
    ]
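The search function above presumes an index with a text field named content and a vector field named embedding. A mapping along these lines would support it; this is a sketch under assumptions, with the index name "docs" being hypothetical and 384 dimensions chosen to match all-MiniLM-L6-v2. Text fields are scored with BM25 by default, so the sparse side needs no extra configuration.

```python
# Hypothetical index mapping; field names mirror the search function above.
index_mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},  # scored with BM25 by default
            "embedding": {
                "type": "dense_vector",
                "dims": 384,              # output size of all-MiniLM-L6-v2
                "index": True,            # required for kNN search
                "similarity": "cosine",
            },
            "metadata": {"type": "object"},
        }
    }
}

# Creating the index needs a running cluster, so the call is left commented out:
# es.indices.create(index="docs", mappings=index_mapping["mappings"])
```

If a different embedding model is used, dims must be changed to that model's output size, otherwise indexing the vectors will fail.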
Implement hybrid search on your document collection using both weighted score fusion and RRF. Compare precision at k=10 for dense-only, sparse-only, and hybrid retrieval. For weighted fusion, vary the alpha parameter from 0 to 1 in 0.2 increments and identify the optimal blend.
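As a starting point for the exercise, precision at k and the alpha grid can be sketched as follows. The helper name `precision_at_k` and the sample ids are hypothetical; the metric here divides by k even when fewer than k results come back, which is one common convention.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


# Alpha grid for the exercise: 0 to 1 in steps of 0.2.
alphas = [i / 5 for i in range(6)]

# Hypothetical ids, just to show the call.
example_precision = precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=4)
print(alphas, example_precision)
```

Running the same function over dense-only (alpha=1), sparse-only (alpha=0), and each intermediate alpha gives directly comparable numbers for the three configurations.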