Hybrid Search (Dense + Sparse) — RAG Systems: Part 2 (Chapter 8)

Dense retrieval captures semantic meaning but misses exact keyword matches. Sparse retrieval (BM25) excels at exact matching but ignores semantic relationships. Hybrid search combines both to leverage their complementary strengths.

The Complementary Strengths Problem

Dense embeddings are effective but imperfect. They struggle with:

Exact terminology matches ("ICD-10 code" vs "medical billing code")
Product names, proper nouns, and domain-specific jargon
Numerical precision ("within 3 business days" vs "within 5 business days")
Negation ("NOT covered" vs "covered")

Sparse retrieval (BM25) is based on term frequency statistics:

score(D, Q) = Σ IDF(term) × (term_frequency_in_D × (k1 + 1)) / 
                        (term_frequency_in_D + k1 × (1 - b + b × |D|/avgdl))

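Here the sum runs over the terms in query Q, k1 controls term frequency saturation, b controls document length normalization, |D| is the length of document D in tokens, and avgdl is the average document length in the corpus.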

Sparse retrieval on its own fails when queries use synonyms ("vehicle" vs "car" vs "automobile") or when semantic understanding is needed.
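To make the formula and the synonym failure concrete, here is a minimal from-scratch sketch of BM25 scoring. It is not the rank_bm25 implementation: the function name bm25_scores, the toy corpus, and the smoothed IDF variant are assumptions made for illustration, since the formula above does not pin down how IDF is computed.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Assumption: one common smoothed IDF variant (always non-negative)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical), whitespace-tokenized
corpus = [
    "the car would not start this morning".split(),
    "automobile insurance covers collision damage".split(),
    "bake the bread for thirty minutes".split(),
]
print(bm25_scores(["car"], corpus))      # only the first document scores above zero
print(bm25_scores(["vehicle"], corpus))  # no document contains the term, so all scores are zero
```

The second query shows the limitation described above: "vehicle" never appears in the corpus, so the documents about a car and an automobile both score zero.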

Implementing Hybrid Search

The standard hybrid search architecture:

Query
  ├── Embed Query → Dense Retrieval → Dense Scores
  └── BM25 Scoring → Sparse Scores
           ↓
    Combine Scores (RRF or weighted)
           ↓
       Final Ranking

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

class HybridRetriever:
    def __init__(self, documents, dense_model='sentence-transformers/all-MiniLM-L6-v2'):
        self.documents = documents
        self.contents = [doc['content'] for doc in documents]
        
        # Setup sparse retrieval (BM25); lowercase so matching is case-insensitive
        tokenized_corpus = [doc.lower().split() for doc in self.contents]
        self.bm25 = BM25Okapi(tokenized_corpus)
        
        # Setup dense retrieval
        self.dense_model = SentenceTransformer(dense_model)
        self.dense_embeddings = self.dense_model.encode(self.contents)
    
    def retrieve(self, query, k=20, alpha=0.5):
        """
        Hybrid retrieval combining dense and sparse.
        
        Args:
            query: Query string
            k: Number of results to return
            alpha: Weight for dense vs sparse (0=all sparse, 1=all dense)
        """
        # Dense retrieval; min-max normalize so both score sets share the [0, 1] scale
        query_embedding = self.dense_model.encode([query])
        dense_scores = self._cosine_similarity(query_embedding, self.dense_embeddings)
        dense_scores = self._normalize(dense_scores)
        
        # Sparse retrieval; tokenize the query the same way as the corpus
        tokenized_query = query.lower().split()
        sparse_scores = self.bm25.get_scores(tokenized_query)
        sparse_scores = self._normalize(sparse_scores)
        
        # Combine scores
        combined_scores = alpha * dense_scores + (1 - alpha) * sparse_scores
        
        # Get top-k results
        top_indices = np.argsort(combined_scores)[::-1][:k]
        
        return [
            {
                'document': self.documents[i]['content'],
                'score': combined_scores[i],
                'dense_score': dense_scores[i],
                'sparse_score': sparse_scores[i],
                'metadata': self.documents[i].get('metadata', {})
            }
            for i in top_indices
        ]
    
    def _cosine_similarity(self, query_vec, doc_vecs):
        """Compute cosine similarity between query and documents."""
        similarities = np.dot(doc_vecs, query_vec.T).flatten()
        norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
        return similarities / norms
    
    def _normalize(self, scores):
        """Min-max normalize scores to [0, 1]."""
        if scores.max() == scores.min():
            return np.ones_like(scores) * 0.5
        return (scores - scores.min()) / (scores.max() - scores.min())
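The architecture diagram lists RRF (reciprocal rank fusion) as an alternative to the weighted combination used in HybridRetriever. RRF ignores raw scores and fuses on rank position only, which sidesteps the problem of dense and sparse scores living on different scales. A minimal sketch, assuming each retriever returns a best-first list of document ids (the function name and the ids are hypothetical):

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several best-first lists of document ids into one ranking.

    Each document receives 1 / (rank_constant + rank) from every list it
    appears in; the contributions are summed.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from the two retrievers
dense_ranking = ["d3", "d1", "d2"]
sparse_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd4', 'd2']
```

Documents that rank well in both lists (d1, d3) rise to the top, while documents seen by only one retriever still make it into the fused list.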

Weighting Strategies

Alpha controls dense vs. sparse contribution. The optimal value depends on your use case:

Alpha near 1.0 (0.7-0.9): Dense dominant. Best when queries use synonyms or everyday wording for concepts that documents describe in precise technical language, or when semantic understanding matters more than exact wording.
Alpha near 0.5 (0.4-0.6): Balanced. Good default starting point. Many applications perform well in this range.
Alpha near 0.0 (0.1-0.3): Sparse dominant. Best when queries contain exact terminology, product codes, or proper nouns, or when the downstream LLM can bridge remaining semantic gaps on its own.
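A small worked example shows how alpha can flip the ranking. The scores below are made-up, already-normalized values chosen for illustration, not measurements: document 0 is the best semantic match, document 1 contains the exact query term.

```python
# Hypothetical normalized scores for three documents (illustrative values only)
dense = [0.9, 0.4, 0.6]    # document 0 is the closest paraphrase of the query
sparse = [0.1, 1.0, 0.5]   # document 1 contains the exact query term

def rank_documents(alpha):
    """Return document indices ordered best-first for a given alpha."""
    combined = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

for alpha in (0.2, 0.5, 0.8):
    print(alpha, rank_documents(alpha))
# 0.2 [1, 2, 0]
# 0.5 [1, 2, 0]
# 0.8 [0, 2, 1]
```

With these numbers the exact-match document wins until alpha is high enough for the dense score to dominate.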

The correct approach is to tune alpha on your evaluation set:

def tune_alpha(documents, queries, relevant_labels, alpha_values=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0)):
    """Grid-search alpha and return the mean recall@50 for each value.

    Assumes every document carries metadata['doc_id'] and that
    relevant_labels[i] lists the relevant doc_ids for queries[i].
    """
    retriever = HybridRetriever(documents)
    results = {}
    
    for alpha in alpha_values:
        aggregate_recall = 0.0
        evaluated = 0
        for query, relevant_docs in zip(queries, relevant_labels):
            if not relevant_docs:
                continue  # recall is undefined without relevant documents
            retrieved = retriever.retrieve(query, k=50, alpha=alpha)
            retrieved_ids = {doc['metadata'].get('doc_id') for doc in retrieved}
            
            # Recall: fraction of relevant documents found in the top 50
            relevant = set(relevant_docs)
            aggregate_recall += len(retrieved_ids & relevant) / len(relevant)
            evaluated += 1
        
        results[alpha] = aggregate_recall / evaluated if evaluated else 0.0
    
    return results
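Once tune_alpha has returned its dictionary, picking a value is a one-liner. The helper below is a hypothetical convenience, and the recall numbers are placeholders rather than real results:

```python
def best_alpha(results):
    """Return the (alpha, recall) pair with the highest recall.

    Ties go to the smaller alpha because items are visited in ascending order.
    """
    return max(sorted(results.items()), key=lambda item: item[1])

# Placeholder recall values, for illustration only
example_results = {0.0: 0.62, 0.3: 0.71, 0.5: 0.78, 0.7: 0.78, 1.0: 0.66}
print(best_alpha(example_results))  # (0.5, 0.78)
```

If several alphas are close, it is usually worth checking them on a held-out set of queries before committing to one.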

Elasticsearch and Weaviate Implementations

Production vector databases often include native hybrid search:

# Elasticsearch with hybrid search
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

def es_hybrid_search(query, index_name, k=20, sparse_weight=0.5, dense_weight=0.5):
    """
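    Elasticsearch native hybrid search using BM25 and kNN.

    RRF fuses results by rank rather than by score, so sparse_weight and
    dense_weight are not used in this call. embed_query is assumed to be
    defined elsewhere and to return the query embedding as a list of floats.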
    """
    response = es.search(
        index=index_name,
        query={
            "bool": {
                "should": [
                    {"match": {"content": query}}  # Sparse component
                ]
            }
        },
        knn={
            "field": "embedding",
            "query_vector": embed_query(query),
            "k": 50,
            "num_candidates": 100
        },
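        # RRF fusion is requested through `rank`; option names differ between
        # Elasticsearch versions (e.g. `window_size` vs. `rank_window_size`),
        # so check the documentation for the version you run
        rank={
            "rrf": {
                "window_size": k,
                "rank_constant": 60
            }
        },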
        size=k
    )
    
    return [
        {
            'document': hit['_source']['content'],
            'score': hit['_score'],
            'metadata': hit['_source'].get('metadata', {})
        }
        for hit in response['hits']['hits']
    ]
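The search function above presupposes an index with a text field named content and a vector field named embedding. A sketch of a matching mapping follows; the helper name and the index name "docs" are hypothetical, and the default of 384 dimensions assumes the all-MiniLM-L6-v2 model used earlier.

```python
def build_hybrid_index_mapping(embedding_dims=384):
    """Build an index body with one BM25 text field and one kNN vector field."""
    return {
        "mappings": {
            "properties": {
                # Analyzed text field, scored with BM25 by the `match` query
                "content": {"type": "text"},
                # Indexed dense vector, searched by the `knn` clause
                "embedding": {
                    "type": "dense_vector",
                    "dims": embedding_dims,
                    "index": True,
                    "similarity": "cosine",
                },
                "metadata": {"type": "object"},
            }
        }
    }

mapping = build_hybrid_index_mapping()
# With a running cluster and the 8.x Python client, the index could be created with:
# es.indices.create(index="docs", mappings=mapping["mappings"])
```

The dims value must match the embedding model's output size, otherwise indexing the vectors will be rejected.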