Sparse Retrieval (BM25) — RAG Systems: Part 1 (Chapter 14)

BM25 (Best Matching 25) ranks documents based on term frequency and inverse document frequency. Unlike dense retrieval, BM25 requires exact term matches. This limitation is also its strength for specific query types.

BM25 Algorithm Explained

BM25 scores a document as follows:

score(D, Q) = Î£ IDF(q_i) Ã— (tf(q_i, D) Ã— (k1 + 1)) / (tf(q_i, D) + k1 Ã— (1 - b + b Ã— |D|/avgdl))

Where k1 controls term frequency saturation (typically 1.2-2.1) and b controls document length normalization (typically 0.75). Higher k1 values give more weight to repeated terms. Higher b values penalize longer documents more heavily.

BM25 accounts for term frequency saturation: a term appearing 10 times is not 10x more relevant than appearing once. This prevents common words from dominating scores.

BM25 Implementation with Rank BM25

from rank_bm25 import BM25Okapi

# Tokenize corpus
tokenized_corpus = [doc.split(" ") for doc in document_texts]

# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Search
query = "authentication token jwt"
tokenized_query = query.split(" ")
scores = bm25.get_scores(tokenized_query)
top_indices = scores.argsort()[-10:][::-1]

# Retrieve top results with scores
results = [
    {"id": doc_ids[i], "text": document_texts[i], "score": scores[i]}
    for i in top_indices
]

For production use, BM25Okapi works well for corpora under 1 million documents. For larger corpora, consider Elasticsearch or OpenSearch which implement BM25 at scale.

Elasticsearch BM25 Setup

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Create index with BM25
es.indices.create(
    index="documents",
    body={
        "settings": {
            "analysis": {"analyzer": {"default": {"type": "english"}}}
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "english"},
                "metadata": {"type": "object"}
            }
        }
    }
)

# Index documents
for doc_id, text in zip(doc_ids, document_texts):
    es.index(index="documents", id=doc_id, body={"text": text})

# Search
response = es.search(
    index="documents",
    body={
        "query": {
            "match": {"text": "python asyncio error handling"}
        },
        "_source": ["text"],
        "size": 10
    }
)

Elasticsearch uses the standard BM25 implementation (Lucene's BM25Similarity) with English analysis including stemming, stop word removal, and synonym handling.

BM25 Failure Cases

BM25 fails when queries and documents use different vocabulary. A user searching "neural network training" returns different results than documents written as "deep learning model training."

BM25 also fails with:

Misspelled queries (without fuzzy matching)
Very short queries (single words lack context)
Concept-based queries ("emotional support during deployment")
Cross-lingual content (English query, French documents)

# BM25 with fuzziness for typo tolerance
response = es.search(
    index="documents",
    body={
        "query": {
            "match": {
                "text": {
                    "query": "authentcation",  # Intentional typo
                    "fuzziness": "AUTO"        # Auto-corrects "authentcation"
                }
            }
        }
    }
)

Fuzzy matching adds recall but reduces precision. A query for "Java" might match "JavaScript" documents.