BM25 (Best Matching 25)
BM25 is the canonical sparse-retrieval algorithm: a TF-IDF variant that saturates term frequency (a token appearing 100 times isn't 100× more relevant than one appearing once) and normalizes by document length. It is the default scorer in Lucene, Elasticsearch, OpenSearch, and Tantivy.
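For reference, one common textbook form of the scoring function is sketched below. The symbols are assumptions introduced here for illustration: f(t, d) is the count of term t in document d, |d| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, and n_t is the number of documents containing t. Real implementations differ in details such as the IDF variant, so treat this as a sketch rather than the exact formula any particular engine uses.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

As f(t, d) grows, the fraction approaches k_1 + 1 instead of growing without bound, which is the saturation behavior described above.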
BM25 is decades old, requires no training, and runs on any corpus with a tokenizer. For exact-match and rare-vocabulary queries it often beats far more elaborate neural retrievers, which is why hybrid retrieval pipelines keep it.
Tunable knobs: k1 (term-frequency saturation, typically 1.2–2.0) and b (length normalization, default 0.75). For general corpora, per-domain tuning rarely improves much on the defaults.
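To make the roles of k1 and b concrete, here is a minimal Python sketch of BM25 scoring over pre-tokenized documents. The function names `bm25_scores` and `tokenize`, the whitespace tokenizer, the toy corpus, and the particular IDF variant are all assumptions chosen for illustration; this is not the implementation used by Lucene or any other engine.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document.

    `query` is a list of tokens; `docs` is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: b=0 ignores length, b=1 normalizes fully.
        norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating term-frequency component, bounded by k1 + 1.
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "cat cat cat cat cat",
        "dogs chase balls in the park",
    ]
    docs = [tokenize(text) for text in corpus]
    for text, score in zip(corpus, bm25_scores(tokenize("cat"), docs)):
        print(f"{score:.3f}  {text}")
```

Setting b=0 switches off length normalization, which makes the saturation effect easy to isolate: with length ignored, a document repeating a term 100 times scores less than k1 + 1 times as much as a document containing it once.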
Reviewed by Fredoline Eruo. See our editorial policy.