There is a particular kind of respect reserved in engineering for the algorithm that outlives its era. BM25 is one of those algorithms. It was born out of information retrieval research in the 1970s and 1980s, polished over decades, and eventually adopted as the default ranking function in Lucene, Elasticsearch, and Solr.
What makes BM25 worth understanding is not just that it works. It is that it works for knowable reasons.
Every part of the formula has a clear interpretation. When a result is surprising, you can trace why. When you need to tune for your domain, the parameters give you meaningful handles to turn. The interpretability is genuinely valuable.
In this write-up, I cover BM25 from first principles - where it came from, why TF-IDF was not enough, how the formula works mechanically, how the parameters behave in practice, what its real limitations are, and where it fits in a modern retrieval stack.
What BM25 Was Built To Solve
The simplest possible retrieval system is Boolean keyword matching: a document is relevant if it contains the query terms and irrelevant if it does not. This works when the corpus is small and queries are exact, but it collapses on anything more demanding.
For example, every document containing “database” matches equally for the query “fast database.” You have no ranking, no way to distinguish a paper about database internals from a blog post where “database” appears once in a sidebar.
The natural next step is TF-IDF, which most engineers encounter first. TF-IDF scores a document by multiplying two quantities:
- Term Frequency (TF): how many times the query term appears in the document
- Inverse Document Frequency (IDF): a measure of how rare the term is across the corpus
The intuition is sound. A document that mentions “photosynthesis” ten times is probably more about photosynthesis than one that mentions it once. And a term that appears in every document (like “the”) tells you nothing about relevance, so you discount it with IDF.
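The mechanics fit in a few lines of Python. A toy sketch, assuming whitespace tokenization and the classic log(N / n) IDF (the corpus and terms here are illustrative):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Toy TF-IDF: raw term count times log(N / n)."""
    tf = doc_tokens.count(term)                      # raw term frequency
    n = sum(1 for d in corpus if term in d)          # docs containing the term
    idf = math.log(len(corpus) / n) if n else 0.0    # rare terms score higher
    return tf * idf

corpus = [
    "photosynthesis converts light into energy".split(),
    "the cell stores energy".split(),
    "the database stores records".split(),
]
# "photosynthesis" appears in 1 of 3 docs; "the" appears in 2 of 3,
# so its IDF (and therefore its score) is much lower.
print(tf_idf("photosynthesis", corpus[0], corpus))
print(tf_idf("the", corpus[1], corpus))
```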
TF-IDF works surprisingly well for a heuristic, which is why it survived in production systems for decades. But it has two fundamental failure modes that compound badly in real corpora:
TF is Linear
A document mentioning “photosynthesis” 200 times scores exactly twice as high as one mentioning it 100 times. But is a document twice as relevant just because it repeats the term more? In most cases, no. After a term appears enough times to establish that the document is about that concept, additional occurrences contribute diminishing information about relevance. TF-IDF does not model this.
No Concept of Document Length
The second failure mode is that TF-IDF does not account for document length. A short, focused abstract mentioning "photosynthesis" three times competes on equal footing, under raw TF, with a 10,000-word textbook chapter that mentions it fifteen times. The textbook chapter will almost always win on TF, but that may not reflect relevance. Long documents naturally accumulate more term occurrences just by being long, not because they are more relevant.
Okapi and the TREC years
BM25 emerged from work done on the Okapi system at City University London. The name Okapi BM25 reflects this lineage: Okapi was the system, and BM stands for “Best Match.” The 25 denotes a specific iteration in the development of Best Match functions, which had been evolving through a series of numbered variants.
The shift from research standard to industry default happened more slowly. Lucene, the search library underlying both Elasticsearch and Solr, shipped with a modified TF-IDF implementation for years. Lucene 6 made BM25 the default similarity function in 2016, and Elasticsearch 5.0 followed suit. At that point, BM25 became the de facto relevance algorithm for most production search deployments in the world.
How BM25 Works
Rather than presenting the formula and explaining it, it is more useful to build it up from the two problems TF-IDF could not solve. That way, the formula reads as a series of deliberate design decisions rather than a pile of notation.
Saturating Term Frequency
The core insight is that the relationship between term frequency and relevance should not be linear. It should saturate. The first few occurrences of a term in a document are strong evidence of relevance. After that, each additional occurrence contributes less. Eventually, adding more occurrences should contribute almost nothing.
BM25 achieves this with the following transformation of raw term frequency f:
f * (k1 + 1)
───────────────
f + k1
Where k1 is a free parameter. When f is 0, the numerator is 0. As f grows, the expression approaches an asymptote of k1 + 1. The curve rises steeply at first and then flattens. This is the saturation function.
The parameter k1 controls how quickly the saturation occurs. With a low k1 (say, 0.5), the function saturates quickly, and the first occurrence of a term does most of the work. With a higher k1 (say, 2.0), the function saturates slowly, and multiple occurrences continue to add meaningful score. For most text collections, k1 values between 1.2 and 2.0 work well. Elasticsearch defaults to 1.2.
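The saturation is easy to see numerically. A small sketch of the function above, evaluated for a few term frequencies at different k1 values:

```python
def saturated_tf(f, k1):
    """BM25 term-frequency saturation: approaches k1 + 1 as f grows."""
    return f * (k1 + 1) / (f + k1)

# Each row rises quickly and then flattens toward the asymptote k1 + 1.
for k1 in (0.5, 1.2, 2.0):
    row = [round(saturated_tf(f, k1), 2) for f in (1, 2, 5, 10, 100)]
    print(f"k1={k1}: {row}")
```

With k1 = 0.5 the jump from one occurrence to one hundred is small; with k1 = 2.0 repeated occurrences keep earning score for longer.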
To see why this matters in practice: imagine searching for “search engine” across a corpus. A document that uses the phrase once in a focused technical definition is probably more relevant than a marketing page that repeats “search engine” forty times across boilerplate text. The saturation function gives the first document a fighting chance.
Normalizing Document Length
The second fix is normalizing for document length. The idea is that a term occurring three times in a 300-word document is a stronger relevance signal than the same term occurring three times in a 10,000-word book chapter.
BM25 incorporates document length by adjusting the effective term frequency based on how long the current document is relative to the average document length in the corpus:
f * (k1 + 1)
──────────────────────────────────────────────────
f + k1 * (1 - b + b * (|D| / avgdl))
Where |D| is the length of the current document in tokens, avgdl is the average document length across the corpus, and b is a second free parameter that controls how aggressively length normalization is applied.
When b = 0, the denominator reduces to f + k1, and length normalization is disabled entirely. The score depends only on term frequency. When b = 1, full length normalization is applied: the effective term frequency is scaled in proportion to the ratio of the document's length to the average. The standard default of b = 0.75 applies partial normalization, which works well for most corpora.
The practical effect: if you have two documents that both mention your query term three times, the shorter one will score higher. This is usually what you want when documents vary significantly in length.
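A quick sketch of the length-normalized TF component makes the effect concrete. The document lengths and avgdl below are made up for illustration:

```python
def bm25_tf(f, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 TF component: saturation plus length normalization."""
    norm = 1 - b + b * (doc_len / avgdl)   # > 1 for long docs, < 1 for short
    return f * (k1 + 1) / (f + k1 * norm)

avgdl = 500
short = bm25_tf(3, 300, avgdl)      # f = 3 in a shorter-than-average doc
long_ = bm25_tf(3, 10_000, avgdl)   # same f = 3 in a very long doc
print(short, long_)                 # the short document scores higher
```

Setting b=0.0 in the call above makes the two values identical, which is exactly the "normalization disabled" behavior described earlier.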
The IDF Component
BM25 keeps the inverse document frequency concept from TF-IDF, but uses a specific formula derived from the probabilistic relevance framework:
IDF(q) = ln( (N - n(q) + 0.5) / (n(q) + 0.5) )
Where N is the total number of documents in the corpus and n(q) is the number of documents containing the query term.
The smoothing constants (+0.5) prevent division by zero and handle edge cases. There is also a subtle problem: terms that appear in more than half the corpus produce a negative IDF under this formula. Lucene’s implementation adds 1 inside the log to prevent negative IDF values from inverting the scoring of common terms.
Compared to classic TF-IDF’s log(N / n(q)), BM25’s IDF is derived from a log-odds ratio with probabilistic justification. In practice, the curves are similar, but the BM25 formulation is theoretically grounded in the Binary Independence Model.
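The negative-IDF edge case is easy to demonstrate. A small sketch comparing the raw probabilistic IDF with Lucene's +1 variant (the corpus sizes are illustrative):

```python
import math

def bm25_idf(N, n):
    """Probabilistic IDF: goes negative when a term is in > half the docs."""
    return math.log((N - n + 0.5) / (n + 0.5))

def lucene_idf(N, n):
    """Lucene's variant: the +1 inside the log keeps IDF non-negative."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 1000
print(bm25_idf(N, 5))     # rare term: large positive IDF
print(bm25_idf(N, 700))   # term in 70% of docs: negative
print(lucene_idf(N, 700)) # Lucene's fix: small but positive
```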
The Complete Formula
Putting it together, the BM25 score for a document D given a query Q with terms q1, q2, ..., qn is:
BM25(D, Q) = sum over each query term qi of:

                        f(qi, D) * (k1 + 1)
IDF(qi) * ──────────────────────────────────────────────
           f(qi, D) + k1 * (1 - b + b * |D| / avgdl)
BM25 is additive across query terms. Each query term contributes independently to the total score. This bag-of-words assumption means term ordering and proximity are ignored. “New York” and “York New” produce identical scores. This is a meaningful limitation that we will return to.
Worked Example
To make this concrete, consider a small corpus of three documents and the query “inverted index”:
D1 (120 tokens): "An inverted index maps terms to the documents that contain them. Inverted indexes are the core data structure of search engines."
D2 (15 tokens): "Inverted index: a data structure for fast retrieval."
D3 (800 tokens): "Search engines rely on many data structures. An inverted index is one of them. [... 790 more tokens of surrounding context ...]"
With default parameters k1 = 1.2, b = 0.75, and an avgdl of 312 tokens:
For the term “inverted”:
- D1 has f = 2, moderate length: moderate TF contribution.
- D2 has f = 1, much shorter than average: length normalization boosts it.
- D3 has f = 1, much longer than average: length normalization penalizes it.
D2, despite only one occurrence of “inverted,” will likely score higher than D3 with its one occurrence buried in 800 tokens of noise. D1 with two occurrences in a focused document will likely come out on top overall.
This is the behavior you want. D2 is a definition. D3 is a tangential mention in a long document. The formula reflects that.
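The whole example can be run end to end. A self-contained sketch, using padded token lists as stand-ins for the three documents (lengths approximate the ones in the text; a real analyzer would also lowercase and stem). D3 reliably lands at the bottom; whether D1 or D2 wins is sensitive to exactly how the stand-ins are padded and tokenized, which is why the text above hedges with "likely":

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Sum over query terms of IDF * saturated, length-normalized TF."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)                  # doc frequency
        idf = math.log(1 + (len(corpus) - n + 0.5) / (n + 0.5))  # Lucene variant
        f = doc.count(term)                                      # term frequency
        norm = 1 - b + b * len(doc) / avgdl                      # length norm
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score

# Padded stand-ins for D1 (120 tokens), D2 (15), D3 (800):
d1 = ["inverted", "index", "inverted"] + ["filler"] * 117
d2 = ["inverted", "index"] + ["pad"] * 13
d3 = ["inverted", "index"] + ["noise"] * 798
corpus = [d1, d2, d3]

for name, doc in zip(("D1", "D2", "D3"), corpus):
    print(name, round(bm25_score(["inverted", "index"], doc, corpus), 3))
```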
Tuning k1 and b
The defaults work well out of the box for general text search. But your corpus is not a general text corpus, and tuning matters more than most engineers realize.
k1
Increase k1 (toward 2.0) when:
- Documents are long, and term repetition is genuinely informative (legal documents, scientific papers, technical manuals)
- Users search with domain-specific jargon that naturally recurs throughout relevant documents.
Decrease k1 (toward 0.5) when:
- Documents are short (tweets, product titles, code identifiers)
- A single mention is as informative as ten mentions (e.g., a product description that names the SKU once)
- You are worried about adversarial keyword stuffing.
b
Set b closer to 1.0 when:
- Your documents span a very wide range of lengths, and you want to prevent long documents from dominating.
- Average document length is driven by padding or boilerplate rather than meaningful content.
- You are indexing heterogeneous content (short FAQs mixed with long technical articles)
Set b closer to 0.0 when:
- Document length is genuinely correlated with coverage and relevance (encyclopedic articles that are long because they are comprehensive)
- Documents are all roughly the same length.
- You are searching through code, where length carries semantic meaning.
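In Elasticsearch, both parameters can be overridden per index by configuring the default similarity (a sketch; the k1 and b values shown are illustrative, and similarity settings on an existing index can only be changed while the index is closed):

```
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "BM25",
          "k1": 1.4,
          "b": 0.6
        }
      }
    }
  }
}
```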
What BM25 Cannot Do
BM25 is a bag-of-words model. That phrase contains its entire set of limitations.
- It does not understand synonyms. To BM25, "heart attack" and "myocardial infarction" share nothing, and neither do "automobile accident" and "car crash".
- It does not understand word order: "New York" and "York New" produce identical BM25 scores.
- It does not understand context or intent. A user searching for "python" might want programming documentation, a natural history article, or a Monty Python sketch. BM25 cannot distinguish query intent; it can only rank by term statistics.
- It struggles with rare or out-of-vocabulary terms: a misspelling or a novel phrasing simply fails to match.
So, BM25 is excellent for keyword-heavy, factual queries where exact term matching is meaningful, and it degrades on semantic queries.
When To Use BM25 vs Alternatives
BM25 is the right default for lexical retrieval. The question is when lexical retrieval is the right choice at all.
Reach for BM25 when:
- Exact keyword matching is the primary use case (legal document search, code search by function name, product search by SKU)
- The corpus contains domain-specific jargon, product codes, or identifiers that embedding models cannot represent
- You need interpretable, auditable results (compliance environments, debugging production search)
- You need fast retrieval with no GPUs and minimal memory overhead
Augment BM25 with dense retrieval (semantic) when:
- Users phrase queries in natural language with varying vocabulary (customer support, knowledge base search)
- You have multilingual content where a user might search in one language for content in another.
- Synonymy and paraphrasing are common in your domain (medical search, legal search, research literature)
- You are building a RAG pipeline where recall matters more than precision.
BM25 is explicitly the wrong tool when the search task is primarily semantic, and your queries rarely use the same words as your documents. In those cases, BM25 provides recall for edge cases but should not be the primary retrieval mechanism.
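When combining BM25 with a dense retriever, one common fusion strategy is reciprocal rank fusion (RRF), which merges the two ranked lists without having to calibrate their incompatible score scales. A minimal sketch (the doc ids are illustrative; k=60 is the constant conventionally used for RRF):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids: each doc scores sum of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d3", "d1", "d7"]   # lexical ranking (illustrative ids)
dense_results = ["d1", "d9", "d3"]  # semantic ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
```

Documents that appear high in both lists ("d1" here) float to the top, which is the behavior you want from a hybrid pipeline.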
BM25 in Elasticsearch
Elasticsearch computes IDF per-shard, not per-index. In a distributed cluster, each shard sees only its portion of the corpus when computing N and n(q). This means IDF values can vary across shards, producing slightly inconsistent scores for the same document depending on which shard it lives on.
For most use cases, this is fine, but if you need globally consistent IDF, use search_type=dfs_query_then_fetch, which forces a global term statistics collection step before scoring.
Field lengths in Elasticsearch’s BM25 are measured in tokens. So, the choice of analyzer affects what counts as a token and therefore affects length normalization. A language analyzer that applies stemming and stopword removal will produce shorter effective documents than a standard tokenizer. Tune analyzer choice before you tune b.
In Elasticsearch, document lengths are encoded in a single byte using a logarithmic encoding scheme. This means the stored length is an approximation, not the exact token count. For most cases, this is a negligible error, but it is worth knowing when you are trying to understand why BM25 scores do not exactly match hand calculations.
You can inspect the explain output to see exactly how a score was computed:
GET /my_index/_search
{
  "explain": true,
  "query": {
    "match": {
      "body": "inverted index"
    }
  }
}
The _explanation field in the response shows the IDF, TF saturation, and length normalization components individually. This is genuinely useful for debugging unexpected rankings and one of BM25’s practical advantages over black-box scoring systems.
Summary
BM25 improves on TF-IDF through two mechanisms: a saturation function that prevents repeated terms from scoring linearly, and document length normalization that adjusts for corpus-wide length variance.
The two free parameters - k1 (saturation speed) and b (normalization strength) - offer meaningful tuning handles with sensible defaults.
BM25 is a bag-of-words model and cannot handle synonyms, word order, or semantic intent. In modern systems, it functions as the fast, interpretable, exact-matching leg of a hybrid retrieval pipeline, complementing dense vector search that handles semantic queries.
Its greatest practical advantage is debuggability: every score can be traced to specific term statistics, which matters more than most engineers expect when something goes wrong in production.