BM25 Demystified: A Simple, Robust Baseline for Modern Retrieval

BM25 remains the go-to search baseline: this post shows you how its TF saturation, length normalization, and probabilistic foundation keep it robust.

BM25 is the classic ranking formula from the 1990s. I used to assume it had long been overshadowed by today's billion-parameter transformers, yet search results are still sorted by BM25. Why does this decades-old algorithm remain relevant in the middle of the neural-network boom? Let's find out.

From Raw Counts to TF-IDF

Early search engines relied on word frequency to decide relevance. The assumption was simple: the more times a query term appears in a page, the more that page must matter. Researchers soon proposed a clever weighting formula called TF-IDF, built from two ideas:

  • Term Frequency (TF): the more a word appears in a page, the more that page probably cares about it.
  • Inverse Document Frequency (IDF): the rarer that word is across the document collection, the more weight it deserves when it does show up.

Put TF and IDF together and you get a single weight for each word-document pair. Formally,

$$ \text{score}(w,d) = \text{TF}(w,d) \times \text{IDF}(w) $$


In practice, this weight is computed for every query word in every document; the weights are summed, and documents are ranked by the total score. The product sharply down-weights common words and boosts documents that feature rarer, more informative terms.
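As a concrete sketch, here is how that scoring loop might look in Python (hypothetical helper names; this uses raw-count TF and the plain $\log(N/\text{df})$ IDF variant, both of which real systems often replace):

```python
import math
from collections import Counter

def idf(word, corpus):
    """Rarer words across the collection get larger weights."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf_score(query, doc, corpus):
    """Sum TF(w, d) * IDF(w) over the query words."""
    tf = Counter(doc)
    return sum(tf[w] * idf(w, corpus) for w in query)

corpus = [
    "data science is fun".split(),
    "sports news daily".split(),
]
print(tfidf_score("data science".split(), corpus[0], corpus))  # ~1.386
```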

From TF-IDF to BM25

Despite its improvement, TF-IDF has two critical drawbacks:

  • No penalty for excessive repetition. The score increases linearly with term frequency, so repeating a keyword a thousand times inflates its contribution a thousandfold.
  • No adjustment for document length. Long documents contain more words. Without length normalization, they receive an unfair advantage.

To mitigate these two weaknesses, researchers developed BM25 ("best match 25"), a ranking function that extends TF-IDF with two crucial controls:

  • Term Frequency Saturation. This curbs keyword stuffing: each repetition adds progressively less weight, so spamming a term stops boosting the score.
  • Length normalization. This penalizes excessive verbosity, so concise pages are ranked fairly.

How the controls are applied

BM25 replaces linear TF with

$$ \frac{\text{TF}(w,d)\cdot(k_1+1)}{\text{TF}(w,d)+k_1} $$
where $k_1 \approx 1.2$ decides how quickly the curve flattens.
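A quick toy calculation shows the flattening (plain Python, no library needed):

```python
# With k1 = 1.2, each extra occurrence of a term adds less and less,
# unlike the linear TF used by TF-IDF.
k1 = 1.2
for tf in [1, 2, 5, 10, 100, 1000]:
    saturated = tf * (k1 + 1) / (tf + k1)
    print(f"tf={tf:5d}  saturated={saturated:.3f}")
# tf=1 gives 1.000; tf=1000 gives ~2.197 -- it never exceeds k1 + 1 = 2.2.
```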

It then applies length normalization by scaling the $k_1$ in the denominator with the factor:
$$ 1 - b + b \cdot \frac{|D|}{\text{avgdl}}$$
where $|D|$ is the document length, $\text{avgdl}$ is the average length in the collection, and $b \approx 0.75$ controls how strong the length penalty is.

Putting everything together, the full BM25 formula is:

$$\text{BM25}(Q,D) = \sum_{w\in Q}\text{IDF}(w)\cdot\frac{\text{TF}(w,D)\cdot(k_1+1)}{\text{TF}(w,D)+k_1\cdot(1 - b + b \cdot \frac{|D|}{\text{avgdl}})}$$

  • Q is a query ($Q = w_1, w_2, \cdots, w_k$)
  • D is a document
  • $\text{IDF}(w)$ is the inverse document frequency of $w$

With these improvements, BM25 rewards informative terms, dampens over-repetition, and evens out length advantages.
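Here is the whole formula as a compact Python sketch (hypothetical names; it uses the "+1 inside the log" IDF variant, common in practice, to keep weights non-negative):

```python
import math
from collections import Counter

# Production engines precompute these statistics in an inverted index
# rather than scanning the corpus per query.

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for w in query:
        df = sum(1 for d in corpus if w in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[w] * (k1 + 1) / (tf[w] + norm)
    return score
```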

BM25 vs. TF-IDF: A Toy Example

Let's compare BM25's effectiveness against TF-IDF with a simple example. Query: "data science". Assume a corpus of 500 documents and these five documents:

  1. Doc 1
    Data science is an interdisciplinary field that uses scientific methods for extracting insights from data.
  2. Doc 2
    data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data
  3. Doc 3
    Scientific experiments in a lab setting explore hypothesis testing.
  4. Doc 4
    Sports news and fitness routines keep you in shape.
  5. Doc 5
    The science museum displays historical artifacts, and the data about visitor numbers is logged daily.

Docs 3 and 4 are irrelevant; Doc 2 is spam; Docs 1 and 5 mention both "data" and "science".
Scoring yields:

Document ID    BM25       TF-IDF
Doc 1          13.5796    0.3855
Doc 2          11.6369    0.6292
Doc 3           0.0000    0.0000
Doc 4           0.0000    0.0000
Doc 5          11.4628    0.2698
  • BM25 ranking: Doc 1, Doc 2, Doc 5, Doc 3, Doc 4
  • TF-IDF ranking: Doc 2, Doc 1, Doc 5, Doc 3, Doc 4

TF-IDF overscores spammy repetition: Doc 2's endless "data" earns a higher TF-IDF score (0.6292) than the relevant Doc 1 (0.3855). BM25's saturation and length penalty keep Doc 1 on top and push Doc 2 down.

Key takeaways:

  • TF-IDF overscores spammy repetition. It can't cap the impact of ultra-frequent terms.
  • BM25 enforces term-frequency saturation and length normalization. It limits returns from repetition and penalizes verbose, spammy docs, yielding more robust rankings.
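You can rerun the toy example with the `bm25_score` sketch from above. The exact numbers (and even Doc 2's position relative to Doc 5) depend on tokenization and the IDF variant, so they won't match the table digit-for-digit:

```python
docs = [
    "data science is an interdisciplinary field that uses scientific "
    "methods for extracting insights from data",
    "data " * 41,  # Doc 2's spam (repetition count approximate)
    "scientific experiments in a lab setting explore hypothesis testing",
    "sports news and fitness routines keep you in shape",
    "the science museum displays historical artifacts and the data "
    "about visitor numbers is logged daily",
]
corpus = [d.lower().split() for d in docs]
query = "data science".split()
for i, doc in enumerate(corpus, 1):
    print(f"Doc {i}: BM25 = {bm25_score(query, doc, corpus):.4f}")
```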

Probabilistic Foundation of BM25

At first glance, BM25 might look like a more elaborate TF-IDF. But it actually has a solid probabilistic foundation, rooted in early retrieval theory. One of its closest relatives is the Query Likelihood Model (QLM).

Query Likelihood Model (QLM)

In the Query Likelihood Model (QLM), each document $D$ is treated as a simple unigram language model. The maximum-likelihood estimate for a word $w$ in document $D$ is:

$$ P(w|D) = \frac{\text{tf}(w,D)}{|D|} $$
For query $Q = \{q_1, q_2, \cdots, q_k\}$, we score documents by how likely they are to generate $Q$. In other words, we treat a document as a bag of words and blindly pick one word from the bag; a document scores higher if the words we are likely to draw are the ones found in the query.
Hence, the score can be defined as:

$$ \text{Score}(Q, D) = P(Q|D) = \prod_{w \in Q}P(w|D) $$
Since any query term missing from the document would drive the whole product to zero, we extend QLM with Dirichlet prior smoothing:

$$ P(w|D) = \frac{\text{tf}(w,D)+\mu\cdot P(w|C)}{|D|+\mu}$$
where $P(w|C)$ is the term's relative frequency in the entire document collection, and $\mu$ controls how much we trust the collection over the document.
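A direct translation into code might look like this (a sketch with hypothetical names; $\mu = 2000$ is a commonly cited default):

```python
import math
from collections import Counter

def qlm_score(query, doc, corpus, mu=2000.0):
    """Log query likelihood with Dirichlet smoothing."""
    tf = Counter(doc)
    total = sum(len(d) for d in corpus)
    score = 0.0
    for w in query:
        # P(w|C): the word's relative frequency in the whole collection.
        # The tiny floor guards against query words unseen anywhere.
        p_wc = max(sum(d.count(w) for d in corpus) / total, 1e-12)
        score += math.log((tf[w] + mu * p_wc) / (len(doc) + mu))
    return score
```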

Connecting BM25 and Dirichlet QLM

Although their formulas differ, BM25 can be viewed as a clever approximation of smoothed QLM. Here is a sketch of the connection.

Start from the likelihood:
$$ P(Q|D) = \prod_{w\in Q}\frac{\text{tf}(w,D)+\mu\cdot P(w|C)}{|D|+\mu}$$
Take logs and drop constants that do not affect ranking:

$$ \sum_{w \in Q}\log\big(\text{tf}(w,D)+\mu \cdot P(w|C)\big) - |Q|\log(|D|+\mu)$$

The term $|Q|\log(|D|+\mu)$ is a length penalty: the longer the document, the larger the amount subtracted.
Next, factor out the background weight:
$$ \text{tf}(w,D)+\mu \cdot P(w|C) = \mu \cdot P(w|C)\big[1+\frac{\text{tf}(w,D)}{\mu \cdot P(w|C)}\big]$$
Dropping $\log(\mu \cdot P(w|C))$, which does not depend on the document and therefore cannot affect the ranking, leaves:
$$ \log\big[1+\frac{\text{tf}(w,D)}{\mu \cdot P(w|C)}\big]$$

The $\frac{1}{P(w|C)}$ factor boosts rare words, and the $\text{tf}(w,D)$ inside the log gives term-frequency saturation. We can now compare this expression with the BM25 formula.

Let's revisit BM25's explicit form. BM25 is written as a product of two parts:

  • IDF boost: $$ \text{IDF}(w) = \log\left(\frac{N-\text{df}(w)+0.5}{\text{df}(w)+0.5}\right) $$
  • TF saturation and length normalization
    $$ \frac{(k_1+1)\text{tf}(w,D)}{\text{tf}(w,D)+k_1\cdot(1-b+b\frac{|D|}{\text{avgdl}})}$$
    This curve rises quickly for small term frequency and levels off for large term frequency, with $b$ controlling document-length penalty.

When we compare QLM with BM25, we have the following parameter mapping:

  • The log saturation in QLM $\approx$ the TF curve in BM25
  • $\mu$ in QLM $\approx$ $k_1\cdot(1-b+b\frac{|D|}{\text{avgdl}})$ in BM25
  • $\frac{1}{P(w|C)}$ in QLM shows up as IDF in BM25
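The mapping can be checked numerically. Here is a small side-by-side comparison of the two saturation curves for a single term, using arbitrary toy parameter values:

```python
import math

# K plays the role of k1*(1 - b + b*|D|/avgdl), and mu_pwc plays the
# role of mu * P(w|C); both set to toy values for illustration.
k1, K, mu_pwc = 1.2, 2.0, 2.0
for tf in [0, 1, 2, 5, 10, 50]:
    qlm = math.log(1 + tf / mu_pwc)
    bm25 = tf * (k1 + 1) / (tf + K)
    print(f"tf={tf:3d}  qlm={qlm:.3f}  bm25={bm25:.3f}")
# Both curves rise steeply at first and flatten as tf grows; BM25 is
# bounded by k1 + 1 = 2.2, while the log keeps growing, slowly.
```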

Key takeaways

  • TF saturation: diminishing returns as term frequency increases
  • Rarity boost: rare words get higher weight
  • Length normalization: penalizes overly long documents

BM25 thus has a direct connection to a probabilistic retrieval model. It also exposes two intuitive parameters, $k_1$ and $b$, which make it both fast and easy to tune compared to the single pseudo-count $\mu$ in QLM.

Closing

In an era dominated by massive neural models, it’s easy to overlook the enduring power of well-tuned classics. BM25 strikes the perfect balance between simplicity, efficiency, and effectiveness by enforcing term-frequency saturation, applying sensible length normalization, and standing on a firm probabilistic footing. Whether you’re running a small web search or powering at-scale retrieval in a billion-document index, BM25 remains a go-to baseline: fast to compute, easy to tune, and surprisingly robust against both keyword stuffing and varied document lengths. As you explore more advanced retrieval techniques, BM25 may not grab headlines like the latest transformer, but it continues to set the bar for best match.