BLEU score — AI glossary

BLEU (Bilingual Evaluation Understudy) is an automated metric that measures how closely a machine-generated text matches one or more human-written reference texts. It computes clipped ("modified") n-gram precision for unigrams through 4-grams, combines the four precisions with a geometric mean, and multiplies the result by a brevity penalty that discourages overly short outputs. Scores are reported on a 0 to 1 or 0 to 100 scale, with higher scores indicating a closer match. BLEU is widely used in machine translation and other text generation tasks, but it measures only surface-level n-gram overlap and does not directly capture meaning or fluency. Operators encounter BLEU when evaluating model output quality, especially during fine-tuning or benchmarking against standard test sets such as those from WMT.
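
The calculation described above can be sketched in a few lines of Python. This is a minimal, unsmoothed, sentence-level illustration that assumes whitespace tokenization; the function names (ngrams, sentence_bleu) are made up for this entry, and real evaluations should rely on a standard tool rather than this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU on whitespace tokens, on a 0..1 scale."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its highest count in any reference,
        # so repeating a matching word does not keep earning credit.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: compare the candidate length to the closest reference length.
    closest_ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    if len(cand) > closest_ref_len:
        bp = 1.0
    else:
        bp = math.exp(1 - closest_ref_len / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note that the candidate "the cat sat on the mat" scored against the reference "the cat is on the mat" shares no 4-gram with it, so this unsmoothed version returns 0 despite the obvious similarity. That is one reason BLEU is normally computed over a whole test set rather than per sentence.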

Practical example

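When fine-tuning a small translation model (e.g., the distilled 600M NLLB-200 checkpoint) on a consumer GPU like an RTX 4090, an operator might run evaluation on a held-out test set. A BLEU score of 30 on an English-to-French task suggests moderate quality. It does not mean that 30% of n-grams match the references: the score is a geometric mean of 1- to 4-gram precisions scaled by the brevity penalty, and it depends on tokenization and on the test set, so scores are comparable only under the same setup. A larger model like the 3.3B NLLB-200 checkpoint might score 40+ on the same set. The operator uses BLEU to help decide whether the fine-tuned model is worth deploying, but also checks human evaluation, because BLEU rewards surface overlap: it can penalize valid paraphrases and give credit to output that shares many n-grams with the reference while still being wrong.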
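
A score near 30 combines four n-gram precisions and a brevity penalty, so it is not a direct match percentage. The small sketch below makes this concrete; the precision values are hypothetical and chosen only for illustration, not taken from any real model.

```python
import math

# Hypothetical modified n-gram precisions for 1- to 4-grams (illustrative only).
precisions = [0.62, 0.37, 0.24, 0.16]
brevity_penalty = 1.0  # assumes the output is at least as long as the reference

# BLEU is the brevity penalty times the geometric mean of the precisions.
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = 100 * brevity_penalty * geometric_mean

print(f"unigram precision: {precisions[0]:.0%}, BLEU: {bleu:.1f}")
```

With these assumed values, well over half of the individual words match, yet the combined score lands between 30 and 31 because the higher-order precisions pull the geometric mean down.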

Workflow example

With the Hugging Face evaluate library (a separate package from Transformers), an operator can compute BLEU in Python: import evaluate; bleu = evaluate.load('bleu'); results = bleu.compute(predictions=['the cat sat on the mat'], references=[['the cat is on the mat']]). The score is returned in results['bleu'] on a 0 to 1 scale. llama.cpp has no built-in BLEU scoring, so operators often pipe model output to a Python script that uses sacrebleu for standardized scoring. When benchmarking a model on a WMT test set, the operator runs inference on the source side, collects one translation per line, then runs sacrebleu reference.txt -i output.txt -tok intl -b to print the score. Here -b is a switch for score-only output and takes no value; the number of decimal places is set separately with -w.
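
The file-based scoring step can also be done from Python. The sketch below assumes sacrebleu is installed and that output.txt and reference.txt hold one segment per line in the same order; the helper names (read_segments, load_aligned, score_files) are hypothetical and exist only for this example.

```python
from pathlib import Path


def read_segments(path):
    """Read one segment per line, keeping empty lines so alignment is preserved."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_aligned(hyp_path, ref_path):
    """Load hypotheses and references, failing early if the line counts differ."""
    hyps = read_segments(hyp_path)
    refs = read_segments(ref_path)
    if len(hyps) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(hyps)} hypotheses vs {len(refs)} references"
        )
    return hyps, refs


def score_files(hyp_path, ref_path, tokenize="intl"):
    """Return corpus-level BLEU (0-100 scale) for two line-aligned files."""
    import sacrebleu  # third-party dependency: pip install sacrebleu

    hyps, refs = load_aligned(hyp_path, ref_path)
    # sacrebleu takes a list of reference streams, each aligned with the hypotheses.
    return sacrebleu.corpus_bleu(hyps, [refs], tokenize=tokenize).score
```

A call such as score_files("output.txt", "reference.txt") would then return the corpus-level score. Checking line counts first matters because a single dropped or extra line shifts every later pair and silently depresses the score.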