RUNLOCALAI · v38


© 2026 runlocalai.co · Independently operated
Evaluation metrics

BLEU score

BLEU (Bilingual Evaluation Understudy) is an automated metric that measures how closely machine-generated text matches one or more human-written reference texts. It computes a clipped n-gram precision for each order from unigrams up to 4-grams (the share of candidate n-grams that also appear in a reference), combines the four precisions with a geometric mean, and multiplies the result by a brevity penalty that discourages overly short outputs. Scores range from 0 to 1 and are often reported on a 0 to 100 scale, with higher scores indicating a closer match. BLEU is widely used in machine translation and text generation tasks, but it measures only surface-level n-gram overlap and does not capture semantic meaning or fluency. Operators encounter BLEU when evaluating model output quality, especially during fine-tuning or benchmarking against standard datasets like WMT.
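The pieces of the definition (clipped n-gram counts, geometric mean, brevity penalty) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, tokenization is plain whitespace splitting, and no smoothing is applied, so scores will not match standardized tools.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU in [0, 1] for one candidate against one or more references.

    Assumption: whitespace tokenization, which real toolkits do not use.
    """
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count at its highest count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(matched / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    cand_len = len(cand)
    ref_len = min((len(r) for r in refs), key=lambda n_ref: (abs(n_ref - cand_len), n_ref))
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

    # Geometric mean of the precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

With this sketch, a short candidate that shares no 4-gram with its reference scores exactly 0, which is one reason BLEU is normally reported over a whole test set rather than per sentence.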

Practical example

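When fine-tuning a small translation model (e.g., NLLB-600M) on a consumer GPU like an RTX 4090, an operator might run evaluation on a held-out test set. A BLEU score of 30 on an English-to-French task indicates moderate quality. The score is not a percentage of matching n-grams: it is the geometric mean of the 1-gram to 4-gram precisions, scaled by the brevity penalty, so a model can match well over half of its unigrams and still score 30. Compare this to a larger model like NLLB-3.3B, which might score 40+ on the same set. The operator uses BLEU to decide whether the fine-tuned model is worth deploying, but also checks human evaluation, because BLEU rewards surface overlap: a correct paraphrase can score low, and output that reuses common phrases from the references can score higher than its quality deserves.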
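To see how a score near 30 can arise, here is a small arithmetic sketch. The four precision values are invented for illustration and do not come from any real NLLB run; the helper name is also hypothetical.

```python
import math


def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Combine per-order n-gram precisions into a 0-100 BLEU score."""
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return 100 * brevity_penalty * geo_mean


# Hypothetical clipped precisions for 1-grams through 4-grams (made-up values).
example_precisions = [0.60, 0.35, 0.22, 0.15]
score = bleu_from_precisions(example_precisions)
print(f"BLEU = {score:.1f}")
```

Under these assumed values the score comes out just under 29 even though 60% of unigrams match, because the lower 3-gram and 4-gram precisions pull the geometric mean down.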

Workflow example

In the Hugging Face ecosystem, an operator can compute BLEU with the evaluate library (a separate package from Transformers): import evaluate; bleu = evaluate.load('bleu'); results = bleu.compute(predictions=['the cat sat on the mat'], references=[['the cat is on the mat']]). llama.cpp has no built-in BLEU, so operators often pipe model output to a Python script that uses sacrebleu for standardized scoring. When benchmarking a model on a WMT test set, the operator runs inference on the test set, collects the translations one per line, then runs sacrebleu reference.txt -i output.txt -tok intl -b to get the score. Here -tok intl selects sacrebleu's international tokenizer, and -b (--score-only) is a flag that takes no value and prints only the number.
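The "pipe output to a Python script" step might look like the sketch below. It assumes sacrebleu is installed (pip install sacrebleu) and that output.txt and reference.txt hold one segment per line in the same order; the helper names are hypothetical, and the import is deferred so the file loader can be used on its own.

```python
def read_aligned(hyp_path, ref_path):
    """Load hypothesis and reference files (one segment per line) and check alignment."""
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.rstrip("\n") for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.rstrip("\n") for line in f]
    if len(hyps) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(hyps)} hypotheses vs {len(refs)} references"
        )
    return hyps, refs


def corpus_bleu_from_files(hyp_path, ref_path, tokenize="intl"):
    """Corpus-level BLEU on a 0-100 scale, computed with sacrebleu."""
    import sacrebleu  # deferred import: only needed when scoring

    hyps, refs = read_aligned(hyp_path, ref_path)
    # sacrebleu takes a list of reference streams; here there is a single stream.
    return sacrebleu.corpus_bleu(hyps, [refs], tokenize=tokenize).score


# Example call, assuming both files exist:
# print(corpus_bleu_from_files("output.txt", "reference.txt"))
```

Checking line counts before scoring is a deliberate choice here: a dropped or extra line shifts every later pair out of alignment and produces a low score that looks like a model problem.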

Reviewed by Fredoline Eruo. See our editorial policy.
