RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Evaluation metrics

Perplexity

Perplexity is a metric that measures how well a language model predicts a sequence of tokens. Lower perplexity means the model assigns higher probability to the tokens that actually occur, so its predictions are both more confident and more accurate. It is calculated as the exponentiated average negative log-likelihood per token of the test set. For operators, perplexity is useful for comparing different models or quantization levels on the same dataset: a model with lower perplexity is generally better at modeling that text, which usually goes along with more coherent generation. Scores are only directly comparable when the tokenizer, dataset, and context length match. Perplexity also says nothing about operational characteristics such as speed or VRAM usage.
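The calculation described above can be sketched in a few lines. This is a minimal illustration, assuming the per-token log-probabilities (natural log) are already available from some inference backend; the function name `perplexity` and the sample values are hypothetical, not part of any specific tool.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Assumption: each entry is ln p(token | preceding tokens).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token (in nats).
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    # Exponentiate to get perplexity.
    return math.exp(avg_nll)

# Illustrative input: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among 4 options.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))  # approximately 4.0
```

One way to read the result: a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k tokens at each step.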

Deeper dive

Perplexity is derived from the cross-entropy loss of a model on a given text corpus. For a model that assigns probability p(x_i | x_<i) to each token x_i given the tokens before it, the perplexity over N tokens is 2^{-(1/N) Σ_{i=1}^{N} log₂ p(x_i | x_<i)}, or equivalently the exponential of the average negative log-likelihood measured in nats. Lower perplexity indicates that the model assigns higher probability to the actual tokens, meaning it is less 'perplexed' by the data. It is commonly used to evaluate language models on benchmarks like WikiText or LAMBADA. However, perplexity has limitations: it can be artificially lowered when the test set has leaked into the training data, and it does not capture aspects like factual accuracy or task performance. For operators, perplexity is most useful when comparing models that share a tokenizer and architecture, or when testing the impact of quantization on model quality. A rise in perplexity after quantization indicates some quality loss, but small increases (e.g., <0.5) are often imperceptible in practice.
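As a worked check of the formula, take a hypothetical three-token sequence whose tokens receive probabilities 1/2, 1/4, and 1/8 (values chosen only for illustration), writing p(x_i | x_<i) for the probability of token i given the preceding tokens:

```latex
\begin{aligned}
\mathrm{PPL}
  &= 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 p(x_i \mid x_{<i})}
   = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i})\Big)
   = \Big(\prod_{i=1}^{N} p(x_i \mid x_{<i})\Big)^{-1/N} \\
  &= 2^{-\frac{1}{3}\,(-1-2-3)} = 2^{2} = 4
\end{aligned}
```

The base of the logarithm cancels as long as the same base is used for the exponent, so a tool that reports cross-entropy in nats and one that reports it in bits give the same perplexity.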

Practical example

When quantizing Llama 3.1 8B from FP16 to Q4_K_M using llama.cpp, the perplexity on WikiText-2 might increase from ~5.5 to ~5.7. This 0.2 increase is generally considered negligible for most use cases. In contrast, quantizing to Q2_K might raise perplexity to ~6.5, an increase of about 1.0, which can result in noticeably worse output quality. Operators can run ./perplexity -m model.gguf -f wikitext-2-raw.txt in llama.cpp (the binary is named llama-perplexity in newer builds) to measure perplexity on their own test set.
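To put such numbers side by side, it can help to express each quantized score as an absolute and relative change from the unquantized baseline. The sketch below uses the approximate figures from the example above as placeholder inputs, not measured results; the helper name `ppl_delta` is hypothetical.

```python
def ppl_delta(baseline, quantized):
    """Return (absolute, relative) perplexity change versus a baseline."""
    abs_delta = quantized - baseline
    rel_delta = abs_delta / baseline
    return abs_delta, rel_delta

# Placeholder values taken from the example figures, not from a real run.
baseline_fp16 = 5.5
quantized_scores = {"Q4_K_M": 5.7, "Q2_K": 6.5}

for name, ppl in quantized_scores.items():
    abs_delta, rel_delta = ppl_delta(baseline_fp16, ppl)
    print(f"{name}: {abs_delta:+.2f} PPL ({rel_delta:+.1%})")
# Q4_K_M: +0.20 PPL (+3.6%)
# Q2_K: +1.00 PPL (+18.2%)
```

The relative change is often easier to compare across models, since the same absolute increase matters more when the baseline perplexity is low.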

Workflow example

After downloading a new GGUF model, an operator may run a perplexity test to verify its quality. Using llama.cpp, the command ./perplexity -m model.gguf -f test.txt (llama-perplexity in newer builds) prints the perplexity score when it finishes. If the score is significantly higher than expected (e.g., >10 for a 7B model), it may indicate a corrupt download or poor quantization. Operators also use perplexity to compare quantization levels measured on the same test file: a Q4_K_M model at perplexity 5.7 might be preferred over Q8_0 at 5.5 for its lower VRAM usage if the quality difference is acceptable.
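The sanity check in this workflow can be scripted once the tool's output has been saved to a file or captured as text. The sketch below assumes the log contains a line of the form "Final estimate: PPL = 5.7012 +/- 0.03412"; that format is an assumption about the llama.cpp build in use and should be checked against real output. The function names and the sample log are hypothetical, and the default ceiling of 10 is the rough threshold mentioned above.

```python
import re

# Assumed output format; verify against the log of your own llama.cpp build.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9]+(?:\.[0-9]+)?)")

def extract_ppl(log_text):
    """Return the final perplexity found in log_text, or None if absent."""
    match = FINAL_RE.search(log_text)
    return float(match.group(1)) if match else None

def looks_suspicious(ppl, ceiling=10.0):
    """Flag a missing score or one above the chosen ceiling."""
    return ppl is None or ppl > ceiling

# Hypothetical captured output, for illustration only.
sample_log = "perplexity: calculating\nFinal estimate: PPL = 5.7012 +/- 0.03412\n"

score = extract_ppl(sample_log)
print(score, looks_suspicious(score))
```

A missing score is treated as suspicious here so that a crashed or truncated run is not mistaken for a healthy model.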

Reviewed by Fredoline Eruo. See our editorial policy.
