Evaluation metrics

Perplexity

Perplexity measures how well a language model predicts a sequence of tokens. Lower perplexity means the model assigns higher probability to the text it is tested on. It is calculated as the exponentiated average negative log-likelihood per token over the test set. For operators, perplexity is useful for comparing different models or quantization levels on the same dataset: a model with lower perplexity is generally better at generating coherent text. However, perplexity says nothing about operational characteristics such as inference speed or VRAM usage.

Deeper dive

Perplexity is derived from the cross-entropy loss of a model on a given text corpus. For a model that assigns probability p(x_i | x_{<i}) to each token x_i given the tokens before it, perplexity over N tokens is 2^{-(1/N) Σ log₂ p(x_i | x_{<i})}. The base of the logarithm does not matter as long as the exponentiation uses the same base, so this equals the exponential of the average negative log-likelihood in nats. In practice, lower perplexity indicates that the model assigns higher probability to the actual tokens, meaning it is less 'perplexed' by the data. It is commonly used to evaluate language models on benchmarks like WikiText or LAMBADA. However, perplexity has limitations: a model that has seen the test set during training will score artificially low, and the metric does not directly capture aspects like factual accuracy or fluency. For operators, perplexity is most useful when comparing models of similar architecture that share a tokenizer (the per-token average depends on how the text is split into tokens) or when testing the impact of quantization on model quality. A rise in perplexity after quantization may indicate quality loss, but small changes (e.g., <0.5) are often imperceptible in practice.
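The definition above can be checked with a few lines of Python. This is a minimal sketch, not how any particular tool computes the score: the function names and the probability lists are made up for illustration, and a real evaluation would take the per-token probabilities from the model itself.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each actual token.

    Uses natural logs: exp of the average negative log-likelihood (nats per token).
    """
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

def perplexity_base2(token_probs):
    """Same quantity written with base-2 logs, as in the formula in the text."""
    n = len(token_probs)
    avg_bits = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** avg_bits

# Hypothetical per-token probabilities, chosen only to illustrate the behavior.
uniform = [1 / 8] * 20                # model guesses uniformly among 8 tokens
confident = [0.9, 0.8, 0.95, 0.7]     # high probability on the actual tokens
hesitant = [0.2, 0.1, 0.3, 0.05]      # low probability on the actual tokens

print(perplexity(uniform))            # approximately 8: "choosing among 8 options"
print(perplexity(confident), perplexity(hesitant))
print(perplexity_base2(hesitant))     # matches perplexity(hesitant) up to rounding
```

The uniform case gives a useful intuition: a perplexity of about 8 means the model is, on average, as uncertain as if it were picking among 8 equally likely tokens.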

Practical example

When quantizing Llama 3.1 8B from FP16 to Q4_K_M using llama.cpp, the perplexity on WikiText-2 might increase from ~5.5 to ~5.7. This 0.2 increase is generally considered negligible for most use cases. In contrast, quantizing to Q2_K might raise perplexity to ~6.5, which can result in noticeably worse output quality. Operators can run ./perplexity -m model.gguf -f wikitext-2-raw.txt in llama.cpp (the binary is named llama-perplexity in newer builds) to measure perplexity on their own test set.
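To put the figures above on a common scale, it can help to look at the relative increase over the FP16 baseline as well as the absolute one. The sketch below uses the illustrative numbers from this example, not measured results, and the helper name is hypothetical.

```python
def relative_ppl_increase(baseline, quantized):
    """Fractional increase in perplexity relative to the baseline model."""
    return (quantized - baseline) / baseline

# Illustrative perplexities from the example above (not measurements).
fp16 = 5.5
candidates = {"Q4_K_M": 5.7, "Q2_K": 6.5}

for name, ppl in candidates.items():
    absolute = ppl - fp16
    relative = relative_ppl_increase(fp16, ppl)
    print(f"{name}: +{absolute:.1f} absolute, {relative:.1%} relative")
```

With these numbers, Q4_K_M comes out at roughly a 3.6% increase and Q2_K at roughly 18%, which matches the qualitative reading in the text.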

Workflow example

After downloading a new GGUF model, an operator may run a perplexity test to verify its quality. Using llama.cpp, the command ./perplexity -m model.gguf -f test.txt outputs the perplexity score. If the score is significantly higher than expected (e.g., >10 for a 7B model), it may indicate a corrupt download or poor quantization. Operators also use perplexity to compare quantization levels: if Q4_K_M scores 5.7 and Q8_0 scores 5.5, the Q4_K_M model might be preferred for its lower VRAM usage, provided the quality difference is acceptable.
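The sanity check described above could be scripted roughly as follows. This is a sketch under stated assumptions: it assumes the tool's output contains a line of the form "Final estimate: PPL = <value> +/- <error>", which should be confirmed against the output of the llama.cpp build in use. The function names, the sample string, and the 0.5 tolerance are hypothetical choices for illustration.

```python
import re

# Assumed output format; verify against your llama.cpp build before relying on it.
FINAL_PPL = re.compile(r"Final estimate: PPL = ([0-9.]+)")

def parse_final_ppl(output):
    """Return the final perplexity found in the tool output, or None if absent."""
    match = FINAL_PPL.search(output)
    return float(match.group(1)) if match else None

def check_model(output, expected, tolerance=0.5):
    """Flag a model whose perplexity is well above the expected value."""
    ppl = parse_final_ppl(output)
    if ppl is None:
        return "unknown: no perplexity found in output"
    if ppl > expected + tolerance:
        return f"suspicious: PPL {ppl:.2f} is well above expected {expected:.2f}"
    return f"ok: PPL {ppl:.2f} is close to expected {expected:.2f}"

# Made-up sample output line, used only to exercise the parser.
sample_output = "Final estimate: PPL = 5.7123 +/- 0.0321"
print(check_model(sample_output, expected=5.5))
```

In a real workflow the output string would come from capturing the perplexity run's output, for example with subprocess.run, and the expected value would come from a published or previously measured score for the same model and test file.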

Reviewed by Fredoline Eruo. See our editorial policy.
