Accuracy
Accuracy measures how often a model's predictions match the ground truth, usually expressed as a percentage (e.g., 95% means 95 out of 100 predictions are correct). In local AI, it is a key metric for evaluating a model on tasks with checkable answers, such as classification or question answering. It matters because quantization can lower accuracy, often by roughly 1-5 percentage points depending on the method and bit-width, so operators must weigh accuracy against VRAM usage and inference speed.
Deeper dive
Accuracy is defined as (number of correct predictions) / (total predictions). It is straightforward for tasks with clear right/wrong answers, like image classification (e.g., 'cat' vs 'dog') or multiple-choice QA. However, accuracy can be misleading on imbalanced datasets: if 99% of samples are 'cat', a model that always guesses 'cat' reaches 99% accuracy yet never identifies a single 'dog'. Operators therefore often add complementary metrics, such as precision, recall, and F1-score for classification, or perplexity for generative tasks. When quantizing a model, the change in accuracy is measured on a held-out validation set, with the quantized model compared against the full-precision baseline; a drop of more than 2 percentage points may indicate the quantization method is too aggressive. For local deployment, accuracy is one of several trade-offs: a 4-bit quantized model may lose 1-3 percentage points of accuracy but run on a 6 GB VRAM card instead of requiring 24 GB.
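The definition and the imbalanced-data caveat above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names accuracy and precision_recall_f1 and the 99-to-1 dataset are made up for the example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def precision_recall_f1(predictions, labels, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced dataset: 99 'cat' samples and 1 'dog' sample.
labels = ["cat"] * 99 + ["dog"]
always_cat = ["cat"] * 100  # a "model" that ignores its input

acc = accuracy(always_cat, labels)
dog_scores = precision_recall_f1(always_cat, labels, positive="dog")

print(f"accuracy: {acc:.2%}")
print("precision, recall, F1 for 'dog':", dog_scores)
```

Accuracy comes out at 99%, while precision, recall, and F1 for the 'dog' class are all zero, which is why a single accuracy figure can hide a useless model.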
Practical example
A 7B-parameter model such as Mistral 7B scores roughly 60% on the MMLU benchmark at full precision (FP16). Suppose that quantizing it to 4-bit with GPTQ lowers the score to about 58%. The operator must then decide whether the 2-point loss is an acceptable price for running on an RTX 3060 (12 GB VRAM) instead of needing a larger card such as an RTX 4090 (24 GB).
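The VRAM side of this trade-off can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a round 7 billion parameters and counts weights only; it ignores the KV cache, activations, and the per-block scale data that real 4-bit formats store, so actual usage will be higher. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
q4_gb = weight_memory_gb(NUM_PARAMS, 4)     # idealized 4 bits per weight

for name, gb in [("FP16", fp16_gb), ("4-bit", q4_gb)]:
    # Compare against the two cards named in the example above.
    print(f"{name}: ~{gb:.1f} GB of weights, "
          f"fits in 12 GB: {gb < 12}, fits in 24 GB: {gb < 24}")
```

Under these assumptions the FP16 weights alone come to about 14 GB, which already exceeds a 12 GB card, while the 4-bit weights come to about 3.5 GB.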
Workflow example
In llama.cpp, after quantizing a model, operators can run ./main -m model-q4_K_M.gguf -p "Question: ..." (the binary is named llama-cli in newer builds) and compare the outputs against a test set. Tools like lm_eval (EleutherAI's evaluation harness) automate accuracy measurement: lm_eval --model hf --model_args pretrained=model --tasks mmlu, where model stands for the model name or path. The reported accuracy helps decide whether to use Q4_K_M or Q5_K_M quantization.
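When outputs are compared to a test set by hand rather than through lm_eval, the scoring step might look like the following sketch. Everything here is illustrative: the five-question test set and both lists of model outputs are invented, the helper names are hypothetical, and the 2-point threshold is simply the rule of thumb mentioned earlier. A real comparison would need far more than five questions to be meaningful.

```python
def normalize(text):
    """Ignore surrounding whitespace and letter case when comparing answers."""
    return text.strip().lower()


def exact_match_accuracy(outputs, answers):
    """Share of outputs that match the expected answer after normalization."""
    if len(outputs) != len(answers) or not answers:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalize(o) == normalize(a) for o, a in zip(outputs, answers))
    return hits / len(answers)


def drop_in_points(baseline_acc, quantized_acc):
    """Accuracy lost by quantization, in percentage points."""
    return (baseline_acc - quantized_acc) * 100


# Invented multiple-choice answers and model outputs, for illustration only.
answers = ["B", "A", "D", "C", "A"]
fp16_outputs = ["B", "A", "D", "C", "B"]   # 4 of 5 correct
q4_outputs = [" b", "A", "D", "A", "B"]    # 3 of 5 correct

baseline = exact_match_accuracy(fp16_outputs, answers)
quantized = exact_match_accuracy(q4_outputs, answers)
drop = drop_in_points(baseline, quantized)

MAX_DROP_POINTS = 2.0  # rule-of-thumb threshold, not a hard limit
print(f"baseline {baseline:.0%}, quantized {quantized:.0%}, drop {drop:.1f} points")
print("consider a higher-bit quantization" if drop > MAX_DROP_POINTS else "drop is acceptable")
```

The same comparison can be repeated for each candidate quantization level, keeping the smallest model whose drop stays within the chosen threshold.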
Reviewed by Fredoline Eruo.