How to compare model performance across different quantization levels
Prerequisites
The same base model available in multiple quantization levels (e.g., Q4_K_M, Q5_K_M, Q8_0) through Ollama, and terminal access.
What this does
Benchmarks an identical prompt against the same model at several quantization levels, measuring latency and tokens per second to find the best trade-off between resource usage, speed, and response quality. By the end of this guide you will have a comparison table of Q4_K_M, Q5_K_M, and Q8_0 performance.
Steps
1. Pull model quantizations. Ensures all variants are available locally.
ollama pull mistral:q4_K_M && ollama pull mistral:q5_K_M && ollama pull mistral:q8_0
Each pull shows a progress indicator. Verify with ollama list.
2. Create a standard test prompt. Saves a shared prompt so every variant receives identical input.
echo "Explain the mechanism of photosynthesis in three sentences." > /tmp/test_prompt.txt
3. Benchmark each quantization with timing. Records wall-clock time for each variant.
time ollama run mistral:q4_K_M "$(cat /tmp/test_prompt.txt)"
time ollama run mistral:q5_K_M "$(cat /tmp/test_prompt.txt)"
time ollama run mistral:q8_0 "$(cat /tmp/test_prompt.txt)"
Expected output: each command prints the response followed by real/user/sys time. The first run of a variant typically includes model load time, so run each command twice in a row and record the second result. Adding --verbose to ollama run also prints Ollama's own timing statistics, including the eval rate in tokens per second.
4. Tabulate results. Record latency (and eval rate, if collected) for each variant side by side, then evaluate response quality subjectively.
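The timing and tabulation steps can also be scripted. The following is a minimal sketch, not part of the original procedure. It assumes Ollama is serving its HTTP API on the default localhost:11434, that curl and jq are installed, and that the tags match what ollama list reports on your machine. The function names tokens_per_second and bench_tag are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: measure tokens per second for each quantization through
# Ollama's local HTTP API. Assumes curl and jq are installed and that
# Ollama is listening on localhost:11434 (the default).

# Convert a token count and a duration in nanoseconds to tokens per second.
tokens_per_second() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns + 0 > 0) printf "%.2f\n", n / (ns / 1e9)
        else print "n/a"
    }'
}

# Run one non-streaming generation and print one table row:
# tag, total seconds, tokens per second.
bench_tag() {
    tag=$1
    prompt_file=$2
    # Build the JSON request; $(cat ...) drops the trailing newline,
    # matching the prompt passed to `ollama run` in the manual steps.
    body=$(jq -n --arg model "$tag" --arg prompt "$(cat "$prompt_file")" \
        '{model: $model, prompt: $prompt, stream: false}')
    reply=$(curl -s http://localhost:11434/api/generate -d "$body")
    # The API reports durations in nanoseconds.
    count=$(printf '%s' "$reply" | jq -r '.eval_count')
    eval_ns=$(printf '%s' "$reply" | jq -r '.eval_duration')
    total_ns=$(printf '%s' "$reply" | jq -r '.total_duration')
    total_s=$(awk -v ns="$total_ns" 'BEGIN { printf "%.2f", ns / 1e9 }')
    printf '%-16s %10s %10s\n' "$tag" "$total_s" \
        "$(tokens_per_second "$count" "$eval_ns")"
}

# Usage (requires a running Ollama server):
#   printf '%-16s %10s %10s\n' TAG SECONDS TOK/S
#   for t in mistral:q4_K_M mistral:q5_K_M mistral:q8_0; do
#       bench_tag "$t" /tmp/test_prompt.txt
#   done
```

The total duration reported by the API includes model load time, so the first row for a freshly loaded variant may look slower than a repeat run; the tokens-per-second column is computed from generation time only.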
Verification
ollama list | grep mistral
# Expected: mistral with q4_K_M, q5_K_M, and q8_0 tags listed with size differences visible (Q8_0 largest, Q4_K_M smallest)
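To check the listing without reading it by eye, a small filter can report which tags are absent. This is a sketch that assumes the model name is the first column of ollama list output; missing_tags is a hypothetical helper name, not an Ollama command.

```shell
#!/bin/sh
# Sketch only: print every expected tag that does not appear in the first
# column of the `ollama list` output supplied on stdin.
missing_tags() {
    listing=$(cat)
    for tag in "$@"; do
        # awk exits 0 when some line's first field equals the tag.
        printf '%s\n' "$listing" \
            | awk -v t="$tag" '$1 == t { found = 1 } END { exit !found }' \
            || printf '%s\n' "$tag"
    done
}

# Usage (assumes ollama is on PATH):
#   ollama list | missing_tags mistral:q4_K_M mistral:q5_K_M mistral:q8_0
```

Empty output means all three variants are present; any printed tag still needs to be pulled.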
Common failures
- model not found: Quantization tag may not match; verify exact tag with ollama show mistral:q4_K_M.
- out of memory with Q8_0: Q8_0 requires significantly more RAM; close other applications or use a smaller base model.
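- inconsistent prompts: Even trailing whitespace can change outputs; always use the same prompt file for every variant.
- slow inference on Q4_K_M: Some quantizations perform poorly on CPU-only systems; check whether the model is actually running on the GPU (ollama ps reports where a loaded model is running).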
- outdated Ollama version: Older versions may not support all quantization tags; upgrade to 0.4.x or later.
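For the inconsistent-prompts failure, one way to confirm that the prompt file has not changed between runs is to record a checksum before the first benchmark and compare it before each later one. This is a sketch using the POSIX cksum utility; prompt_fingerprint and same_prompt are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: fingerprint the prompt file so accidental edits
# (including trailing whitespace) are detected between benchmark runs.

# Print the CRC checksum and byte count of the file.
prompt_fingerprint() {
    cksum < "$1"
}

# Succeed only if the file still matches a previously recorded fingerprint.
same_prompt() {
    [ "$(prompt_fingerprint "$1")" = "$2" ]
}

# Usage:
#   fp=$(prompt_fingerprint /tmp/test_prompt.txt)
#   same_prompt /tmp/test_prompt.txt "$fp" || echo "prompt file changed"
```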
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
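The checkpoint details can be captured in a file alongside the results. This is a sketch under the assumption that ollama is on PATH; write_checkpoint and the output path are hypothetical, and hardware or backend details (GPU model, driver) still need to be added by hand because the commands for them differ by platform.

```shell
#!/bin/sh
# Sketch only: record the runtime version, system, and verification output
# in a plain-text file.
write_checkpoint() {
    out=$1
    {
        printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf 'system: %s\n' "$(uname -sm)"
        if command -v ollama >/dev/null 2>&1; then
            printf 'ollama: %s\n' "$(ollama --version 2>&1)"
            # Verification output; "|| true" keeps the function from
            # failing when no mistral variant is listed.
            ollama list 2>&1 | grep mistral || true
        else
            printf 'ollama: not found on PATH\n'
        fi
    } > "$out"
}

# Usage:
#   write_checkpoint ./quant_benchmark_checkpoint.txt
```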