05. Q4_K_M vs Q8_0 vs Q2_K
The naming schemes for quantized models can be opaque. This chapter decodes the most common formats and their tradeoffs.
Format breakdown:
Q[bits]_[type]
- Q4_K_M: 4-bit with K-quantization, medium optimization
- Q8_0: 8-bit with no K-quantization, near-FP16 quality
- Q2_K: 2-bit with K-quantization, minimal quality
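The naming pattern above can be decoded mechanically. The following is a minimal sketch, assuming names follow only the shapes shown in this chapter (`Q<bits>_K`, `Q<bits>_K_<S|M|L>`, or `Q<bits>_<digit>`); `QuantName` and `parse_quant_name` are hypothetical helpers written for illustration, not part of any library.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int                # nominal bits per weight, e.g. 4 for Q4_K_M
    k_quant: bool            # True when the name contains the K marker
    variant: Optional[str]   # "S"/"M"/"L" for K-quants, the digit for non-K types


# Assumed grammar: Q<digit>_K, Q<digit>_K_<S|M|L>, or Q<digit>_<digit>
_PATTERN = re.compile(r"^Q(\d)_(K(?:_([SML]))?|\d)$", re.IGNORECASE)


def parse_quant_name(name: str) -> QuantName:
    """Split a quantization label such as 'Q4_K_M' into its parts."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized quantization name: {name!r}")
    bits = int(match.group(1))
    kind = match.group(2).upper()
    if kind.startswith("K"):
        size = match.group(3)
        return QuantName(bits, True, size.upper() if size else None)
    return QuantName(bits, False, kind)


for label in ("Q4_K_M", "Q8_0", "Q2_K"):
    print(label, parse_quant_name(label))
```

Labels outside the assumed grammar raise `ValueError` rather than being guessed at, which is the safer behavior when scanning a directory of model files with mixed naming.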
K-quantization explained:
K-quantization assigns different precision to different groups of parameters: the most sensitive parameters get more bits. The letter after K (S, M, etc.) names the size variant: S is small and M is medium, with larger variants keeping more weights at higher precision.
For example, Q4_K_M typically keeps:
- 20% of weights in Q6_K (6 bits per weight)
- 80% of weights in Q4_K (4 bits per weight)
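To see what such a mix means for the average precision, the arithmetic can be written out. This is a sketch that takes the 20/80 split above at face value and counts only the nominal bits; `average_bits` is a hypothetical helper. Real K-quant files also store per-block scale data, so the effective bits per weight come out somewhat higher than this figure.

```python
def average_bits(mix):
    """Weighted average of nominal bits per weight.

    mix is a list of (fraction_of_weights, bits_per_weight) pairs
    whose fractions must sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in mix)


# The split quoted in the text: 20% at 6-bit, 80% at 4-bit.
q4_k_m_mix = [(0.20, 6), (0.80, 4)]
print(f"nominal average: {average_bits(q4_k_m_mix):.2f} bits per weight")
```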
VRAM comparison for a 7B model:
| Format | VRAM (approx) | Relative quality |
|---|---|---|
| FP16 | 14GB | Baseline (100%) |
| Q8_0 | 7GB | ~99% of FP16 |
| Q5_K_M | 4.9GB | ~97% of FP16 |
| Q4_K_M | 4.1GB | ~95% of FP16 |
| Q3_K_M | 3.2GB | ~90% of FP16 |
| Q2_K | 2.7GB | ~85% of FP16 |
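The figures in the table follow from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below assumes 1 GB = 10^9 bytes and exactly 7 billion parameters, and covers weights only; the KV cache and activations need additional memory on top. The function names are hypothetical.

```python
PARAMS_7B = 7e9

# Approximate sizes copied from the table above, in GB.
TABLE_GB = {
    "FP16": 14.0,
    "Q8_0": 7.0,
    "Q5_K_M": 4.9,
    "Q4_K_M": 4.1,
    "Q3_K_M": 3.2,
    "Q2_K": 2.7,
}


def size_gb(n_params, bits_per_weight):
    """Weight storage in GB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_in_gb, n_params):
    """Invert size_gb: which average bits per weight produces this size?"""
    return size_in_gb * 1e9 * 8 / n_params


for name, gb in TABLE_GB.items():
    bpw = implied_bits_per_weight(gb, PARAMS_7B)
    print(f"{name:7s} {gb:5.1f} GB -> {bpw:.2f} bits per weight")
```

Running this shows that the table's Q4_K_M entry corresponds to more than 4 bits per weight on average, consistent with a format that mixes precisions and stores scale data alongside the weights.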
When to use each:
Q8_0: When you have enough VRAM and want quality as close to FP16 as possible. Good for final production runs if memory allows.
Q4_K_M: The default recommendation. Offers the best quality-per-VRAM ratio and works for most tasks: coding, reasoning, chat. If you are unsure, start here.
Q2_K: Only when VRAM is severely limited. Output quality degrades noticeably. Useful for quickly testing a large model you cannot otherwise run.
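These guidelines can be turned into a small selection rule: pick the highest-quality format whose weights fit in the available VRAM. The sketch below uses the approximate 7B sizes from the table; `pick_format` and the 1 GB `headroom_gb` default (reserved for the KV cache and runtime overhead) are assumptions chosen for illustration, not measured values.

```python
# (name, approx size in GB for a 7B model), ordered from highest to lowest quality.
FORMATS_7B = [
    ("FP16", 14.0),
    ("Q8_0", 7.0),
    ("Q5_K_M", 4.9),
    ("Q4_K_M", 4.1),
    ("Q3_K_M", 3.2),
    ("Q2_K", 2.7),
]


def pick_format(vram_gb, headroom_gb=1.0):
    """Return the highest-quality format that fits, or None if nothing does."""
    budget = vram_gb - headroom_gb
    for name, size in FORMATS_7B:
        if size <= budget:
            return name
    return None


for vram in (24, 8, 6, 4, 3):
    print(f"{vram:2d} GB VRAM -> {pick_format(vram)}")
```

The rule always prefers quality when memory allows, so on a large card it will choose FP16 or Q8_0; starting from Q4_K_M regardless remains a reasonable default when speed or context length matters more.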
The imatrix factor:
Some quantizations use an "importance matrix" (imatrix) computed from representative samples. It measures which weights matter most for accuracy so that the quantizer can preserve them better. Published Q4_K_M GGUF files are often built with an imatrix computed by running the original model over a calibration dataset.
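The idea can be illustrated with a toy example. The sketch below is not llama.cpp's algorithm: it assumes importance is the mean squared activation seen by each weight column, and it picks a quantization scale by brute-force search over a small range. All names and numbers are hypothetical.

```python
def importance(activations):
    """Mean squared activation per input column (assumed importance measure)."""
    n_rows = len(activations)
    n_cols = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / n_rows for j in range(n_cols)]


def quantize(weights, scale, levels=7):
    """Round each weight to an integer in [-levels, levels], then map it back."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, approx, imp):
    """Squared error with each weight's error scaled by its importance."""
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, imp))


def best_scale(weights, imp, levels=7, steps=200):
    """Search scales from 0.5x to 1.5x of the naive max-based scale."""
    base = max(abs(w) for w in weights) / levels
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), imp),
    )


weights = [0.9, -0.31, 0.12, 0.05]
# Calibration samples: the second column sees much larger activations.
activations = [[0.1, 2.0, 0.1, 0.1], [0.2, -1.5, 0.0, 0.1]]

imp = importance(activations)
naive = max(abs(w) for w in weights) / 7
tuned = best_scale(weights, imp)
print("naive weighted error:", weighted_error(weights, quantize(weights, naive), imp))
print("tuned weighted error:", weighted_error(weights, quantize(weights, tuned), imp))
```

Because the naive scale is one of the candidates searched, the tuned scale can never do worse on the weighted error; how much it helps depends on how unevenly the importance is distributed.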
# Example GGUF file naming
llama-3-8b-instruct-q4_k_m.gguf
# This file uses K-quantization at 4-bit with medium optimization
Download a small 3B model and convert it to both Q8_0 and Q4_K_M using llama.cpp. Compare the file sizes, then run a simple benchmark to measure the quality difference on a known task.
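One way to script this exercise is to build the command lines in Python and run them with `subprocess`. The sketch below assumes a recent llama.cpp build in which the tools are named `llama-quantize` and `llama-perplexity` (older builds used different binary names) and that they are on your PATH; the file names `model-3b-f16.gguf` and `eval.txt` are placeholders.

```python
import os
from pathlib import Path
from typing import List


def quantize_cmd(src: Path, quant_type: str, binary: str = "llama-quantize") -> List[str]:
    """Command that converts src into '<stem>-<type>.gguf' next to it."""
    dst = src.with_name(f"{src.stem}-{quant_type.lower()}.gguf")
    return [binary, str(src), str(dst), quant_type]


def perplexity_cmd(model: Path, text_file: Path, binary: str = "llama-perplexity") -> List[str]:
    """Command that measures perplexity of a model on a text file."""
    return [binary, "-m", str(model), "-f", str(text_file)]


def file_size_gb(path: Path) -> float:
    """File size in GB (10^9 bytes)."""
    return os.path.getsize(path) / 1e9


source = Path("model-3b-f16.gguf")  # placeholder: your converted FP16 GGUF file
for quant_type in ("Q8_0", "Q4_K_M"):
    print(" ".join(quantize_cmd(source, quant_type)))
print(" ".join(perplexity_cmd(Path("model-3b-f16-q4_k_m.gguf"), Path("eval.txt"))))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)`. Perplexity on a fixed text file is only one possible quality measure; a task-specific benchmark may show a different gap between the two formats.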