Q4_K_M vs Q8_0 vs Q2_K — Understanding AI Models (Chapter 5)

The naming schemes for quantized models can be opaque. This chapter decodes the most common formats and their tradeoffs.

Format breakdown:

Q[bits]_[type][_size]

Q4_K_M: 4-bit K-quantization, medium (M) variant
Q8_0: 8-bit, older non-K block format (the 0 is a format variant, not a size), near-FP16 quality
Q2_K: 2-bit K-quantization, smallest files, lowest quality
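
The naming pattern above can be sketched as a small parser. This is a minimal illustration, not part of any library: `QuantTag` and `parse_quant_tag` are hypothetical names, and the pattern only covers the three-part tags discussed in this chapter (other families will be rejected).

```python
import re
from typing import NamedTuple, Optional


class QuantTag(NamedTuple):
    bits: int            # nominal bits per weight
    scheme: str          # "K" for K-quants, or a digit such as "0" for older block formats
    size: Optional[str]  # "S", "M", "L", or None when the tag has no size suffix


# Assumed pattern: Q<bits>_<K or digit>, optionally followed by _<S|M|L>
_TAG = re.compile(r"^Q(\d+)_([K0-9])(?:_([SML]))?$", re.IGNORECASE)


def parse_quant_tag(tag: str) -> QuantTag:
    """Split a tag such as 'Q4_K_M' into its parts."""
    m = _TAG.match(tag.strip())
    if m is None:
        raise ValueError(f"unrecognized quantization tag: {tag!r}")
    bits, scheme, size = m.groups()
    return QuantTag(int(bits), scheme.upper(), size.upper() if size else None)


for tag in ("Q4_K_M", "Q8_0", "Q2_K"):
    print(tag, "->", parse_quant_tag(tag))
```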

K-quantization explained:

K-quantization uses different precision for different parameter groups. The most sensitive tensors get higher precision. The letter after K (S, M, or L) indicates how much of the model is kept at that higher precision: S is small, M is medium, L is large.

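For example, Q4_K_M keeps roughly the following split (the exact proportions depend on the model architecture):

about 20% of weights in Q6_K (nominally 6 bits per weight)
about 80% of weights in Q4_K (nominally 4 bits per weight)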
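
To see what such a mix means for the average precision, the split can be turned into a weighted average. This is a back-of-the-envelope sketch using the nominal figures above; `mixed_bits_per_weight` is a hypothetical helper, and real files come out somewhat larger because each block also stores scale values.

```python
def mixed_bits_per_weight(mix):
    """Average bits per weight for a list of (fraction_of_weights, bits) pairs."""
    total = sum(frac for frac, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(frac * bits for frac, bits in mix)


# Nominal split quoted in the text: 20% at 6 bits, 80% at 4 bits
q4_k_m_mix = [(0.20, 6.0), (0.80, 4.0)]
print(f"{mixed_bits_per_weight(q4_k_m_mix):.2f} bits per weight")  # 4.40 bits per weight
```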

VRAM comparison for a 7B model (weights only; the KV cache and runtime buffers need additional memory):

Format	VRAM (approx.)	Relative quality (rough estimate)
FP16	14 GB	Baseline (100%)
Q8_0	7 GB	~99% of FP16
Q5_K_M	4.9 GB	~97% of FP16
Q4_K_M	4.1 GB	~95% of FP16
Q3_K_M	3.2 GB	~90% of FP16
Q2_K	2.7 GB	~85% of FP16
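
The FP16 and Q8_0 rows follow directly from parameter count times bits per weight. The sketch below shows that arithmetic and also works backwards from a table entry to the bits per weight it implies. The function names are hypothetical, and the calculation assumes exactly 7 billion parameters and decimal gigabytes.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Effective bits per weight implied by a given weight size."""
    return size_gb * 1e9 * 8 / n_params


N_PARAMS = 7e9  # assumption: a model with exactly 7 billion parameters

print(weights_gb(N_PARAMS, 16))  # FP16 -> 14.0
print(weights_gb(N_PARAMS, 8))   # nominal 8-bit -> 7.0
print(round(implied_bits_per_weight(4.1, N_PARAMS), 2))  # Q4_K_M row -> 4.69
```

The Q4_K_M row implies more than 4 bits per weight, which is consistent with part of the model being stored at higher precision.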

When to use each:

Q8_0: When you have enough VRAM and want quality as close to FP16 as possible. Good for final production runs if memory allows.

Q4_K_M: The default recommendation. It generally offers the best balance of quality and VRAM, and works for most tasks: coding, reasoning, chat. If you are unsure, start here.

Q2_K: Only when VRAM is severely limited. Output quality degrades noticeably. Useful for quickly testing a large model you cannot otherwise run.
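
The guidance above amounts to picking the largest format that fits. A minimal sketch of that rule, using the 7B sizes from the table earlier in this chapter: `pick_format` and the 1.5 GB `headroom_gb` default are assumptions for illustration (the headroom stands in for the KV cache and runtime buffers, which vary with context length).

```python
# Approximate 7B weight sizes in GB from the chapter's table, largest first
SIZES_7B_GB = [
    ("FP16", 14.0),
    ("Q8_0", 7.0),
    ("Q5_K_M", 4.9),
    ("Q4_K_M", 4.1),
    ("Q3_K_M", 3.2),
    ("Q2_K", 2.7),
]


def pick_format(vram_gb, headroom_gb=1.5):
    """Return the highest-quality format whose weights fit, or None if none fit."""
    budget = vram_gb - headroom_gb  # headroom_gb is an assumed allowance, not a measured value
    for name, size_gb in SIZES_7B_GB:
        if size_gb <= budget:
            return name
    return None


for vram in (24, 8, 6, 4):
    print(f"{vram} GB ->", pick_format(vram))
```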

The imatrix factor:

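Some quantizations use an "importance matrix" (imatrix) computed by running the original model on representative calibration text. It estimates which weights matter most for accuracy so that the quantizer can preserve them better. The benefit tends to be largest at low bit widths such as Q2_K and Q3_K, but Q4_K_M files are often built with an imatrix as well; whether one was used is usually stated by whoever published the file rather than encoded in the format name.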
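
The idea of weighting quantization error by importance can be illustrated with a toy example. This is not the algorithm any real quantizer uses: the weights and importance values are made up, the helper names are hypothetical, and the scale is found by a plain grid search.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to the nearest multiple of scale, clamped to [-levels, levels] steps."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, approx, importance):
    """Importance-weighted squared error between original and quantized weights."""
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the scale that minimizes the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / levels * (0.5 + k / steps) for k in range(1, steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance),
    )


weights = [0.91, -0.40, 0.13, -0.07, 0.55, 0.02]  # made-up weights for one block
uniform = [1.0] * len(weights)                     # no imatrix: every weight counts equally
importance = [0.1, 5.0, 0.1, 0.1, 4.0, 0.1]       # made-up importance values

s_plain = best_scale(weights, uniform)
s_imat = best_scale(weights, importance)
err_plain = weighted_error(weights, quantize(weights, s_plain), importance)
err_imat = weighted_error(weights, quantize(weights, s_imat), importance)

print(f"scale without importance: {s_plain:.4f}, weighted error {err_plain:.6f}")
print(f"scale with importance:    {s_imat:.4f}, weighted error {err_imat:.6f}")
```

Because both scales are chosen from the same candidate grid, the importance-aware choice can never do worse on the weighted error; how much better it does depends entirely on the data.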

# Example GGUF file naming
llama-3-8b-instruct-q4_k_m.gguf
# This file uses 4-bit K-quantization, medium (M) variant
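
When sorting through downloaded files, the tag can be pulled out of a file name with a regular expression. This is a sketch under the assumption that the name contains a tag of the form shown in this chapter; `quant_tag_from_filename` is a hypothetical helper, and file naming is a convention rather than something the format enforces.

```python
import re
from typing import Optional

# Assumed pattern: q<bits>_k, q<bits>_k_<s|m|l>, or q<bits>_<digit>, not embedded in a longer word
_QUANT_IN_NAME = re.compile(
    r"(?<![a-z0-9])(q\d+_(?:k(?:_[sml])?|\d))(?![a-z0-9])",
    re.IGNORECASE,
)


def quant_tag_from_filename(filename: str) -> Optional[str]:
    """Return the upper-cased quantization tag found in a file name, or None."""
    m = _QUANT_IN_NAME.search(filename)
    return m.group(1).upper() if m else None


print(quant_tag_from_filename("llama-3-8b-instruct-q4_k_m.gguf"))  # Q4_K_M
```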