04. Quantization Explained
Quantization reduces the precision of model weights to save memory and increase inference speed. Understanding the mechanics helps you choose the right quantization level for your quality/performance tradeoff.
The problem with full precision:
A 7B model in FP16 (16-bit floating point) requires:
7,000,000,000 params x 2 bytes = 14GB VRAM minimum
That does not fit on a 12GB GPU. Lowering the precision of the weights shrinks the model so that it fits in the available memory.
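The same arithmetic can be repeated for other bit widths. The sketch below is a rough estimate that counts weights only. The helper name `weight_memory_gb` is made up for illustration, and it assumes 1 GB = 1e9 bytes. Real quantized files come out somewhat larger because they also store scales and other metadata, and inference needs extra memory beyond the weights.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


params_7b = 7_000_000_000
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(params_7b, bits):.1f} GB")
```

Under these assumptions the 7B model needs about 7 GB at INT8 and 3.5 GB at INT4, so both variants fit on the 12GB GPU.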
Number formats:
| Format | Bits | Range | Use case |
|---|---|---|---|
| FP32 | 32 | -3.4e38 to 3.4e38 | Full precision, rarely needed for inference |
| FP16 | 16 | -65504 to 65504 | Common for inference and mixed-precision training |
| BF16 | 16 | -3.4e38 to 3.4e38 | Better dynamic range than FP16 |
| INT8 | 8 | -128 to 127 | Common quantization target |
| INT4 | 4 | -8 to 7 | Aggressive compression |
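The ranges in the table follow from each format's bit layout. The sketch below recomputes them in plain Python. The exponent and mantissa widths passed in are the standard IEEE 754 values for FP32 and FP16 and the usual bfloat16 layout; the helper names `float_max` and `int_range` are made up for illustration.

```python
def float_max(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable one is 2^e - 2
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


def int_range(bits: int) -> tuple[int, int]:
    """Range of a signed two's-complement integer with the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


print("FP32 max:", float_max(8, 23))
print("FP16 max:", float_max(5, 10))
print("BF16 max:", float_max(8, 7))
print("INT8 range:", int_range(8))
print("INT4 range:", int_range(4))
```

BF16 keeps FP32's 8 exponent bits, which is why its range matches FP32, but it has only 7 mantissa bits against FP16's 10, so it trades precision for that range.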
How quantization works:
The goal is to map FP16 values to INT8/INT4 with minimal accuracy loss:
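import numpy as np

# Simple symmetric per-tensor quantization to INT8
def quantize_tensor(tensor_fp16):
    # Work in FP32 so the scale itself does not lose precision in FP16
    tensor = np.asarray(tensor_fp16, dtype=np.float32)
    # Find scale (max absolute value / quantization range);
    # the 1e-10 floor avoids dividing by zero for an all-zero tensor
    max_val = max(float(np.abs(tensor).max()), 1e-10)
    scale = max_val / 127.0  # For INT8
    # Quantize: round to the nearest integer, then clamp to the symmetric range
    quantized = np.clip(np.round(tensor / scale), -127, 127)
    return quantized.astype(np.int8), scale

def dequantize_tensor(quantized, scale):
    # Recover an FP16 approximation of the original values
    return (quantized.astype(np.float32) * scale).astype(np.float16)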
The accuracy problem:
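Naive quantization loses information. Depending on the scale, a weight of 0.015 in FP16 might quantize to 0 or 1 in INT8, and after dequantization it comes back as a different value (with a scale of 0.02, for example, it rounds to 1 and returns as 0.02). These errors accumulate: small errors across billions of weights degrade model quality.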
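To make the rounding error concrete, the sketch below pushes the weight 0.015 through an INT8 round trip at a few scales. The scales are arbitrary illustrative values, not taken from a real model, and `round_trip` is a made-up helper.

```python
def round_trip(weight: float, scale: float, qmax: int = 127) -> tuple[int, float]:
    """Quantize one weight to a signed integer, then dequantize it again."""
    q = max(-qmax, min(qmax, round(weight / scale)))  # round, then clamp
    return q, q * scale


weight = 0.015
for scale in (0.005, 0.02, 0.04):
    q, restored = round_trip(weight, scale)
    error = abs(restored - weight)
    print(f"scale={scale}: q={q}, restored={restored:.3f}, error={error:.3f}")
```

With the coarsest of these scales the weight collapses to 0 and is lost entirely. A finer scale preserves it, but a per-tensor scale is dictated by the largest weight in the tensor, not by the small ones.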
Why K-quantization exists:
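The K in Q4_K_M and Q4_K_S marks llama.cpp's "k-quant" formats. Despite a common assumption, these do not use k-means clustering. Instead, weights are grouped into super-blocks of 256, which are split into smaller blocks that each get their own scale (and, for Q4_K, a minimum), and those per-block values are themselves stored in quantized form. Because each block is scaled independently, an outlier only coarsens the block it sits in, which reduces accuracy loss compared to naive per-tensor quantization. The _S and _M suffixes denote small and medium variants, where the medium one keeps some tensors at higher precision.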
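The benefit of giving each group of weights its own scale can be shown with a small experiment. The sketch below is a simplified symmetric 4-bit scheme in plain Python, not llama.cpp's actual K-quant layout. The function names, the block size of 32, and the synthetic weights are all assumptions chosen for illustration.

```python
def quantize_dequantize(weights: list[float], qmax: int = 7) -> list[float]:
    """Symmetric round trip for one group of weights sharing a single scale."""
    scale = max(max(abs(w) for w in weights), 1e-10) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def blockwise_round_trip(weights: list[float], block_size: int = 32) -> list[float]:
    """Apply the round trip independently to each block of weights."""
    restored = []
    for start in range(0, len(weights), block_size):
        restored.extend(quantize_dequantize(weights[start:start + block_size]))
    return restored


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Synthetic weights: small values plus one large outlier in the first block
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[5] = 4.0

per_tensor_error = mean_abs_error(weights, quantize_dequantize(weights))
per_block_error = mean_abs_error(weights, blockwise_round_trip(weights))
print(f"per-tensor: {per_tensor_error:.5f}, per-block: {per_block_error:.5f}")
```

With a single scale, the outlier forces every small weight to round to 0. With per-block scales, only the block containing the outlier stays coarse, so the average error drops.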
Find the quantization method used in llama.cpp and compare it to GPTQ. List two advantages of each approach.