Quantization Explained — Understanding AI Models (Chapter 4)

Quantization reduces the precision of model weights to save memory and increase inference speed. Understanding the mechanics helps you choose the right quantization level for your quality/performance tradeoff.

The problem with full precision:

A 7B model in FP16 (16-bit floating point) requires:

7,000,000,000 params x 2 bytes = 14GB VRAM minimum

That does not fit on a 12GB GPU, and the figure covers the weights alone. Quantization lowers weight precision so that more of the model fits in the available memory.
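The same arithmetic can be repeated for other bit widths. The following is a minimal sketch; the helper name weight_memory_gb is made up for illustration, and it counts weights only, using 1 GB = 1e9 bytes as in the calculation above.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7_000_000_000, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```

Real quantized files also store scale factors alongside the weights, so their effective bits per weight are somewhat higher than the nominal 8 or 4.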

Number formats:

Format	Bits	Range	Use case
FP32	32	-3.4e38 to 3.4e38	Full precision, rarely needed
FP16	16	-65504 to 65504	Standard for modern training
BF16	16	-3.4e38 to 3.4e38	Better dynamic range than FP16
INT8	8	-128 to 127	Common quantization target
INT4	4	-8 to 7	Aggressive compression
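The ranges in the table can be checked with NumPy, as a quick sketch. NumPy has no built-in BF16 or INT4 type, so the INT4 bounds are computed by hand here and BF16 is left out.

```python
import numpy as np

fp32_max = float(np.finfo(np.float32).max)   # largest finite FP32 value
fp16_max = float(np.finfo(np.float16).max)   # largest finite FP16 value
int8 = np.iinfo(np.int8)                     # INT8 bounds

# Signed 4-bit integers: computed by hand, since NumPy has no int4 dtype
int4_min, int4_max = -(2 ** 3), 2 ** 3 - 1

print("FP32 max:", fp32_max)
print("FP16 max:", fp16_max)
print("INT8 range:", int8.min, "to", int8.max)
print("INT4 range:", int4_min, "to", int4_max)
```

Range is only half the picture: the integer formats have just 256 (INT8) or 16 (INT4) distinct levels, which is why a scale factor is needed to map real-valued weights onto them.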

How quantization works:

The goal is to map FP16 values to INT8/INT4 with minimal accuracy loss:

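import numpy as np

# Simple symmetric per-tensor quantization to INT8
def quantize_tensor(tensor_fp16):
    # Do the arithmetic in FP32 so tiny scales do not underflow in FP16
    tensor_fp32 = tensor_fp16.astype(np.float32)

    # Scale = max absolute value / largest INT8 magnitude used (127).
    # The 1e-10 floor avoids division by zero for an all-zero tensor.
    max_val = max(float(np.abs(tensor_fp32).max()), 1e-10)
    scale = max_val / 127.0

    # Quantize: divide by the scale, round to nearest, clamp to [-127, 127]
    quantized = np.clip(np.round(tensor_fp32 / scale), -127, 127)

    return quantized.astype(np.int8), scale

def dequantize_tensor(quantized, scale):
    # Recover approximate FP16 values; the rounding error cannot be undone
    return (quantized.astype(np.float32) * scale).astype(np.float16)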

The accuracy problem:

Naive quantization loses information. Depending on the tensor's scale, a weight of 0.015 in FP16 might round to level 0 or level 1 in INT8, so dequantization returns either 0 or one full quantization step instead of 0.015. These small errors accumulate across billions of weights and degrade model quality.
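A small worked example makes the loss concrete. The weights below are invented for illustration: three small values plus a single outlier of 4.0, which forces a large per-tensor scale.

```python
import numpy as np

# Hypothetical weights: mostly small values plus one outlier of 4.0
weights = np.array([0.015, -0.02, 0.3, 4.0], dtype=np.float32)

# One scale for the whole tensor, set by the largest magnitude
scale = float(np.abs(weights).max()) / 127.0

# Quantize to INT8 levels, then map back to floats
levels = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
restored = levels.astype(np.float32) * scale

for w, q, r in zip(weights, levels, restored):
    print(f"{float(w):+.3f} -> level {int(q):4d} -> {float(r):+.4f}")
```

With this scale (about 0.0315 per level), 0.015 lands on level 0 and is restored as exactly 0, while -0.02 lands on level -1 and comes back roughly 50% too large in magnitude. Only the outlier is reproduced almost exactly.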

Why K-quantization exists:

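The K in Q4_K_M and Q4_K_S marks the "k-quant" family of formats used by llama.cpp. It is often glossed as "K-means", but these formats are block-wise rather than cluster-based: weights are split into small blocks, each block gets its own scale (and, in some variants, a minimum), and those per-block parameters are themselves stored at reduced precision inside larger super-blocks. Because each scale only has to cover a few dozen weights, an outlier distorts its own block rather than the whole tensor, which reduces accuracy loss compared to naive per-tensor quantization. The trailing S and M denote small and medium variants that trade file size against quality.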
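The benefit of giving each group of weights its own parameters can be sketched as follows. This is a simplified illustration, not the actual Q4_K_M layout or algorithm: it assumes plain absmax scaling, INT8 levels, a block size of 32, and synthetic weights, and the function names are made up for the example.

```python
import numpy as np

def quantize_blockwise(weights, block_size=32, qmax=127):
    """Symmetric quantization with one scale per block (illustrative only)."""
    blocks = weights.astype(np.float32).reshape(-1, block_size)
    # One scale per row (block); the floor avoids division by zero
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-10) / qmax
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

# Synthetic weights: small Gaussian values plus a single outlier
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1024).astype(np.float32)
weights[0] = 4.0

# Per-tensor: a single block spanning the whole tensor
q_t, s_t = quantize_blockwise(weights, block_size=weights.size)
err_tensor = np.abs(dequantize_blockwise(q_t, s_t) - weights).mean()

# Per-block: every 32 weights share one scale
q_b, s_b = quantize_blockwise(weights, block_size=32)
err_block = np.abs(dequantize_blockwise(q_b, s_b) - weights).mean()

print(f"per-tensor mean abs error: {err_tensor:.6f}")
print(f"per-block  mean abs error: {err_block:.6f}")
```

In the per-block case the outlier inflates the scale of only the first block, so the remaining 31 blocks keep fine-grained levels and the mean error drops. The cost is one extra scale stored per block.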