RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Understanding AI Models

04. Quantization Explained

Chapter 4 of 20 · 15 min
KEY INSIGHT

Quantization trades quality for VRAM savings: the more aggressive the compression, the more capability you lose, but optimized methods minimize this loss.

Quantization reduces the precision of model weights to save memory and increase inference speed. Understanding the mechanics helps you choose the right quantization level for your quality/performance tradeoff.

The problem with full precision:

A 7B model in FP16 (16-bit floating point) requires:

7,000,000,000 params × 2 bytes = 14 GB of VRAM for the weights alone

That does not fit on a 12 GB GPU, even before the KV cache and activations are counted. Quantization reduces weight precision so that more of the model fits in available memory.
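
The same arithmetic can be repeated for lower bit widths. The sketch below is a rough estimate only: the function name `weight_memory_gb` is made up for this illustration, it counts weights alone, and it uses nominal bit widths, whereas real quantized files also store per-block scales and so come out somewhat larger.

```python
def weight_memory_gb(n_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7_000_000_000  # the 7B model from the text

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

At these nominal widths the 7B model drops from 14 GB to 7 GB at 8 bits and 3.5 GB at 4 bits, which is why a 4-bit quantization is a common choice for 8 GB and 12 GB cards.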

Number formats:

Format  Bits  Range               Use case
FP32    32    -3.4e38 to 3.4e38   Full precision, rarely needed
FP16    16    -65504 to 65504     Standard for modern training
BF16    16    -3.4e38 to 3.4e38   Better dynamic range than FP16
INT8    8     -128 to 127         Common quantization target
INT4    4     -8 to 7             Aggressive compression
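
The ranges in the table follow directly from how many bits each format spends on exponent, mantissa, or integer value. A minimal sketch, assuming the usual field widths (FP32: 8 exponent and 23 mantissa bits, FP16: 5 and 10, BF16: 8 and 7) and two's-complement integers; `float_max` and `int_range` are helper names invented for this example.

```python
def float_max(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with these field widths."""
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    max_exponent = 2 ** (exponent_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent

def int_range(bits: int) -> tuple[int, int]:
    """Range of a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

print("FP32 max:", float_max(8, 23))
print("FP16 max:", float_max(5, 10))
print("BF16 max:", float_max(8, 7))
print("INT8 range:", int_range(8))
print("INT4 range:", int_range(4))
```

BF16 keeps the FP32 exponent width, which is where its wide range comes from, but it has only 7 mantissa bits, so it is coarser than FP16 within the range both can represent.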

How quantization works:

The goal is to map FP16 values to INT8/INT4 with minimal accuracy loss:

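# Simple per-tensor symmetric quantization (NumPy)
import numpy as np

def quantize_tensor(tensor_fp16):
    # Work in float32 so the scale itself is not rounded to FP16
    tensor = np.asarray(tensor_fp16, dtype=np.float32)

    # Find scale (max absolute value / quantization range);
    # the 1e-10 floor avoids dividing by zero for an all-zero tensor
    max_val = max(float(np.abs(tensor).max()), 1e-10)
    scale = max_val / 127.0  # For INT8

    # Quantize: divide by the scale, round to the nearest integer, clamp
    quantized = np.clip(np.round(tensor / scale), -127, 127)

    return quantized.astype(np.int8), scale

def dequantize_tensor(quantized, scale):
    # Recover approximate FP16 values; the rounding error is not undone
    return (quantized.astype(np.float32) * scale).astype(np.float16)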
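
To see what the round trip costs, the same scheme can be run on synthetic data at two bit widths. This is a dependency-free sketch, not a production quantizer: the weights are random Gaussian values standing in for a real tensor, and the helper names (`quantize`, `dequantize`, `max_roundtrip_error`) are chosen for this example.

```python
import random

def quantize(values, bits=8):
    """Symmetric per-tensor quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_val = max(max(abs(v) for v in values), 1e-10)
    scale = max_val / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def max_roundtrip_error(values, bits):
    """Largest absolute difference after quantizing and dequantizing."""
    q, scale = quantize(values, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored)), scale

random.seed(0)
# Synthetic stand-in for a weight tensor, not real model weights
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4):
    err, scale = max_roundtrip_error(weights, bits)
    print(f"INT{bits}: scale={scale:.6f}  max error={err:.6f}  half step={scale / 2:.6f}")
```

Because the scale is chosen so the largest weight maps to the largest integer, nothing is clipped and the worst-case error is about half a quantization step. Going from 8 to 4 bits makes that step roughly 18 times larger (127 levels per side versus 7).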

The accuracy problem:

Naive quantization loses information. Depending on the scale, a weight of 0.015 in FP16 might quantize to 0 or 1 in INT8, and dequantizing then returns either 0 or one full quantization step rather than 0.015. These errors accumulate: small errors across billions of weights degrade model quality.
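
How badly a small weight is distorted depends on the largest weight sharing its scale. The sketch below follows the 0.015 example through INT8 quantization under three hypothetical maximum values; the `roundtrip` helper and the chosen maxima are assumptions made for illustration.

```python
def roundtrip(weight: float, max_abs: float, qmax: int = 127) -> tuple[int, float]:
    """Quantize one weight with a per-tensor scale set by max_abs, then dequantize."""
    scale = max_abs / qmax
    q = max(-qmax, min(qmax, round(weight / scale)))
    return q, q * scale

w = 0.015
for max_abs in (0.5, 2.54, 6.0):  # hypothetical largest |weight| in the tensor
    q, restored = roundtrip(w, max_abs)
    print(f"max |w| = {max_abs}: q = {q}, restored = {restored:.4f}, "
          f"error = {abs(restored - w):.4f}")
```

With a maximum of 0.5 the weight lands on integer 4 and comes back close to its original value. With 2.54 it rounds up to 1, and with 6.0 it collapses to 0, so a single large outlier can erase small weights entirely.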

Why K-quantization exists:

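The K in Q4_K_M and Q4_K_S refers to llama.cpp's "k-quant" family, not to k-means clustering. K-quants split each tensor into small blocks of weights, give every block its own scale (and, for some types, a minimum), and pack those blocks into 256-weight super-blocks whose per-block scales are themselves quantized. Because each scale only has to cover a few dozen weights, an outlier affects only its own block, which reduces accuracy loss compared to naive per-tensor quantization. The _S and _M suffixes denote small and medium mixes; the medium mix keeps some tensors at a higher-precision type.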
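
The benefit of handling weights in small groups rather than as one whole tensor can be sketched with plain block-wise scales. This is a simplified illustration and not llama.cpp's actual storage format: the block size of 32, the synthetic weights, the injected outlier, and all function names are assumptions for the example.

```python
import random

def quantize_roundtrip(values, qmax=7):
    """Symmetric quantize + dequantize with one shared scale (qmax=7 is the INT4 range)."""
    scale = max(max(abs(v) for v in values), 1e-10) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def blockwise_roundtrip(values, block_size=32, qmax=7):
    """Same scheme, but every block of block_size weights gets its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_roundtrip(values[start:start + block_size], qmax))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
weights[100] = 1.0  # a single outlier that dominates the per-tensor scale

per_tensor_err = mean_abs_error(weights, quantize_roundtrip(weights))
per_block_err = mean_abs_error(weights, blockwise_roundtrip(weights))
print(f"per-tensor: {per_tensor_err:.5f}  per-block: {per_block_err:.5f}")
```

With one shared scale, the outlier stretches the 4-bit grid so far that most ordinary weights round to zero. With per-block scales only the outlier's own block pays that price, at the cost of storing one extra scale per block.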

EXERCISE

Find the quantization method used in llama.cpp and compare it to GPTQ. List two advantages of each approach.
