RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Edge AI: Mobile and IoT
  6. /Ch. 8
Edge AI: Mobile and IoT

08. Extreme Quantization

Chapter 8 of 18 · 20 min
KEY INSIGHT

INT8 quantization typically reduces model size 4x with <1% accuracy loss, but aggressive quantization below 4-bit requires QAT or careful accuracy monitoring.

Quantization reduces numerical precision from 32-bit floating-point to 8-bit integer representations. This reduction cuts memory footprint by 4x, increases throughput by 2-4x on integer-capable hardware, and reduces power consumption by enabling efficient SIMD operations. Extreme quantization pushes toward 4-bit, 2-bit, or even binary representations.

INT8 quantization via PyTorch requires calibration with representative data:

import torch
from torch.quantization.quantize_fx import prepare_fx, convert_fx

# Prepare model for dynamic quantization
model.eval()

# Quantization configuration
qconfig = torch.quantization.get_default_qconfig('fbgemm')
model.prepared_for_quantization = prepare_fx(model, qconfig, example_inputs)

# Calibration with representative samples
with torch.no_grad():
    for batch in calibration_loader:
        model(batch)

# Convert to quantized model
quantized_model = convert_fx(model.prepared_for_quantization)

Training-aware quantization (QAT) produces better accuracy for aggressive quantization:

from torch.quantization import QuantStub, DeQuantStub

class QuantizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()
    
    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

Per-channel quantization improves accuracy by quantizing each output channel with its own scale factor:

# In converter
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Per-channel quantization uses symmetric ranges
def representative_dataset_generator():
    for image in dataset:
        yield [tf.cast(image, tf.float32)]

converter.representative_dataset = representative_dataset_generator
converter.experimental_enable_mlir_converter = True

tflite_model = converter.convert()

Accuracy degradation manifests differently across tasks. Image classification tolerates INT8 quantization without visible quality loss—output probabilities shift by <1%. Object detection suffers more severely because anchor assignments amplify quantization errors. Generative tasks (style transfer, image enhancement) exhibit banding artifacts at INT8 that INT16 intermediate calculations can mitigate.

Benchmarking quantized models requires comparing both accuracy and latency:

import numpy as np
import time

def benchmark_inference(interpreter, num_runs=100, warmup=10):
    # Warmup
    for _ in range(warmup):
        interpreter.invoke()
    
    # Benchmark
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.invoke()
        timings.append((time.perf_counter() - start) * 1000)
    
    return {
        "mean_ms": np.mean(timings),
        "std_ms": np.std(timings),
        "p95_ms": np.percentile(timings, 95)
    }
EXERCISE

Quantize a PyTorch model using dynamic quantization, compare inference speed and memory usage against the float32 baseline, and verify output quality degradation.

← Chapter 7
ML Kit for Android
Chapter 9 →
2-bit and 3-bit Quantization