Quantization Theory — Custom Quantization and Kernels (Chapter 1)

Quantization transforms continuous floating-point values into discrete representations that require less memory and compute. In local AI inference, this process determines how model weights and activations are stored and processed, directly impacting inference speed, memory consumption, and model accuracy.

The fundamental challenge of quantization lies in mapping a range of floating-point values to a limited set of discrete values. Consider a weight matrix containing values ranging from -2.5 to 3.8. Quantization must compress this continuous distribution into, for example, 256 distinct values (for int8) while keeping each quantized value close enough to the original that the layer's outputs change as little as possible.
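To make the numbers concrete, the following is a minimal sketch in plain Python (no torch dependency) that spreads 256 evenly spaced levels over the example range and snaps a few values to the nearest level. The helper name `snap` and the sample inputs are illustrative assumptions, not part of any library.

```python
# Illustrative sketch: map the example range [-2.5, 3.8] onto 256 evenly spaced levels.
lo, hi = -2.5, 3.8
levels = 256

# Spacing between adjacent representable values (255 gaps between 256 levels).
step = (hi - lo) / (levels - 1)


def snap(x):
    """Hypothetical helper: return (level index, reconstructed value) for x."""
    index = round((x - lo) / step)
    index = max(0, min(levels - 1, index))  # keep the index inside [0, 255]
    return index, lo + index * step


for x in (-2.5, 0.0, 1.2345, 3.8):
    index, approx = snap(x)
    print(f"{x:+.4f} -> level {index:3d} -> {approx:+.4f} (error {abs(x - approx):.4f})")
```

With evenly spaced levels, any value inside the range lands at most half a step away from its nearest level, so the step size sets the worst-case rounding error.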

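Quantization operates through two primary parameters: scale and zero-point. The scale (typically a float32 value) is the real-valued step between adjacent integer levels, while the zero-point is the integer level that represents the real value 0.0. Together they define the mapping q = round(x / scale) + zero_point, with x ≈ (q - zero_point) * scale on the way back. Because the zero-point is an integer, real zero maps exactly, which matters for zero padding and other values that must stay exactly zero.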

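import torch


class Quantizer:
    """Per-tensor quantizer: asymmetric (unsigned) or symmetric (signed)."""

    def __init__(self, bits=8, scheme="asymmetric"):
        self.bits = bits  # results are stored in 8-bit dtypes, so bits must be <= 8
        self.scheme = scheme
        self.scale = None
        self.zero_point = None

    def quantize(self, tensor):
        if self.scheme == "asymmetric":
            # Unsigned range [0, 2^bits - 1]; the zero-point shifts min_val onto qmin.
            qmin, qmax = 0, (1 << self.bits) - 1
            # Include 0.0 in the range so that real zero is exactly representable.
            min_val = tensor.min().clamp(max=0.0)
            max_val = tensor.max().clamp(min=0.0)
            # Guard against a zero scale when the tensor is all zeros.
            self.scale = ((max_val - min_val) / (qmax - qmin)).clamp(min=1e-8)
            self.zero_point = torch.round(qmin - min_val / self.scale)
            dtype = torch.uint8  # values up to 255 do not fit in int8
        else:  # symmetric
            # Signed range [-(2^(bits-1) - 1), 2^(bits-1) - 1], centered on zero.
            qmax = (1 << (self.bits - 1)) - 1
            qmin = -qmax
            max_val = tensor.abs().max()
            self.scale = (max_val / qmax).clamp(min=1e-8)
            self.zero_point = 0
            dtype = torch.int8

        quantized = torch.clamp(
            torch.round(tensor / self.scale) + self.zero_point,
            qmin, qmax
        ).to(dtype)

        return quantized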
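
The class above only produces integer codes; recovering approximate real values requires the inverse mapping. The following is a minimal sketch of that round trip in plain Python, written with standalone functions so it runs without torch. The function names `affine_params`, `quantize`, and `dequantize` are illustrative assumptions, and the range is widened to include 0.0, a common convention that this sketch assumes rather than one the text prescribes.

```python
def affine_params(min_val, max_val, qmin=0, qmax=255):
    """Hypothetical helper: derive (scale, zero_point) for an unsigned range."""
    # Assumption: widen the range to include 0.0 so zero is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to an integer code, clamped to [qmin, qmax]."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate real value."""
    return (q - zero_point) * scale


scale, zero_point = affine_params(-2.5, 3.8)
for x in (-2.5, 0.0, 1.0, 3.8):
    q = quantize(x, scale, zero_point)
    print(f"{x:+.3f} -> code {q:3d} -> {dequantize(q, scale, zero_point):+.5f}")
```

In this sketch the real value 0.0 is encoded as the zero-point itself, so dequantizing it returns exactly 0.0, while other values come back with an error of at most half a step.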

Two quantization approaches dominate practical implementations. Symmetric quantization fixes the zero-point at 0 and uses a single scale for a range centered on zero. This simplifies computation, but with the usual restricted int8 range of [-127, 127] it leaves one quantization level unused, and it wastes range when the data is skewed toward one sign. Asymmetric quantization adds a nonzero zero-point so that the integer range covers the actual [min, max] of the data, which gives a finer step for skewed distributions at the cost of extra zero-point arithmetic.
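
The difference shows up in the step size each scheme ends up with. The following is a minimal sketch in plain Python that compares the two scales on a roughly centered sample and on a one-sided sample; the function names and the sample values are illustrative assumptions.

```python
def symmetric_scale(values, bits=8):
    """Step size when the range is [-max|x|, +max|x|] over signed codes."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, so codes span [-127, 127]
    return max(abs(v) for v in values) / qmax


def asymmetric_scale(values, bits=8):
    """Step size when the range is [min, max] over unsigned codes."""
    steps = 2 ** bits - 1  # 255 gaps between 256 levels
    return (max(values) - min(values)) / steps


centered = [-3.0, -1.0, 0.0, 1.0, 3.0]  # hypothetical weight-like sample
skewed = [0.0, 0.5, 2.0, 4.0, 6.0]      # hypothetical one-sided sample

for name, sample in (("centered", centered), ("skewed", skewed)):
    sym = symmetric_scale(sample)
    asym = asymmetric_scale(sample)
    print(f"{name}: symmetric step {sym:.5f}, asymmetric step {asym:.5f}")
```

For the centered sample the two step sizes are nearly identical, while for the one-sided sample the symmetric step is roughly twice as coarse, because half of its integer range covers negative values that never occur.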

The quantization error, the difference between original and dequantized values, compounds through neural network layers. A layer operating on already-quantized inputs adds its own rounding error to the error it inherits, so small per-layer errors can grow with depth. This accumulation is why calibration (choosing ranges from representative data) matters, and why per-tensor quantization (one scale for the whole tensor) and per-channel quantization (one scale per output channel) can differ noticeably in accuracy.
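
The per-tensor versus per-channel difference can be illustrated on a tiny example. The following is a minimal sketch in plain Python, assuming symmetric quantization to [-127, 127] and a made-up 2x3 weight matrix whose rows differ in magnitude by about two orders; the helper names are illustrative assumptions.

```python
def symmetric_roundtrip(values, scale, qmax=127):
    """Quantize to signed integer codes with the given scale, then dequantize."""
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weight matrix: one small-magnitude row, one large-magnitude row.
weights = [
    [0.010, -0.020, 0.015],
    [1.500, -2.000, 0.800],
]

# Per-tensor: a single scale set by the largest magnitude in the whole matrix.
tensor_scale = max(abs(v) for row in weights for v in row) / 127
per_tensor_err = [
    max_abs_error(row, symmetric_roundtrip(row, tensor_scale)) for row in weights
]

# Per-channel: each row (treated here as one output channel) gets its own scale.
per_channel_err = []
for row in weights:
    row_scale = max(abs(v) for v in row) / 127
    per_channel_err.append(max_abs_error(row, symmetric_roundtrip(row, row_scale)))

for i, (t, c) in enumerate(zip(per_tensor_err, per_channel_err)):
    print(f"row {i}: per-tensor max error {t:.6f}, per-channel max error {c:.6f}")
```

In this example the large row is unaffected by the choice, but the small row loses most of its precision under the shared scale, since that scale is dictated entirely by the large row.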