Weight Quantization — Custom Quantization and Kernels (Chapter 2)

Weight quantization converts the static parameters of a neural network—weights and biases—from high-precision representations to lower bit-width formats. Unlike activation quantization, which occurs dynamically during inference, weight quantization happens once during model preparation, making it more tractable for optimization.

Per-channel weight quantization applies a separate scale factor for each output channel in convolutional and linear layers. This captures the variation in magnitude across filters: a channel with small weights is no longer forced onto a grid sized for the largest weight in the whole tensor, which reduces quantization error for many model architectures.

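// Per-channel weight quantization in C++
#include <cmath>
#include <cstdint>

struct QuantizedWeights {
    int8_t* data;           // quantized values, same layout as the input
    float* scales;          // one scale per output channel
    int32_t zero_point;     // 0 for symmetric quantization
    int channels;           // number of output channels
    int kernel_size;        // weights per (output, input) channel pair
};

// The caller owns the returned buffers and must delete[] data and scales.
QuantizedWeights quantize_weight_channel(
    const float* weight_fp32,
    int out_channels,
    int in_channels,
    int kernel_size = 1     // 1 for linear layers, kH * kW for convolutions
) {
    // Number of weights that share one scale
    const int per_channel = in_channels * kernel_size;

    QuantizedWeights qw;
    qw.channels = out_channels;
    qw.kernel_size = kernel_size;
    qw.zero_point = 0;  // symmetric
    qw.data = new int8_t[out_channels * per_channel];
    qw.scales = new float[out_channels];

    for (int oc = 0; oc < out_channels; oc++) {
        const float* src = weight_fp32 + oc * per_channel;
        int8_t* dst = qw.data + oc * per_channel;

        // Find max absolute value for this output channel
        float max_abs = 0.0f;
        for (int i = 0; i < per_channel; i++) {
            max_abs = std::fmax(max_abs, std::fabs(src[i]));
        }

        // An all-zero channel would give scale 0 and a division by zero below
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        qw.scales[oc] = scale;

        // Quantize each weight, clamping to the symmetric int8 range
        for (int i = 0; i < per_channel; i++) {
            float q = std::round(src[i] / scale);
            q = std::fmax(-127.0f, std::fmin(127.0f, q));
            dst[i] = (int8_t)q;
        }
    }

    return qw;
}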
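
One practical consequence of storing a single scale per output channel is that the scale can be applied once per output value, after the products for that channel have been accumulated. The following is a minimal Python sketch of that idea, not a production kernel: quantize_rows and linear_per_channel are hypothetical helper names, the activations are kept in floating point for simplicity, and the example weights are made up.

```python
def quantize_rows(weights):
    """Symmetric int8 quantization with one scale per row (output channel)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Note: Python's round() rounds halves to even, unlike C++ std::round
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def linear_per_channel(q_rows, scales, x):
    """Compute y = W x, applying each channel's scale once per output."""
    out = []
    for q_row, scale in zip(q_rows, scales):
        acc = sum(q * xi for q, xi in zip(q_row, x))  # unscaled accumulation
        out.append(acc * scale)                       # one multiply per channel
    return out


# Made-up example: the second channel is much smaller than the first
weights = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
x = [1.0, 2.0, 3.0]

q_rows, scales = quantize_rows(weights)
y_quant = linear_per_channel(q_rows, scales, x)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
```

Because each weight is off by at most half a quantization step, the error in each output is bounded by half that channel's scale times the sum of the absolute input values, so the small second channel keeps a proportionally small error.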

Group-wise quantization further segments weights within each channel into smaller blocks, typically ranging from 32 to 128 values per group. This finer granularity adapts to local weight distributions, improving accuracy for larger models where weight magnitudes vary significantly within a single channel.
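
The grouping described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: quantize_groupwise and dequantize_groupwise are hypothetical names, the scheme is symmetric int8, and the tiny group size of 4 is chosen only so the example stays readable.

```python
def quantize_groupwise(row, group_size=32):
    """Symmetric int8 quantization of one channel, one scale per group."""
    q_values, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]  # the last group may be shorter
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_groupwise(q_values, scales, group_size=32):
    """Map each int8 value back to float using its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# Made-up channel whose first half is roughly 100x larger than its second half
row = [0.8, -1.0, 0.6, 0.9, 0.008, -0.01, 0.006, 0.009]

q, scales = quantize_groupwise(row, group_size=4)
restored = dequantize_groupwise(q, scales, group_size=4)
```

With a single scale for the whole channel, the small values in the second half would land on only one or two quantization levels; giving them their own group lets them use the full int8 range.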

The choice of quantization granularity balances accuracy against storage overhead. Per-tensor quantization stores only one scale value per layer but sacrifices precision. Per-channel quantization maintains good accuracy while storing one scale per output channel. Group-wise quantization typically offers the best accuracy of the three, at the cost of additional scale metadata and more complex bookkeeping.
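
To make the storage trade-off concrete, the sketch below counts the scales each granularity needs and converts that into average bits per weight. The helper names scale_count and bits_per_weight are hypothetical, and the 4096 x 4096 layer shape, 32-bit scales, and group size of 128 are assumptions chosen for illustration only.

```python
import math


def scale_count(out_channels, in_features, granularity, group_size=128):
    """Number of scale values stored for one [out_channels, in_features] weight."""
    if granularity == "tensor":
        return 1
    if granularity == "channel":
        return out_channels
    if granularity == "group":
        # Each output channel is split into ceil(in_features / group_size) groups
        return out_channels * math.ceil(in_features / group_size)
    raise ValueError(f"unknown granularity: {granularity}")


def bits_per_weight(out_channels, in_features, granularity,
                    weight_bits=8, scale_bits=32, group_size=128):
    """Average storage per weight, including scale metadata."""
    n_weights = out_channels * in_features
    n_scales = scale_count(out_channels, in_features, granularity, group_size)
    return weight_bits + n_scales * scale_bits / n_weights


for g in ("tensor", "channel", "group"):
    print(g, scale_count(4096, 4096, g), bits_per_weight(4096, 4096, g))
```

Under these assumptions, group-wise scales add 0.25 bits per weight on top of the 8-bit values, while per-channel scales add under 0.01 bits. The relative overhead grows as the weight bit-width or the group size shrinks.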

When selecting quantization targets, prioritize layers whose weight distributions benefit most from granular scaling. Embedding layers typically quantize well at per-tensor granularity. Attention projections often require per-channel handling. Convolutional layers frequently benefit from per-channel or group-wise schemes because weight magnitudes vary widely from filter to filter.

# Analyzing weight distributions for quantization planning
import torch

def analyze_weight_distribution(weight_tensor, granularity="channel"):
    """Return the mean absolute error of a simulated symmetric int8 round trip."""
    weight_fp32 = weight_tensor.detach().float()

    if granularity == "channel" and weight_fp32.dim() > 1:
        # Assume weight shape: [out_channels, in_channels, ...];
        # flatten so each row holds every weight of one output channel
        flat = weight_fp32.reshape(weight_fp32.shape[0], -1)
    else:
        # Per-tensor: a single row, so one scale covers the whole tensor
        # (1-D tensors are also handled this way)
        flat = weight_fp32.reshape(1, -1)

    # One scale per row; the clamp guards against all-zero rows
    max_abs = flat.abs().amax(dim=1, keepdim=True)
    scales = max_abs.clamp(min=1e-12) / 127.0

    # Quantize onto the int8 grid, then dequantize. Without the rounding
    # step the round trip is lossless and the reported error is meaningless.
    quantized = torch.round(flat / scales).clamp(-127, 127)
    reconstructed = quantized * scales
    error = (flat - reconstructed).abs().mean()

    return error.item()