Activation Quantization — Custom Quantization and Kernels (Chapter 3)

Activation quantization poses challenges that weight quantization does not. Weights are static and known before inference, whereas activations change with every input sample, so their ranges must be estimated from calibration data or measured at runtime.

The fundamental problem is that activation ranges are not known until inference runs. The same layer might see values from 0.1 to 10.0 for one input and from 0.01 to 1.0 for another. A scale fixed during static calibration cannot adapt to this variation: a scale sized for the wider range wastes most of the int8 levels on the narrower one, and a scale sized for the narrower range saturates on the wider one.
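The effect of a mismatched scale can be made concrete with a small sketch. It assumes symmetric absmax scaling to the range [-127, 127]; the sample values and helper names (`absmax_scale`, `quantize`, `max_abs_error`) are illustrative and not taken from any particular library.

```python
def absmax_scale(values, qmax=127):
    """Symmetric per-tensor scale: the largest magnitude maps to qmax."""
    return max(abs(v) for v in values) / qmax

def quantize(values, scale, qmax=127):
    """Round to the nearest integer level and saturate to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def max_abs_error(values, scale):
    """Worst-case reconstruction error after a quantize/dequantize round trip."""
    q = quantize(values, scale)
    return max(abs(v - qi * scale) for v, qi in zip(values, q))

# Two hypothetical inputs to the same layer, using the ranges from the text
wide = [0.1, 2.5, -7.3, 10.0]
narrow = [0.01, 0.25, -0.73, 1.0]

static_scale = absmax_scale(wide)    # calibrated on the wide input only
fitted_scale = absmax_scale(narrow)  # measured on the narrow input itself

err_static = max_abs_error(narrow, static_scale)
err_fitted = max_abs_error(narrow, fitted_scale)
print(f"static scale error: {err_static:.5f}, fitted scale error: {err_fitted:.5f}")
```

With these numbers the static scale is ten times coarser than the fitted one, so the rounding error on the narrow input grows by roughly the same factor.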

// CUDA kernel for activation quantization with a per-tensor scale
#include <cstdint>
#include <cuda_runtime.h>

__global__ void quantize_activations_kernel(
    const float* __restrict__ input,
    int8_t* output,
    int size,
    float inv_scale
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        // Multiply by the precomputed reciprocal instead of dividing per element
        float quantized = roundf(input[idx] * inv_scale);

        // Saturate to the symmetric int8 range [-127, 127]; -128 is left unused
        quantized = fminf(127.0f, fmaxf(-127.0f, quantized));
        output[idx] = static_cast<int8_t>(quantized);
    }
}

// Launches the kernel on `stream`; `scale` must be positive.
void quantize_activations_launch(
    const float* input,
    int8_t* output,
    float scale,
    int size,
    cudaStream_t stream
) {
    if (size <= 0) return;  // a zero-sized grid is an invalid launch configuration

    const int threads = 256;
    dim3 block(threads);
    dim3 grid((size + threads - 1) / threads);

    // Computed once on the host; the kernel only needs the reciprocal
    float inv_scale = 1.0f / scale;

    quantize_activations_kernel<<<grid, block, 0, stream>>>(
        input, output, size, inv_scale
    );
}
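When validating a kernel like this, a host-side reference of the same arithmetic is often useful. The following is a minimal sketch in plain Python, assuming the kernel rounds ties away from zero (as C `roundf` does) and saturates to [-127, 127]. The function names are hypothetical, and the reference computes in double precision, so values that land extremely close to a rounding tie could differ from a float32 kernel.

```python
import math

def round_half_away(x):
    """Match C roundf: ties round away from zero (Python's round() uses ties-to-even)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def quantize_reference(values, scale):
    """CPU reference for per-tensor quantization: scale, round, saturate to [-127, 127]."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    inv_scale = 1.0 / scale
    return [max(-127, min(127, round_half_away(v * inv_scale))) for v in values]

def dequantize_reference(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

# 50.0 lies far outside the representable range and saturates at 127
codes = quantize_reference([0.0, 0.05, -0.25, 1.0, 50.0], scale=0.1)
restored = dequantize_reference(codes, scale=0.1)
```

Comparing `codes` against the bytes copied back from the device gives a quick correctness check before any performance tuning.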

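Per-token activation quantization assigns a separate scale factor to each token in a batch, accommodating the natural variation in activation magnitudes across sequence positions. Because each scale applies to a whole row of the activation matrix, it can be factored out of the matrix multiplication and applied to the matching output row. Per-channel activation scales lie along the reduction dimension and cannot be factored out this way, which is why per-token scaling is the more practical way to maintain precision across varied inputs.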
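A minimal sketch of per-token scaling, assuming activations arranged as a [tokens][hidden] matrix and symmetric absmax scaling per row. The function name and sample values are illustrative only.

```python
def quantize_per_token(activations, qmax=127):
    """
    Quantize a [tokens][hidden] matrix with one symmetric scale per token (row).
    Returns (codes, scales), where scales[i] dequantizes codes[i].
    """
    codes, scales = [], []
    for row in activations:
        absmax = max(abs(v) for v in row)
        scale = absmax / qmax if absmax > 0 else 1.0  # all-zero row: any scale works
        scales.append(scale)
        codes.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
    return codes, scales

acts = [
    [0.5, -10.0, 2.0],    # token with large activations
    [0.01, 0.03, -0.04],  # token with small activations
]
codes, scales = quantize_per_token(acts)
```

Each row uses the full int8 range regardless of its magnitude. A single per-tensor scale of 10/127 would instead round every value in the second row to zero.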

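SmoothQuant addresses activation outliers by redistribution: it transfers part of the quantization difficulty from activations to weights. Each input channel of the activations is divided by a smoothing factor and the matching weight channel is multiplied by the same factor, so the layer output is unchanged. A migration strength α (typically 0.5) controls how much of the magnitude moves to the weights, making the network more quantization-friendly.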

import torch

def smoothquant_activate_scales(layer_weight, layer_input, alpha=0.5):
    """
    Compute per-input-channel smoothing factors for SmoothQuant.

    layer_weight: [out_features, in_features]
    layer_input:  [..., in_features] calibration activations
    Returns s with shape [in_features]. Divide activations by s and
    multiply the matching weight columns by s to keep the output unchanged.
    """
    eps = 1e-5  # guards against division by zero for dead channels

    # Per-input-channel activation magnitude: max |x| over all tokens
    act_scales = layer_input.abs().reshape(-1, layer_input.shape[-1]).amax(dim=0)

    # Per-input-channel weight magnitude: max |w| over output rows
    weight_scales = layer_weight.abs().amax(dim=0)

    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    # Higher alpha -> more magnitude moves from activations to weights
    smoothing = act_scales.clamp(min=eps).pow(alpha) / weight_scales.clamp(min=eps).pow(1 - alpha)

    return smoothing
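The claim that smoothing leaves the layer output unchanged can be checked numerically. The sketch below uses plain Python lists so it runs without PyTorch, and assumes the per-channel factor from the SmoothQuant formulation, s_j = max|X_j|^α / max|W_j|^(1-α). The toy matrices and helper names are illustrative.

```python
def smoothing_factors(x, w, alpha=0.5):
    """s_j = max|X[:, j]|**alpha / max|W[:, j]|**(1 - alpha), one per input channel."""
    n_in = len(x[0])
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(row[j]) for row in w) for j in range(n_in)]
    return [a ** alpha / m ** (1 - alpha) for a, m in zip(act_max, w_max)]

def linear(x, w):
    """y = x @ w.T for x: [tokens][in] and w: [out][in]."""
    return [[sum(xi * wi for xi, wi in zip(xr, wr)) for wr in w] for xr in x]

# Toy data: input channel 1 is an activation outlier
x = [[0.5, 40.0], [-0.25, -60.0]]
w = [[0.2, 0.01], [-0.4, 0.02]]

s = smoothing_factors(x, w)
x_smooth = [[v / sj for v, sj in zip(row, s)] for row in x]  # activations shrink
w_smooth = [[v * sj for v, sj in zip(row, s)] for row in w]  # weights absorb the factor

y_ref = linear(x, w)
y_smooth = linear(x_smooth, w_smooth)
```

With α = 0.5 the per-channel maxima of the smoothed activations and smoothed weights become equal, so neither side carries a disproportionate outlier, while `y_smooth` matches `y_ref` up to floating-point rounding.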

Dynamic quantization computes activation scales during inference, recalculating them on every forward pass. The scales track the actual input, but each one requires an extra reduction over the activation tensor, and that overhead partially offsets the benefit of quantization. Practical systems often employ hybrid strategies: static, conservatively calibrated per-tensor scales for layers with stable ranges, and dynamic computation for layers with high activation variance.
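One way such a hybrid policy might be expressed is sketched below. The selection rule (relative spread of per-batch peaks against a threshold of 0.5) is an assumption made for illustration, as are the class and function names; real systems may use other statistics.

```python
class ActivationQuantizer:
    """Uses a static (calibrated) or dynamic (per-call) symmetric int8 scale."""

    def __init__(self, static_scale=None, qmax=127):
        self.static_scale = static_scale  # None means: compute the scale on every call
        self.qmax = qmax

    def scale_for(self, values):
        if self.static_scale is not None:
            return self.static_scale
        absmax = max(abs(v) for v in values)  # extra reduction on each forward pass
        return absmax / self.qmax if absmax > 0 else 1.0

    def __call__(self, values):
        scale = self.scale_for(values)
        codes = [max(-self.qmax, min(self.qmax, round(v / scale))) for v in values]
        return codes, scale

def relative_range_spread(calibration_batches):
    """(max - min) / max of the per-batch absmax values; a crude variance proxy."""
    peaks = [max(abs(v) for v in batch) for batch in calibration_batches]
    top = max(peaks)
    return (top - min(peaks)) / top if top > 0 else 0.0

def build_quantizer(calibration_batches, spread_threshold=0.5, qmax=127):
    """Static scale for stable layers, dynamic scale for high-variance ones."""
    if relative_range_spread(calibration_batches) > spread_threshold:
        return ActivationQuantizer(static_scale=None, qmax=qmax)
    peak = max(abs(v) for batch in calibration_batches for v in batch)
    return ActivationQuantizer(static_scale=peak / qmax if peak > 0 else 1.0, qmax=qmax)

stable = build_quantizer([[1.0, -0.9], [0.95, 0.8]])     # peaks 1.0 and 0.95
volatile = build_quantizer([[10.0, -3.0], [0.5, 0.1]])   # peaks 10.0 and 0.5
codes, scale = volatile([0.5, 0.1])
```

The stable layer pays no runtime cost for its scale, while the volatile layer recomputes it per call and therefore keeps full resolution even on the small-magnitude input.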

The interplay between activation quantization and hardware support determines practical performance. Modern accelerators provide native int8 matrix multiplication, so quantized inference is fast only when the critical path stays in low precision and avoids dequantizing back to floating point between operations.
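The arithmetic behind staying in low precision can be sketched as follows: the inner product is accumulated entirely in integers, and the two scales are applied once at the end. This is a plain-Python illustration with per-tensor symmetric scales and made-up values; on real hardware the accumulation would typically happen in int32 inside the matrix-multiply unit.

```python
def quantize(values, scale, qmax=127):
    """Symmetric quantization to integer codes in [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def int8_dot(x_codes, w_codes):
    """Integer-only inner product; no floating-point values are involved."""
    return sum(a * b for a, b in zip(x_codes, w_codes))

x = [0.8, -1.2, 0.3, 2.0]        # activations
w = [0.05, 0.11, -0.20, 0.15]    # weights
x_scale = max(abs(v) for v in x) / 127
w_scale = max(abs(v) for v in w) / 127

x_q = quantize(x, x_scale)
w_q = quantize(w, w_scale)

acc = int8_dot(x_q, w_q)               # stays in the integer domain
y_quant = acc * (x_scale * w_scale)    # single dequantization at the end
y_float = sum(a * b for a, b in zip(x, w))
print(f"float: {y_float:.5f}, quantized: {y_quant:.5f}")
```

Because the scales factor out of the sum, dequantizing each operand before multiplying would give the same result at a much higher cost, which is the work the critical path should avoid.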