RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom Quantization and Kernels

03. Activation Quantization

Chapter 3 of 18 · 20 min
KEY INSIGHT

Activation quantization requires balancing runtime flexibility against computational overhead, with techniques like SmoothQuant redistributing the quantization difficulty from activations to easier-to-quantize weights.

Activation quantization presents challenges that weight quantization does not. Weights are static and known before inference, whereas activations vary with each input sample, so their ranges must be measured or estimated at runtime.

The fundamental problem is that activation ranges are unknown until inference begins. A model processing different inputs encounters varying activation magnitudes: the same layer might see values ranging from 0.1 to 10.0 with one input and from 0.01 to 1.0 with another. Static calibration fixes one range ahead of time from sample data, so it cannot adapt to this runtime variation.
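
To make the range mismatch concrete, here is a minimal NumPy sketch (NumPy is used only to keep the example self-contained). It assumes a static scale calibrated for values up to 10.0 and compares the relative round-trip error on the two input ranges mentioned above. The helper names and the random inputs are illustrative assumptions, not part of any library.

```python
import numpy as np

def quantize_dequantize(x, scale):
    """Symmetric int8 round trip: quantize with `scale`, then map back to float."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def relative_error(x, scale):
    """Mean absolute round-trip error, relative to the mean input magnitude."""
    return np.abs(quantize_dequantize(x, scale) - x).mean() / np.abs(x).mean()

# Static scale calibrated for magnitudes up to 10.0 (assumed calibration result)
static_scale = 10.0 / 127.0

rng = np.random.default_rng(0)
large_input = rng.uniform(0.1, 10.0, size=1000)   # matches the calibrated range
small_input = rng.uniform(0.01, 1.0, size=1000)   # ten times smaller

err_large = relative_error(large_input, static_scale)
err_small_static = relative_error(small_input, static_scale)

# Scale fitted to the small input itself, as a runtime measurement would give
fitted_scale = np.abs(small_input).max() / 127.0
err_small_fitted = relative_error(small_input, fitted_scale)

print(f"large input, static scale: {err_large:.4f}")
print(f"small input, static scale: {err_small_static:.4f}")
print(f"small input, fitted scale: {err_small_fitted:.4f}")
```

The small input uses only a few of the 255 available levels under the static scale, so its relative error is noticeably higher than when the scale is fitted to it.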

// CUDA kernel for activation quantization with a per-tensor scale
#include <cstdint>
#include <cuda_runtime.h>

__global__ void quantize_activations_kernel(
    const float* __restrict__ input,
    int8_t* __restrict__ output,
    float inv_scale,  // 1 / scale, precomputed on the host
    int size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        // Scale into the int8 domain and round to the nearest integer
        float quantized = roundf(input[idx] * inv_scale);

        // Clamp to the symmetric int8 range [-127, 127] (saturation)
        quantized = fminf(127.0f, fmaxf(-127.0f, quantized));
        output[idx] = static_cast<int8_t>(quantized);
    }
}

void quantize_activations_launch(
    const float* input,
    int8_t* output,
    float scale,  // must be > 0
    int size,
    cudaStream_t stream
) {
    if (size <= 0) return;  // a zero-sized grid is an invalid launch

    dim3 block(256);
    dim3 grid((size + block.x - 1) / block.x);

    // Multiply by the reciprocal in the kernel instead of dividing per element
    float inv_scale = 1.0f / scale;

    quantize_activations_kernel<<<grid, block, 0, stream>>>(
        input, output, inv_scale, size
    );
}

Per-token activation quantization assigns a separate scale factor to each token in a batch, accommodating the natural variation in activation magnitudes across sequence positions. This maintains precision across varied inputs. Because each scale applies to a whole row of the activation matrix, it can be applied to the output after the int8 matrix multiplication, which per-channel activation scales along the inner dimension do not allow.
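
The per-token scheme can be sketched in NumPy as follows. This is an illustration under assumed shapes (`[num_tokens, hidden]`) with hypothetical helper names, not a production kernel. It compares per-token scales against a single per-tensor scale when one token has much larger activations than the others.

```python
import numpy as np

def quantize_per_token(x):
    """Symmetric int8 quantization with one scale per row (token) of x."""
    # x: [num_tokens, hidden]; scales: [num_tokens, 1]
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scales = np.maximum(max_abs, 1e-8) / 127.0   # guard against all-zero rows
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

def quantize_per_tensor(x):
    """Symmetric int8 quantization with a single scale for the whole tensor."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
x[0] *= 50.0   # one token with much larger activations than the rest

q_tok, s_tok = quantize_per_token(x)
q_ten, s_ten = quantize_per_tensor(x)

# Reconstruction error on the three ordinary tokens (rows 1..3)
err_per_token = np.abs(q_tok.astype(np.float32) * s_tok - x)[1:].mean()
err_per_tensor = np.abs(q_ten.astype(np.float32) * s_ten - x)[1:].mean()

print(f"per-token error:  {err_per_token:.5f}")
print(f"per-tensor error: {err_per_tensor:.5f}")
```

With a single per-tensor scale, the large token dictates the step size for every row. Per-token scales keep the ordinary tokens at fine resolution at the cost of storing one extra float per token.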

SmoothQuant addresses activation quantization challenges by redistribution: it divides each activation channel by a per-channel smoothing factor and multiplies the matching weights by the same factor, so the layer output is unchanged while part of the activation magnitude moves into the weights. A migration strength α (typically 0.5) controls how much of the difficulty is transferred, making the network more quantization-friendly.

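import torch

def smoothquant_activate_scales(layer_weight, layer_input, alpha=0.5, eps=1e-5):
    """
    Compute per-channel activation scaling factors for SmoothQuant.
    Moves magnitude from activations to weights.

    layer_weight: [out_features, in_features]
    layer_input:  [num_tokens, in_features] calibration activations
    Returns a multiplier of shape [in_features] for the activations; multiply
    the weight columns by its reciprocal to keep the layer output unchanged.
    """
    # Per-input-channel activation magnitude (maximum, so outliers are captured)
    act_scales = layer_input.abs().amax(dim=0).clamp(min=eps)  # [in_features]

    # Per-input-channel weight magnitude (reduce over output rows)
    weight_scales = layer_weight.abs().amax(dim=0).clamp(min=eps)  # [in_features]

    # Smoothing factor s_j = act_j^alpha / weight_j^(1 - alpha);
    # higher alpha -> more magnitude moves to weights
    smooth = torch.pow(act_scales, alpha) / torch.pow(weight_scales, 1 - alpha)

    # Activations are divided by s (and weights multiplied by s)
    act_scaling = 1.0 / smooth

    return act_scaling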
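
The smoothing idea can be checked numerically. The NumPy sketch below assumes the formulation s_j = max|X_j|^α / max|W_j|^(1-α), with activations divided by s and weight columns multiplied by s. It verifies that the layer output is unchanged while the activation outlier channel shrinks. The function names and the synthetic outlier channel are illustrative assumptions.

```python
import numpy as np

def smoothing_factors(x, w, alpha=0.5, eps=1e-5):
    """Per-input-channel factors s_j = max|x_j|^alpha / max|w_j|^(1 - alpha)."""
    act_max = np.maximum(np.abs(x).max(axis=0), eps)   # [in_features]
    w_max = np.maximum(np.abs(w).max(axis=0), eps)     # [in_features]
    return act_max**alpha / w_max**(1.0 - alpha)

def outlier_ratio(a):
    """Largest per-channel maximum divided by the median per-channel maximum."""
    m = np.abs(a).max(axis=0)
    return m.max() / np.median(m)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))           # [tokens, in_features]
x[:, 3] *= 40.0                            # an outlier activation channel
w = rng.standard_normal((32, 16)) * 0.1    # [out_features, in_features]

s = smoothing_factors(x, w)
x_smooth = x / s    # activations are divided by s
w_smooth = w * s    # matching weight columns are multiplied by s

y_ref = x @ w.T
y_smooth = x_smooth @ w_smooth.T

print("outputs match:", np.allclose(y_ref, y_smooth))
print(f"activation outlier ratio before: {outlier_ratio(x):.1f}")
print(f"activation outlier ratio after:  {outlier_ratio(x_smooth):.1f}")
```

With α = 0.5 the per-channel maxima of the smoothed activations and smoothed weights become equal, so the difficulty is split evenly between the two.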

Dynamic quantization computes activation scales during inference, recalculating them on every forward pass. The scales always match the current input, but computing them requires an extra reduction over each activation tensor, and that overhead partially offsets the benefits of quantization. Practical systems often employ hybrid strategies: conservative static per-tensor scales for stable layers, and dynamic computation for layers with high activation variance.
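
One way such a hybrid policy might look is sketched below. The selection rule (coefficient of variation of per-batch calibration maxima) and the `variance_threshold` value are assumptions chosen for illustration, not a rule taken from any particular framework.

```python
import numpy as np

def dynamic_scale(x):
    """Per-tensor scale recomputed from the current activations."""
    return max(float(np.abs(x).max()), 1e-8) / 127.0

def choose_strategy(calibration_maxima, variance_threshold=0.25):
    """Pick 'static' or 'dynamic' for a layer from its calibration statistics.

    calibration_maxima: max |activation| observed in each calibration batch.
    variance_threshold: hypothetical cutoff on the coefficient of variation.
    """
    maxima = np.asarray(calibration_maxima, dtype=np.float64)
    coeff_of_variation = maxima.std() / max(maxima.mean(), 1e-12)
    return "dynamic" if coeff_of_variation > variance_threshold else "static"

def layer_scale(x, strategy, static_scale):
    """Return the scale to use for this forward pass."""
    return dynamic_scale(x) if strategy == "dynamic" else static_scale

# A layer whose range barely moves, and one whose range swings widely
stable_layer = choose_strategy([4.0, 4.1, 3.9, 4.05])
volatile_layer = choose_strategy([0.8, 9.5, 2.0, 6.3])

print("stable layer:  ", stable_layer)
print("volatile layer:", volatile_layer)
```

Only layers flagged as volatile pay for the runtime reduction. The rest reuse a precomputed scale.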

The interplay between activation quantization and hardware support determines practical performance. Many modern accelerators provide native int8 matrix multiplication, so the speedup from quantization depends on keeping critical paths in low precision and avoiding round trips through dequantization.
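
The arithmetic pattern behind this can be emulated in NumPy: both operands stay in int8, products accumulate in int32, and a single rescale at the end recovers floating-point values. This is a sketch of the numerics only (NumPy does not dispatch to hardware int8 units), and the helper names are assumptions.

```python
import numpy as np

def quantize(x, scale):
    """Symmetric int8 quantization with a per-tensor scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def int8_linear(x_q, w_q, x_scale, w_scale):
    """Integer matmul with int32 accumulation and one rescale at the end."""
    # Widen before multiplying: sums of int8 products do not fit in int8
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # [tokens, out_features]
    # Single dequantization step: both scales factor out of the sum
    return acc.astype(np.float32) * (x_scale * w_scale)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)           # [tokens, in]
w = (rng.standard_normal((16, 64)) * 0.1).astype(np.float32)  # [out, in]

x_scale = float(np.abs(x).max()) / 127.0
w_scale = float(np.abs(w).max()) / 127.0

y_int8 = int8_linear(quantize(x, x_scale), quantize(w, w_scale), x_scale, w_scale)
y_fp32 = x @ w.T
max_abs_error = float(np.abs(y_int8 - y_fp32).max())

print(f"max abs error vs fp32: {max_abs_error:.4f}")
```

Because the scales are applied once to the accumulated result, no intermediate tensor has to be converted back to floating point inside the matrix multiplication.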

EXERCISE

Implement a dynamic quantization system that tracks running statistics of activation magnitudes and updates scale factors periodically. Measure the accuracy impact compared to static calibration using the first 100 samples as calibration data.
