RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom Quantization and Kernels

02. Weight Quantization

Chapter 2 of 18 · 20 min
KEY INSIGHT

Per-channel weight quantization captures the natural variation in filter magnitudes within each layer, making it the preferred approach for most modern model architectures.

Weight quantization converts the static parameters of a neural network—weights and biases—from high-precision representations to lower bit-width formats. Unlike activation quantization, which occurs dynamically during inference, weight quantization happens once during model preparation, making it more tractable for optimization.

Per-channel weight quantization applies a separate scale factor for each output channel in convolutional and linear layers. A channel with small weights gets a correspondingly small scale instead of sharing one set by the largest weight in the whole tensor, which reduces quantization error for many model architectures.

// Per-channel weight quantization in C++
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QuantizedWeights {
    int8_t* data;           // quantized values, row-major [channels][kernel_size]
    float* scales;          // one scale per output channel
    int32_t zero_point;     // 0 for symmetric quantization
    int channels;
    int kernel_size;        // number of weights per output channel
};

// The caller owns qw.data and qw.scales and must release them with delete[].
QuantizedWeights quantize_weight_channel(
    const float* weight_fp32,
    int out_channels,
    int in_channels
) {
    QuantizedWeights qw;
    qw.channels = out_channels;
    qw.kernel_size = in_channels;
    qw.data = new int8_t[out_channels * in_channels];
    qw.scales = new float[out_channels];

    for (int oc = 0; oc < out_channels; oc++) {
        const float* row = weight_fp32 + oc * in_channels;

        // Find max absolute value for this output channel
        float max_abs = 0.0f;
        for (int ic = 0; ic < in_channels; ic++) {
            max_abs = std::max(max_abs, std::fabs(row[ic]));
        }

        // An all-zero channel would give scale 0 and a division by zero below
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        qw.scales[oc] = scale;

        // Quantize each weight, clamping to the symmetric int8 range
        for (int ic = 0; ic < in_channels; ic++) {
            float q = std::round(row[ic] / scale);
            q = std::min(127.0f, std::max(-127.0f, q));
            qw.data[oc * in_channels + ic] = static_cast<int8_t>(q);
        }
    }

    qw.zero_point = 0;  // symmetric
    return qw;
}
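
The C++ listing stores int8 values plus one scale per channel. Recovering approximate float weights is then a per-row multiply. The NumPy sketch below shows that round trip, assuming a 2-D weight matrix with one row per output channel. The names `quantize_per_channel` and `dequantize_per_channel` are illustrative and do not come from the course code.

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric int8 quantization with one scale per row (output channel)."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # An all-zero row gets scale 1.0 so the division below is safe.
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.rint(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_per_channel(q, scales):
    """Recover approximate float weights: w_hat = q * scale."""
    return q.astype(np.float64) * scales

rng = np.random.default_rng(0)
# Hypothetical test data: four output channels whose magnitudes differ by 10x each.
w = rng.standard_normal((4, 16)) * np.array([[0.01], [0.1], [1.0], [10.0]])

q, scales = quantize_per_channel(w)
w_hat = dequantize_per_channel(q, scales)

# Round-to-nearest keeps every element within half a quantization step
# of its original value, and the step is set separately for each channel.
within_half_step = np.abs(w - w_hat) <= 0.5 * scales + 1e-9
print(within_half_step.all())
```

Because the step size follows each row's own maximum, the small-magnitude channels keep roughly the same relative precision as the large ones.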

Group-wise quantization further segments weights within each channel into smaller blocks, typically ranging from 32 to 128 values per group. This finer granularity adapts to local weight distributions, improving accuracy for larger models where weight magnitudes vary significantly within a single channel.
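
As a rough illustration of the group-wise scheme, the NumPy sketch below splits each row into fixed-size blocks and gives every block its own scale. The function names, the synthetic weight matrix, and the requirement that the row length be divisible by the group size are all assumptions of this sketch, not part of the course code.

```python
import numpy as np

def quantize_groupwise(w, group_size=64):
    """Symmetric int8 quantization with one scale per block of group_size values in each row."""
    out_channels, in_features = w.shape
    if in_features % group_size != 0:
        raise ValueError("this sketch requires in_features divisible by group_size")
    groups = w.reshape(out_channels, in_features // group_size, group_size)
    max_abs = np.abs(groups).max(axis=2, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)  # guard all-zero groups
    q = np.clip(np.rint(groups / scales), -127, 127).astype(np.int8)
    return q.reshape(out_channels, in_features), scales.squeeze(2)

def dequantize_groupwise(q, scales, group_size=64):
    """Multiply each block by its scale and restore the original 2-D shape."""
    out_channels, in_features = q.shape
    groups = q.reshape(out_channels, -1, group_size).astype(np.float64)
    return (groups * scales[:, :, None]).reshape(out_channels, in_features)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 256))
w[:, :64] *= 20.0  # synthetic case: the first block of every row is much larger

q_g, s_g = quantize_groupwise(w, 64)
err_group = np.abs(w - dequantize_groupwise(q_g, s_g, 64)).mean()

# A group size equal to the row length reduces to per-channel quantization.
q_c, s_c = quantize_groupwise(w, 256)
err_channel = np.abs(w - dequantize_groupwise(q_c, s_c, 256)).mean()

print(f"group-wise mean abs error:  {err_group:.5f}")
print(f"per-channel mean abs error: {err_channel:.5f}")
```

On this synthetic matrix the large first block sets the per-channel scale for the whole row, so the three smaller blocks lose precision. Group-wise scaling avoids that at the cost of four scales per row instead of one.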

The choice of quantization granularity balances accuracy against storage overhead. Per-tensor quantization stores only one scale value per layer but sacrifices precision. Per-channel quantization maintains good accuracy while storing one scale per output channel. Group-wise quantization offers the best accuracy per memory budget at the cost of increased metadata complexity.
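
The storage side of this trade-off is simple arithmetic. The sketch below counts scale values and the resulting effective bits per weight for each granularity. The 4096 x 4096 layer, 4-bit weights, 16-bit scales, and group size of 64 are hypothetical values chosen for illustration, and `scale_overhead` is not a function from any library.

```python
def scale_overhead(out_channels, in_features, weight_bits=8, scale_bits=16, group_size=64):
    """Return {granularity: (num_scales, effective_bits_per_weight)} for one layer."""
    n_weights = out_channels * in_features
    counts = {
        "per-tensor": 1,
        "per-channel": out_channels,
        "group-wise": out_channels * (in_features // group_size),
    }
    # Effective bits per weight = payload bits + scale bits amortized over all weights.
    return {
        name: (n, weight_bits + n * scale_bits / n_weights)
        for name, n in counts.items()
    }

# Hypothetical layer: 4096 x 4096 weights stored at 4 bits with 16-bit scales.
overhead = scale_overhead(4096, 4096, weight_bits=4, scale_bits=16, group_size=64)
for name, (n_scales, bits) in overhead.items():
    print(f"{name:12s} scales={n_scales:7d} bits/weight={bits:.6f}")
```

Under these assumptions, per-channel scales add well under 0.01 bits per weight, while group-wise scales at size 64 add 0.25 bits per weight. Halving the group size doubles that overhead.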

When selecting quantization targets, prioritize layers where weight distribution most benefits from granular scaling. Embedding layers typically quantize well at per-tensor granularity. Attention projections often require per-channel handling. Convolutional layers frequently benefit from per-channel or group-wise schemes due to large filter variations.

# Analyzing weight distributions for quantization planning
import torch

def analyze_weight_distribution(weight_tensor, granularity="channel"):
    """Return the mean absolute int8 round-trip error for a weight tensor."""
    weight_fp32 = weight_tensor.detach().float()

    if granularity == "channel" and weight_fp32.dim() > 1:
        # Assume weight shape: [out_channels, in_channels, ...];
        # flatten so each row holds all weights of one output channel
        flat = weight_fp32.reshape(weight_fp32.shape[0], -1)
    else:
        # Per-tensor: treat the whole tensor as a single row with one scale
        flat = weight_fp32.reshape(1, -1)

    # Symmetric int8 scale per row; clamp avoids division by zero for all-zero rows
    scales = (flat.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-12)

    # Quantize (round and clamp to the int8 range), then dequantize
    quantized = torch.round(flat / scales).clamp(-127, 127)
    reconstructed = quantized * scales
    error = (flat - reconstructed).abs().mean()

    return error.item()
EXERCISE

Implement a function that compares quantization error across per-tensor, per-channel, and group-wise (size 64) granularities for a pre-trained linear layer. Report the error reduction ratio for each approach relative to per-tensor baseline.
