RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Custom Quantization and Kernels
  6. /Ch. 10
Custom Quantization and Kernels

10. Kernel Optimization

Chapter 10 of 18 · 15 min
KEY INSIGHT

Kernel optimization follows diminishing returns—measure before optimizing, and focus on the dominant bottleneck (memory bandwidth, instruction throughput, or latency hiding).

Performance optimization for custom quantization kernels requires systematic profiling and iterative improvement. The goal is achieving peak throughput while maintaining numerical accuracy across quantized weight distributions.

Occupancy and Thread Block Design

Thread block occupancy directly impacts GPU utilization. Calculate theoretical occupancy using shared memory and register constraints:

__global__ void quantized_gemm_kernel(
    const int8_t* __restrict__ A,
    const int8_t* __restrict__ B,
    float* C, int M, int N, int K,
    const float* scale_a, const float* scale_b,
    float output_scale, int vec_len) {
    
    __shared__ float shA[128][4];
    __shared__ float shB[128][4];
    
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int block_x = blockIdx.x;
    int block_y = blockIdx.y;
    
    float acc[4][4] = {0.0f};
    
    for (int kb = 0; kb < (K + vec_len - 1) / vec_len; kb++) {
        // Cooperative load into shared memory
        int base_k = kb * vec_len;
        load_vector_gemm(shA, shB, A, B, block_x, block_y,
                         tx, ty, base_k, vec_len, K);
        __syncthreads();
        
        // Compute partial result
        for (int k = 0; k < vec_len; k++) {
            for (int i = 0; i < 4; i++) {
                for (int j = 0; j < 4; j++) {
                    acc[i][j] += shA[tx * 4 + i][k] * shB[ty * 4 + j][k];
                }
            }
        }
        __syncthreads();
    }
    
    // Write output with scaling
    write_output(C, acc, block_x, block_y, tx, ty, M, N, 
                 scale_a, scale_b, output_scale);
}

Instruction-Level Optimization

Loop unrolling reduces pipeline stalls. Manually unroll inner loops when trip counts are known or bounded:

#pragma unroll 8
for (int i = 0; i < 8; i++) {
    int idx = base_idx + i;
    if (idx < K) {
        int8_t a_val = ldg(&A[idx * K + row_a]);
        int8_t b_val = ldg(&B[col_b * K + idx]);
        accumulator += convert_to_float(a_val) * convert_to_float(b_val);
    }
}

Warp-level primitives enable efficient reductions and data movement without synchronization barriers.

EXERCISE

Profile a quantized GEMM kernel with nvtx ranges and identify the limiting factor using ncu metrics. Implement one optimization and measure the improvement.

← Chapter 9
Memory Coalescing
Chapter 11 →
TensorRT Plugin Development