08. CUDA Kernel Basics

Chapter 8 of 18 · 20 min

CUDA kernel authoring for quantized inference requires understanding memory hierarchy, thread organization, and SIMT execution model. Properly structured kernels fully utilize GPU resources while respecting the constraints of low-precision arithmetic.

A CUDA kernel executes across a grid of thread blocks, each containing threads that share memory and synchronize. For quantized inference, the primary kernel pattern is block-wise quantization where each block processes a tensor segment independently, enabling trivial parallelization.

// Basic quantized GEMV kernel structure
#include <cuda_fp16.h>

// Device function for quantized multiply-accumulate
__device__ __forceinline__ float qmac_int8(
    int8_t a, int8_t b, float scale_a, float scale_b, float acc
) {
    // Dequantize and multiply in float
    float da = __int2float_rn(a) * scale_a;
    float db = __int2float_rn(b) * scale_b;
    return fma(da, db, acc);
}

__global__ void quantized_gemv_kernel(
    const int8_t* __restrict__ weight_q,
    const int8_t* __restrict__ input_q,
    const float* __restrict__ weight_scale,
    const float* __restrict__ input_scale,
    float* __restrict__ output,
    int rows,      // output dimension
    int cols,      // input dimension (must be multiple of 32)
    int weight_zero_point
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row >= rows) return;
    
    // Each thread computes one output element
    int32_t accum_i = 0;
    
    // Process SIMD-friendly 32-element chunks
    const int CHUNK = 32;
    for (int col_base = 0; col_base < cols; col_base += CHUNK) {
        // Load weights and inputs as unified memory transactions
        __builtin_nv_load_unaligned2((uint4*)(weight_q + row * cols + col_base));
        
        for (int i = 0; i < CHUNK; i++) {
            int8_t w = weight_q[row * cols + col_base + i];
            int8_t in_val = input_q[col_base + i];
            
            // Convert from unsigned representation
            uint8_t uw = (uint8_t)(w - weight_zero_point);
            uint8_t uin = (uint8_t)in_val;
            
            // Subtract zero points for correct computation
            accum_i += (uw - weight_zero_point) * (uin - input_zero_point);
        }
    }
    
    // Final dequantization with scales
    float dequant_scale = weight_scale[row] * input_scale;
    output[row] = accum_i * dequant_scale;
}

Memory coalescing dramatically affects kernel performance. Accessing memory in patterns that align with GPU memory transaction sizes ensures full bandwidth utilization. For quantized data, the frequently irregular Access patterns of compressed formats require careful design to avoid memory divergence.

// Memory-coalesced quantized kernel with block-wise loading
__global__ void coalesced_q4_kernel(
    const uint8_t* __restrict__ qweight,    // packed Q4 data
    const float* __restrict__ qscale,       // per-block scales
    const int8_t* __restrict__ input,
    float* output,
    int64_t num_blocks,
    int64_t input_size
) {
    // Grid dimensions
    int output_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int block_idx = blockIdx.y;
    
    // Each warp loads contiguous memory, then distributes
    int warp_id = threadIdx.x / 32;
    int lane_id = threadIdx.x % 32;
    
    // Shared memory for block data
    __shared__ float block_scales[32];
    __shared__ uint8_t block_weights[32 * 64];  // 32 weights per warp
    
    // Load scales (warp-level coalesced access)
    if (lane_id < 32) {
        block_scales[lane_id] = qscale[block_idx * 32 + lane_id];
    }
    __syncthreads();
    
    // Each warp loads its weight block with coalesced 128-byte transaction
    uint4* input_base = (uint4*)(qweight + block_idx * 32 * 64 + warp_id * 64);
    if (warp_id < 32) {
        uint4 wdata = input_base[lane_id];
        *((uint4*)(block_weights + warp_id * 64 + lane_id * 8)) = wdata;
    }
    __syncthreads();
    
    // Compute dot product for this output element
    float accum = 0.0f;
    for (int i = 0; i < 32; i++) {
        float w = dequantize_q4_nbit(block_weights[i * 32 + lane_id],
                                     block_scales[i]);
        float in_val = (float)input[block_idx * 32 + i] / input_scale;
        accum += w * in_val;
    }
    
    if (output_idx < output_size) {
        output[output_idx] = accum;
    }
}

Thread block sizing requires balancing occupancy against register pressure. Occupancy—active warps per multiprocessor—determines latency hiding capability, while register usage limits this occupancy. For typical quantized kernels, block sizes of 128 or 256 threads provide good occupancy without exhausting registers.

Warps execute instructions synchronously, meaning divergence in branch conditions serializes execution. Quantized weight decoding often requires conditional logic that causes warp divergence. Techniques like predication or stripping allow handling this divergence without fully serializing execution.

EXERCISE

Implement a CUDA kernel that dequantizes Q4_K format weights and performs matrix multiplication. Profile the kernel at different thread block sizes (64, 128, 256) to find optimal occupancy, measuring throughput in GFLOPS.