08. CUDA Kernel Basics
CUDA kernel authoring for quantized inference requires understanding memory hierarchy, thread organization, and SIMT execution model. Properly structured kernels fully utilize GPU resources while respecting the constraints of low-precision arithmetic.
A CUDA kernel executes across a grid of thread blocks, each containing threads that share memory and synchronize. For quantized inference, the primary kernel pattern is block-wise quantization where each block processes a tensor segment independently, enabling trivial parallelization.
// Basic quantized GEMV kernel structure
#include <cuda_fp16.h>
// Device function for quantized multiply-accumulate
__device__ __forceinline__ float qmac_int8(
int8_t a, int8_t b, float scale_a, float scale_b, float acc
) {
// Dequantize and multiply in float
float da = __int2float_rn(a) * scale_a;
float db = __int2float_rn(b) * scale_b;
return fma(da, db, acc);
}
__global__ void quantized_gemv_kernel(
const int8_t* __restrict__ weight_q,
const int8_t* __restrict__ input_q,
const float* __restrict__ weight_scale,
const float* __restrict__ input_scale,
float* __restrict__ output,
int rows, // output dimension
int cols, // input dimension (must be multiple of 32)
int weight_zero_point
) {
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row >= rows) return;
// Each thread computes one output element
int32_t accum_i = 0;
// Process SIMD-friendly 32-element chunks
const int CHUNK = 32;
for (int col_base = 0; col_base < cols; col_base += CHUNK) {
// Load weights and inputs as unified memory transactions
__builtin_nv_load_unaligned2((uint4*)(weight_q + row * cols + col_base));
for (int i = 0; i < CHUNK; i++) {
int8_t w = weight_q[row * cols + col_base + i];
int8_t in_val = input_q[col_base + i];
// Convert from unsigned representation
uint8_t uw = (uint8_t)(w - weight_zero_point);
uint8_t uin = (uint8_t)in_val;
// Subtract zero points for correct computation
accum_i += (uw - weight_zero_point) * (uin - input_zero_point);
}
}
// Final dequantization with scales
float dequant_scale = weight_scale[row] * input_scale;
output[row] = accum_i * dequant_scale;
}
Memory coalescing dramatically affects kernel performance. Accessing memory in patterns that align with GPU memory transaction sizes ensures full bandwidth utilization. For quantized data, the frequently irregular Access patterns of compressed formats require careful design to avoid memory divergence.
// Memory-coalesced quantized kernel with block-wise loading
__global__ void coalesced_q4_kernel(
const uint8_t* __restrict__ qweight, // packed Q4 data
const float* __restrict__ qscale, // per-block scales
const int8_t* __restrict__ input,
float* output,
int64_t num_blocks,
int64_t input_size
) {
// Grid dimensions
int output_idx = blockIdx.x * blockDim.x + threadIdx.x;
int block_idx = blockIdx.y;
// Each warp loads contiguous memory, then distributes
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
// Shared memory for block data
__shared__ float block_scales[32];
__shared__ uint8_t block_weights[32 * 64]; // 32 weights per warp
// Load scales (warp-level coalesced access)
if (lane_id < 32) {
block_scales[lane_id] = qscale[block_idx * 32 + lane_id];
}
__syncthreads();
// Each warp loads its weight block with coalesced 128-byte transaction
uint4* input_base = (uint4*)(qweight + block_idx * 32 * 64 + warp_id * 64);
if (warp_id < 32) {
uint4 wdata = input_base[lane_id];
*((uint4*)(block_weights + warp_id * 64 + lane_id * 8)) = wdata;
}
__syncthreads();
// Compute dot product for this output element
float accum = 0.0f;
for (int i = 0; i < 32; i++) {
float w = dequantize_q4_nbit(block_weights[i * 32 + lane_id],
block_scales[i]);
float in_val = (float)input[block_idx * 32 + i] / input_scale;
accum += w * in_val;
}
if (output_idx < output_size) {
output[output_idx] = accum;
}
}
Thread block sizing requires balancing occupancy against register pressure. Occupancy—active warps per multiprocessor—determines latency hiding capability, while register usage limits this occupancy. For typical quantized kernels, block sizes of 128 or 256 threads provide good occupancy without exhausting registers.
Warps execute instructions synchronously, meaning divergence in branch conditions serializes execution. Quantized weight decoding often requires conditional logic that causes warp divergence. Techniques like predication or stripping allow handling this divergence without fully serializing execution.
Implement a CUDA kernel that dequantizes Q4_K format weights and performs matrix multiplication. Profile the kernel at different thread block sizes (64, 128, 256) to find optimal occupancy, measuring throughput in GFLOPS.