10. Kernel Optimization
Chapter 10 of 18 · 15 min
Performance optimization for custom quantization kernels requires systematic profiling and iterative improvement. The goal is achieving peak throughput while maintaining numerical accuracy across quantized weight distributions.
Occupancy and Thread Block Design
Thread block occupancy directly impacts GPU utilization. Calculate theoretical occupancy using shared memory and register constraints:
__global__ void quantized_gemm_kernel(
const int8_t* __restrict__ A,
const int8_t* __restrict__ B,
float* C, int M, int N, int K,
const float* scale_a, const float* scale_b,
float output_scale, int vec_len) {
__shared__ float shA[128][4];
__shared__ float shB[128][4];
int tx = threadIdx.x;
int ty = threadIdx.y;
int block_x = blockIdx.x;
int block_y = blockIdx.y;
float acc[4][4] = {0.0f};
for (int kb = 0; kb < (K + vec_len - 1) / vec_len; kb++) {
// Cooperative load into shared memory
int base_k = kb * vec_len;
load_vector_gemm(shA, shB, A, B, block_x, block_y,
tx, ty, base_k, vec_len, K);
__syncthreads();
// Compute partial result
for (int k = 0; k < vec_len; k++) {
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
acc[i][j] += shA[tx * 4 + i][k] * shB[ty * 4 + j][k];
}
}
}
__syncthreads();
}
// Write output with scaling
write_output(C, acc, block_x, block_y, tx, ty, M, N,
scale_a, scale_b, output_scale);
}
Instruction-Level Optimization
Loop unrolling reduces pipeline stalls. Manually unroll inner loops when trip counts are known or bounded:
#pragma unroll 8
for (int i = 0; i < 8; i++) {
int idx = base_idx + i;
if (idx < K) {
int8_t a_val = ldg(&A[idx * K + row_a]);
int8_t b_val = ldg(&B[col_b * K + idx]);
accumulator += convert_to_float(a_val) * convert_to_float(b_val);
}
}
Warp-level primitives enable efficient reductions and data movement without synchronization barriers.
EXERCISE
Profile a quantized GEMM kernel with nvtx ranges and identify the limiting factor using ncu metrics. Implement one optimization and measure the improvement.