Memory Coalescing — Custom Quantization and Kernels (Chapter 9)

Memory coalescing transforms scattered memory access patterns into sequential access, enabling GPUs to service memory requests efficiently through wide transactions. In quantized inference kernels, achieving coalesced access often requires restructuring data layouts or computation order.

GPU memory systems serve 32-byte, 64-byte, or 128-byte transactions depending on architecture. Uncoalesced access serving one thread independently wastes bus bandwidth—the GPU fetches data for each thread individually rather than loading chunks shared across warps.

// Uncoalesced access anti-pattern: strided access per thread
__global__ void bad_pattern_kernel(
    const int8_t* weight,  // row-major storage
    const int8_t* input,
    float* output,
    int rows, int cols
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int row = tid / cols;
    int col = tid % cols;
    
    // BAD: Each thread accesses column with stride == rows
    if (row < rows && col < cols) {
        output[row * cols + col] = (float)weight[col * rows + row] * input[col];
    }
}

Fixing coalescing issues often involves transposing data layouts or using shared memory as a staging area. For quantized weight matrices where storage is column-major, loading a tile into shared memory enables subsequent computations to access the data row-wise.

// Shared memory tiling for coalesced access
#define TILE_SIZE 32
#define WARP_SIZE 32

__global__ void tiled_quantized_gemv(
    const int8_t* __restrict__ weight_q,      // [cols, rows] column-major
    const int8_t* __restrict__ input_q,
    const float* __restrict__ weight_scale,
    const float* __restrict__ input_scale,
    float* __restrict__ output,
    int rows, int cols
) {
    __shared__ int8_t weight_tile[TILE_OFFSET][TILE_SIZE];
    __shared__ int8_t input_tile[TILE_SIZE];
    
    int output_col = blockIdx.x * TILE_SIZE;
    int local_row = threadIdx.y;
    
    int32_t accum[TILE_SIZE / 4] = {0};  // Vectorized accumulation
    
    // Process weight matrix in column tiles
    for (int tile_base = 0; tile_base < cols; tile_base += TILE_SIZE) {
        // Phase 1: Load weight tile with COLUMN-MAJOR coalescing
        // Each warp loads adjacent columns
        int weight_col = tile_base + threadIdx.y;
        for (int i = 0; i < TILE_SIZE / 4; i++) {
            int row = tile_base + i * 4 + threadIdx.x;
            if (row < rows && weight_col < cols) {
                // This accesses weight[weight_col * rows + row]
                // -> stride of rows, BUT coalesced across warps
                int4 vals = *((int4*)(weight_q + weight_col * rows + row));
                *((int4*)(weight_tile[threadIdx.y]) + i) = vals;
            }
        }
        
        // Phase 2: Load corresponding input tile
        if (threadIdx.y < TILE_SIZE && output_col + threadIdx.y < cols) {
            input_tile[threadIdx.y] = input_q[tile_base + threadIdx.y];
        }
        __syncthreads();
        
        // Phase 3: Compute using ROW access within tile (coalesced from shared)
        for (int k = 0; k < TILE_SIZE; k++) {
            float w = (float)weight_tile[k][local_row] * weight_scale[output_col + local_row];
            float in = (float)input_tile[k] / *input_scale;
            accum[local_row / 4] += w * in;
        }
        __syncthreads();
    }
    
    // Write output with simple coalescing
    if (output_col + local_row < cols) {
        output[output_col + local_row] = reduce_vector(accum[local_row / 4]);
    }
}

Bank conflicts occur when multiple threads access the same shared memory bank simultaneously, serializing access. Shared memory has 32 banks, typically with 4-byte access per bank. Sequential accesses to consecutive 4-byte addresses cause one bank per thread; accessing the same bank across threads creates conflicts.

For quantized kernels, ensuring each thread's shared memory accesses target different banks prevents serialization. Bank pattern awareness influences tile dimensions and access ordering, particularly important when dequantization computations access scale factors adjacent to weight data.

// Bank-safe access pattern for scale + weight access
__shared__ float scale_buffer[32];  // Padded to avoid bank conflicts
__shared__ int8_t weight_buffer[32][33];  // 33 columns breaks bank alignment

__global__ void bank_safe_kernel(
    const int8_t* weight_q,
    const float* scales,
    // ...
) {
    int tid = threadIdx.x;
    int my_scale = scale_buffer[tid];  // Each thread different bank
    
    // Weight access with stride to avoid bank conflicts
    int8_t w0 = weight_buffer[tid / 32][tid % 32];      // thread 0-31
    int8_t w1 = weight_buffer[tid / 32][tid % 32 + 1];  // thread 0-31 different bank
    
    float result = (float)w0 * my_scale + (float)w1 * scale_buffer[(tid + 1) % 32];
    // w0 and w1 avoid same-bank conflict
}

Global memory access efficiency metrics measure actual memory bandwidth utilization. The GPU memory analyser provides this data; achieving 80% efficiency indicates nearly optimal memory access patterns. Quantized kernels frequently fall below this threshold due to their compressed, non-sequential data representations, making coalescing optimization particularly valuable.