Deploying Custom Kernels — Custom Quantization and Kernels (Chapter 16)

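Deployment involves packaging the kernels with their runtime dependencies, ensuring compatibility across target environments (GPU architecture, driver, and CUDA runtime version), and versioning the library so that callers can detect interface changes.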
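As one illustration of the compatibility concern, the sketch below checks whether a device could run a given build. It relies on two general CUDA rules: a cubin built for sm_XY runs on devices with the same major version and a minor version of at least Y, and PTX built for compute_XY can be JIT-compiled on any device of capability XY or higher. The function name and the way architectures are listed are hypothetical, not part of this chapter's library, and architecture-specific targets (such as those with an "a" suffix) are not handled.

```python
from typing import Iterable, Tuple

# (major, minor) compute capability, for example (8, 0) for sm_80.
Arch = Tuple[int, int]


def can_run(device: Arch, cubin_archs: Iterable[Arch], ptx_archs: Iterable[Arch]) -> bool:
    """Return True if a build with these embedded targets can run on `device`.

    Hypothetical helper: the caller supplies the list of targets the library
    was compiled for (for example from its build configuration).
    """
    # Native code (cubin) is compatible within one major version only,
    # and only with devices whose minor version is at least the build's.
    for major, minor in cubin_archs:
        if device[0] == major and device[1] >= minor:
            return True
    # Embedded PTX is forward compatible: the driver can JIT-compile it
    # for any device whose capability is at least the PTX version.
    return any(device >= arch for arch in ptx_archs)
```

For example, a build that embeds only an sm_80 cubin runs on an (8, 6) device but not on a (9, 0) device; adding compute_80 PTX to the same build makes the (9, 0) device usable at the cost of JIT compilation on first load.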

Kernel Library Structure

quantized_kernels/
├── include/
│   ├── quantized_ops.h
│   └── kernel_config.h
├── src/
│   ├── gemm_int8.cu
│   ├── attention_fp8.cu
│   └── quantization_utils.cu
├── lib/
│   ├── libquantized_kernels.a
│   └── libquantized_kernels.so
├── python/
│   ├── __init__.py
│   ├── quantized_ops.py
│   └── _kernel_bindings.so
├── setup.py
└── README.md

C++ Kernel Interface

// include/quantized_ops.h
#pragma once

#include <cuda_runtime.h>
#include <cstdint>
#include <memory>   // std::unique_ptr
#include <vector>   // std::vector

namespace qk {

struct GemmConfig {
    int m, n, k;
    const int8_t* a;
    const int8_t* b;
    int32_t* c;
    const float* scale_a;
    const float* scale_b;
    float output_scale;
    cudaStream_t stream;
};

enum class QuantizationMode {
    Symmetric,
    Asymmetric,
    Block,
    GPTQ
};

struct QuantizedTensor {
    void* data;
    std::vector<int64_t> shape;
    QuantizationMode mode;
    std::vector<float> scales;
    std::vector<int32_t> zero_points;
    int32_t bits;
};

// Main kernel entry point
cudaError_t quantized_gemm(const GemmConfig& config);

// Utility functions
std::unique_ptr<QuantizedTensor> quantize(
    const float* data, const std::vector<int64_t>& shape,
    QuantizationMode mode, const std::vector<float>& scales);

std::unique_ptr<float[]> dequantize(
    const QuantizedTensor& tensor);

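}  // namespace qk

// C-linkage wrapper for the ctypes bindings. ctypes resolves unmangled symbol
// names only, so it cannot call qk::quantized_gemm directly. The wrapper fills
// a GemmConfig from these arguments and returns the cudaError_t as an int.
// The stream is not part of this signature, so the wrapper must choose one.
extern "C" int quantized_gemm(
    int m, int n, int k,
    const int8_t* a, const int8_t* b, int32_t* c,
    const float* scale_a, const float* scale_b,
    float output_scale);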
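The header declares quantize and dequantize without showing their arithmetic. As a rough host-side reference for QuantizationMode::Symmetric, the following sketch assumes a single per-tensor scale, round-to-nearest, and clamping to the full signed range for the given bit width. The function names are hypothetical, and the library's actual rounding and clamping rules may differ.

```python
from typing import List, Sequence


def quantize_symmetric(values: Sequence[float], scale: float, bits: int = 8) -> List[int]:
    """Map floats to signed integers with q = clamp(round(x / scale))."""
    qmax = 2 ** (bits - 1) - 1   # 127 for 8 bits
    qmin = -qmax - 1             # -128 for 8 bits
    # Python's round() rounds exact halves to the nearest even integer;
    # a CUDA kernel may use a different tie-breaking rule.
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize_symmetric(q: Sequence[int], scale: float) -> List[float]:
    """Recover approximate floats with x ~= q * scale (zero point is 0)."""
    return [x * scale for x in q]


codes = quantize_symmetric([0.0, 0.5, -0.5, 100.0], scale=0.25)
restored = dequantize_symmetric(codes, scale=0.25)
```

With a scale of 0.25, the value 100.0 maps to 400 before clamping and saturates at 127, so it dequantizes to 31.75 rather than 100.0. Values outside roughly 127 times the scale are lost in this way, which is why the scale is normally chosen from the tensor's observed range.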

Python Bindings

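# python/quantized_ops.py
import ctypes
import torch


def _ptr(tensor: torch.Tensor, ctype):
    """Cast a tensor's raw address (an int from data_ptr()) to a typed pointer."""
    return ctypes.cast(tensor.data_ptr(), ctypes.POINTER(ctype))


class QuantizedGEMM:
    def __init__(self, lib_path='lib/libquantized_kernels.so'):
        self.lib = ctypes.CDLL(lib_path)

        # ctypes resolves unmangled symbols only, so this assumes the shared
        # library exports an extern "C" quantized_gemm with this flat signature.
        self.lib.quantized_gemm.argtypes = [
            ctypes.c_int, ctypes.c_int, ctypes.c_int,
            ctypes.POINTER(ctypes.c_int8),
            ctypes.POINTER(ctypes.c_int8),
            ctypes.POINTER(ctypes.c_int32),
            ctypes.POINTER(ctypes.c_float),
            ctypes.POINTER(ctypes.c_float),
            ctypes.c_float
        ]
        # The return value is the cudaError_t status (0 is cudaSuccess).
        self.lib.quantized_gemm.restype = ctypes.c_int

    def forward(self, a: torch.Tensor, b: torch.Tensor,
                scale_a: torch.Tensor, scale_b: torch.Tensor,
                output_scale: float) -> torch.Tensor:
        """Execute quantized GEMM; allocates and returns the int32 output."""
        assert a.is_cuda and b.is_cuda
        assert a.dtype == torch.int8 and b.dtype == torch.int8
        assert scale_a.dtype == torch.float32 and scale_b.dtype == torch.float32

        m, k = a.shape
        k2, n = b.shape
        assert k == k2, "inner dimensions of a and b must match"

        # The kernel receives raw pointers, so every tensor must be dense.
        a, b = a.contiguous(), b.contiguous()
        scale_a, scale_b = scale_a.contiguous(), scale_b.contiguous()
        c = torch.zeros((m, n), dtype=torch.int32, device=a.device)

        status = self.lib.quantized_gemm(
            m, n, k,
            _ptr(a, ctypes.c_int8),
            _ptr(b, ctypes.c_int8),
            _ptr(c, ctypes.c_int32),
            _ptr(scale_a, ctypes.c_float),
            _ptr(scale_b, ctypes.c_float),
            ctypes.c_float(output_scale)
        )
        if status != 0:
            raise RuntimeError(f"quantized_gemm failed with CUDA error {status}")

        return c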
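The bindings pass tensor storage to the library as typed pointers. torch.Tensor.data_ptr() takes no arguments and returns the address as a plain integer, so the address has to be cast before ctypes will accept it for a POINTER(...) parameter. The sketch below shows that cast on a host buffer, using ctypes.addressof as a stand-in for data_ptr(); the buffer contents are made-up sample values.

```python
import ctypes

# Host-side stand-in for a tensor's storage (sample values only).
buf = (ctypes.c_int8 * 4)(1, -2, 3, -4)

# A plain int address, the same kind of value data_ptr() returns.
address = ctypes.addressof(buf)

# Cast the integer address to a typed pointer matching the argtypes entry.
typed = ctypes.cast(address, ctypes.POINTER(ctypes.c_int8))

# Indexing dereferences the pointer. This is safe here because the memory
# is on the host; a CUDA device address must never be dereferenced this way.
values = [typed[i] for i in range(4)]
```

The cast does not copy or own the memory, so the original buffer (or tensor) must stay alive until the library call returns.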

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
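One way to keep such a record is a small helper that captures the interpreter and platform alongside the values noted by hand. This is only a sketch: the function name, field names, and sample values are hypothetical, and package versions are passed in by the caller rather than detected.

```python
import json
import platform


def environment_record(data_path, observed_output, package_versions=None, notes=""):
    """Collect the details needed to reproduce a local run as a plain dict."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Versions are supplied by the caller, e.g. {"torch": torch.__version__}.
        "packages": dict(package_versions or {}),
        "data_path": str(data_path),
        "observed_output": observed_output,
        # Free-form constraints: GPU model, available memory, backend, and so on.
        "notes": notes,
    }


record = environment_record(
    data_path="data/sample.bin",          # placeholder path
    observed_output="max abs error 0",    # placeholder result
    package_versions={"example_pkg": "0.0.0"},
    notes="placeholder: record GPU model and memory here",
)
serialized = json.dumps(record, indent=2)
```

Saving the serialized record next to the exercise output makes it easier to tell later whether a differing result comes from the code or from the environment.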