16. Deploying Custom Kernels
Chapter 16 of 18 · 20 min
Deployment involves packaging kernels with runtime dependencies, ensuring compatibility across target environments, and managing versioning.
Kernel Library Structure
quantized_kernels/
├── include/
│   ├── quantized_ops.h
│   └── kernel_config.h
├── src/
│   ├── gemm_int8.cu
│   ├── attention_fp8.cu
│   └── quantization_utils.cu
├── lib/
│   └── libquantized_kernels.a
├── python/
│   ├── __init__.py
│   └── _kernel_bindings.so
├── setup.py
└── README.md
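Before packaging, it can help to confirm that a checkout or build output actually matches this layout. The following is a minimal sketch, not part of the library: `EXPECTED_FILES` and `missing_files` are hypothetical names introduced here, and the path list is copied from the tree above.

```python
from pathlib import Path

# Relative paths taken from the tree above; adjust if your layout differs.
EXPECTED_FILES = [
    "include/quantized_ops.h",
    "include/kernel_config.h",
    "src/gemm_int8.cu",
    "src/attention_fp8.cu",
    "src/quantization_utils.cu",
    "lib/libquantized_kernels.a",
    "python/__init__.py",
    "python/_kernel_bindings.so",
    "setup.py",
    "README.md",
]


def missing_files(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("quantized_kernels")` returns an empty list when every file is present, so a release script could refuse to build a package whenever the list is non-empty.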
C++ Kernel Interface
// include/quantized_ops.h
#pragma once

#include <cuda_runtime.h>

#include <cstdint>
#include <memory>  // std::unique_ptr
#include <vector>  // std::vector

namespace qk {

struct GemmConfig {
    int m, n, k;
    const int8_t* a;
    const int8_t* b;
    int32_t* c;
    const float* scale_a;
    const float* scale_b;
    float output_scale;
    cudaStream_t stream;
};

enum class QuantizationMode {
    Symmetric,
    Asymmetric,
    Block,
    GPTQ
};

struct QuantizedTensor {
    void* data;
    std::vector<int64_t> shape;
    QuantizationMode mode;
    std::vector<float> scales;
    std::vector<int32_t> zero_points;
    int32_t bits;
};

// Main kernel entry point
cudaError_t quantized_gemm(const GemmConfig& config);

// Utility functions
std::unique_ptr<QuantizedTensor> quantize(
    const float* data, const std::vector<int64_t>& shape,
    QuantizationMode mode, const std::vector<float>& scales);

std::unique_ptr<float[]> dequantize(
    const QuantizedTensor& tensor);

}  // namespace qk
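If the struct is ever passed across the language boundary, Python needs a byte-for-byte mirror of it. The sketch below shows how `GemmConfig` could be described with `ctypes.Structure`. It assumes a hypothetical `extern "C"` wrapper that accepts a `const GemmConfig*`; the header above does not declare one, and `make_config` is a helper invented here for illustration.

```python
import ctypes


class GemmConfig(ctypes.Structure):
    """ctypes mirror of qk::GemmConfig; field order must match the C++ struct."""

    _fields_ = [
        ("m", ctypes.c_int),
        ("n", ctypes.c_int),
        ("k", ctypes.c_int),
        ("a", ctypes.POINTER(ctypes.c_int8)),
        ("b", ctypes.POINTER(ctypes.c_int8)),
        ("c", ctypes.POINTER(ctypes.c_int32)),
        ("scale_a", ctypes.POINTER(ctypes.c_float)),
        ("scale_b", ctypes.POINTER(ctypes.c_float)),
        ("output_scale", ctypes.c_float),
        # cudaStream_t is an opaque pointer type; NULL selects the default stream.
        ("stream", ctypes.c_void_p),
    ]


def make_config(m, n, k, output_scale=1.0):
    """Build a config with null data pointers, useful for checking the layout."""
    return GemmConfig(m=m, n=n, k=k, output_scale=output_scale, stream=None)
```

Comparing `ctypes.sizeof(GemmConfig)` against `sizeof(qk::GemmConfig)` printed from a small C++ program is a cheap way to catch padding or field-order mismatches before any kernel is launched.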
Python Bindings
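# python/quantized_ops.py
import ctypes

import torch


class QuantizedGEMM:
    # lib_path must point to a shared library: ctypes cannot load a static
    # archive (.a). The library is assumed to export an extern "C" wrapper
    # named quantized_gemm with the flattened arguments listed below, since
    # the name-mangled qk::quantized_gemm(const GemmConfig&) is not callable
    # through ctypes.
    def __init__(self, lib_path='lib/libquantized_kernels.so'):
        self.lib = ctypes.CDLL(lib_path)
        self.lib.quantized_gemm.argtypes = [
            ctypes.c_int, ctypes.c_int, ctypes.c_int,
            ctypes.POINTER(ctypes.c_int8),
            ctypes.POINTER(ctypes.c_int8),
            ctypes.POINTER(ctypes.c_int32),
            ctypes.POINTER(ctypes.c_float),
            ctypes.POINTER(ctypes.c_float),
            ctypes.c_float
        ]
        self.lib.quantized_gemm.restype = ctypes.c_int  # cudaError_t

    @staticmethod
    def _ptr(tensor, ctype):
        # Tensor.data_ptr() takes no arguments and returns an integer address.
        return ctypes.cast(tensor.data_ptr(), ctypes.POINTER(ctype))

    def forward(self, a: torch.Tensor, b: torch.Tensor,
                scale_a: torch.Tensor, scale_b: torch.Tensor,
                output_scale: float) -> torch.Tensor:
        """Execute quantized GEMM with automatic tensor management."""
        assert a.is_cuda and b.is_cuda
        assert a.dtype == torch.int8 and b.dtype == torch.int8
        assert scale_a.is_cuda and scale_b.is_cuda
        assert scale_a.dtype == torch.float32 and scale_b.dtype == torch.float32
        m, k = a.shape
        k2, n = b.shape
        assert k == k2
        # contiguous() guarantees dense row-major memory behind the raw pointers.
        a, b = a.contiguous(), b.contiguous()
        scale_a, scale_b = scale_a.contiguous(), scale_b.contiguous()
        c = torch.zeros((m, n), dtype=torch.int32, device=a.device)
        status = self.lib.quantized_gemm(
            m, n, k,
            self._ptr(a, ctypes.c_int8),
            self._ptr(b, ctypes.c_int8),
            self._ptr(c, ctypes.c_int32),
            self._ptr(scale_a, ctypes.c_float),
            self._ptr(scale_b, ctypes.c_float),
            output_scale
        )
        if status != 0:  # cudaSuccess is 0
            raise RuntimeError(f"quantized_gemm failed with CUDA error {status}")
        return c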
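On a target machine, the most common deployment failure is that the shared library is missing or one of its dependencies cannot be resolved. A loader that reports this clearly might look like the sketch below. `load_kernel_library` is a hypothetical helper, and the default path and the `quantized_gemm` symbol name are taken from the binding above rather than from a confirmed build.

```python
import ctypes
from pathlib import Path


def load_kernel_library(lib_path="lib/libquantized_kernels.so"):
    """Load the shared library, raising a descriptive error on failure."""
    path = Path(lib_path)
    if not path.is_file():
        raise FileNotFoundError(f"kernel library not found: {path}")
    try:
        lib = ctypes.CDLL(str(path))
    except OSError as exc:
        # Typically an unresolved dependency such as the CUDA runtime,
        # or a file that is not a loadable shared object.
        raise ImportError(f"could not load {path}: {exc}") from exc
    if not hasattr(lib, "quantized_gemm"):
        # ctypes resolves symbols lazily; a missing export surfaces here.
        raise ImportError(f"{path} does not export quantized_gemm")
    return lib
```

Failing at import time with the offending path in the message is usually easier to debug than a bare `OSError` raised from deep inside the first `forward` call.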
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
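One way to keep such a record consistent between runs is to collect it in code rather than by hand. This is a minimal sketch using only the standard library; `environment_record` and its field names are illustrative choices, not a format the chapter prescribes.

```python
import json
import platform
from datetime import datetime, timezone


def environment_record(package_version, data_path, observed_output, notes=""):
    """Collect the details the checkpoint asks for into a JSON-serializable dict."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "package_version": package_version,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "data_path": str(data_path),
        "observed_output": observed_output,
        "notes": notes,
    }


def to_json(record):
    """Serialize a record with stable key order so runs are easy to diff."""
    return json.dumps(record, indent=2, sort_keys=True)
```

When PyTorch is installed, `torch.version.cuda` and `torch.cuda.get_device_name(0)` could be added to the `notes` field to capture the GPU backend alongside the other details.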
EXERCISE
Create a deployable Python package with your quantized kernels, including automated tests and documentation generation.