RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom Quantization and Kernels

16. Deploying Custom Kernels

Chapter 16 of 18 · 20 min
KEY INSIGHT

Versioning kernels is critical: check that each build's CUDA version is compatible with the target environment, and provide fallback implementations for unsupported GPU architectures.

Deployment involves packaging kernels with runtime dependencies, ensuring compatibility across target environments, and managing versioning.
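
One way to make the fallback idea concrete is a small dispatch table keyed by minimum GPU compute capability. The sketch below is illustrative only: `KERNEL_TABLE`, `select_kernel`, and the implementation labels are hypothetical names, and the capability thresholds are assumptions that should be checked against the architectures your kernels were actually compiled for.

```python
from typing import List, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (8, 9) for sm_89

# Hypothetical dispatch table, most specialized implementation first.
# The thresholds are illustrative assumptions, not values from the library.
KERNEL_TABLE: List[Tuple[Capability, str]] = [
    ((8, 9), "gemm_fp8"),
    ((7, 5), "gemm_int8_tensor_core"),
    ((6, 1), "gemm_int8_dp4a"),
]

# Used when no specialized kernel supports the device.
FALLBACK = "gemm_fp32_reference"


def select_kernel(capability: Capability) -> str:
    """Return the first implementation whose minimum capability is met."""
    for minimum, name in KERNEL_TABLE:
        # Tuples compare element-wise: major first, then minor.
        if capability >= minimum:
            return name
    return FALLBACK
```

With PyTorch installed, the capability tuple could come from `torch.cuda.get_device_capability()`. Keeping the fallback outside the table means an unknown or older device still gets a working, if slower, path.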

Kernel Library Structure

quantized_kernels/
├── include/
│   ├── quantized_ops.h
│   └── kernel_config.h
├── src/
│   ├── gemm_int8.cu
│   ├── attention_fp8.cu
│   └── quantization_utils.cu
├── lib/
│   ├── libquantized_kernels.a
│   └── libquantized_kernels.so
├── python/
│   ├── __init__.py
│   ├── quantized_ops.py
│   └── _kernel_bindings.so
├── setup.py
└── README.md

C++ Kernel Interface

// include/quantized_ops.h
#pragma once

#include <cuda_runtime.h>
#include <cstdint>
#include <memory>   // std::unique_ptr
#include <vector>   // std::vector

namespace qk {

struct GemmConfig {
    int m, n, k;
    const int8_t* a;
    const int8_t* b;
    int32_t* c;
    const float* scale_a;
    const float* scale_b;
    float output_scale;
    cudaStream_t stream;
};

enum class QuantizationMode {
    Symmetric,
    Asymmetric,
    Block,
    GPTQ
};

struct QuantizedTensor {
    void* data;
    std::vector<int64_t> shape;
    QuantizationMode mode;
    std::vector<float> scales;
    std::vector<int32_t> zero_points;
    int32_t bits;
};

// Main kernel entry point
cudaError_t quantized_gemm(const GemmConfig& config);

// Utility functions
std::unique_ptr<QuantizedTensor> quantize(
    const float* data, const std::vector<int64_t>& shape,
    QuantizationMode mode, const std::vector<float>& scales);

std::unique_ptr<float[]> dequantize(
    const QuantizedTensor& tensor);

}  // namespace qk
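
If the struct-based entry point is to be reached from Python, the caller needs a ctypes mirror of `GemmConfig` whose field order and types match the C++ definition exactly. The sketch below shows one possible mirror. It assumes the library also exports an `extern "C"` wrapper that accepts a `const GemmConfig*`, since a C++ function taking a reference in a namespace is name-mangled. That wrapper is hypothetical and does not appear in the header above. `make_config` is likewise an illustrative helper.

```python
import ctypes


class GemmConfig(ctypes.Structure):
    """ctypes mirror of qk::GemmConfig; order and types must match the C++ struct."""
    _fields_ = [
        ("m", ctypes.c_int),
        ("n", ctypes.c_int),
        ("k", ctypes.c_int),
        ("a", ctypes.POINTER(ctypes.c_int8)),
        ("b", ctypes.POINTER(ctypes.c_int8)),
        ("c", ctypes.POINTER(ctypes.c_int32)),
        ("scale_a", ctypes.POINTER(ctypes.c_float)),
        ("scale_b", ctypes.POINTER(ctypes.c_float)),
        ("output_scale", ctypes.c_float),
        # cudaStream_t is an opaque pointer; None (NULL) means the default stream.
        ("stream", ctypes.c_void_p),
    ]


def make_config(m, n, k, a_ptr, b_ptr, c_ptr, scale_a_ptr, scale_b_ptr,
                output_scale, stream=None):
    """Build a GemmConfig from raw integer addresses (hypothetical helper)."""
    int8_p = ctypes.POINTER(ctypes.c_int8)
    int32_p = ctypes.POINTER(ctypes.c_int32)
    float_p = ctypes.POINTER(ctypes.c_float)
    return GemmConfig(
        m, n, k,
        ctypes.cast(a_ptr, int8_p),
        ctypes.cast(b_ptr, int8_p),
        ctypes.cast(c_ptr, int32_p),
        ctypes.cast(scale_a_ptr, float_p),
        ctypes.cast(scale_b_ptr, float_p),
        output_scale,
        stream,
    )
```

The addresses would typically be device pointers obtained from the tensor library, so the Python side must never dereference them. The struct would then be passed to the assumed wrapper with `ctypes.byref(config)`.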

Python Bindings

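# python/quantized_ops.py
import ctypes
import torch

class QuantizedGEMM:
    def __init__(self, lib_path='lib/libquantized_kernels.so'):
        # ctypes loads shared libraries only; a static .a archive will not work.
        self.lib = ctypes.CDLL(lib_path)

        # Assumes the library exports a C-linkage (extern "C") symbol named
        # quantized_gemm with this flat argument list. The C++ function that
        # takes a GemmConfig& is name-mangled and cannot be called this way.
        self.lib.quantized_gemm.argtypes = [
            ctypes.c_int, ctypes.c_int, ctypes.c_int,  # m, n, k
            ctypes.c_void_p,  # a: const int8_t*
            ctypes.c_void_p,  # b: const int8_t*
            ctypes.c_void_p,  # c: int32_t*
            ctypes.c_void_p,  # scale_a: const float*
            ctypes.c_void_p,  # scale_b: const float*
            ctypes.c_float    # output_scale
        ]
        self.lib.quantized_gemm.restype = ctypes.c_int  # cudaError_t

    def forward(self, a: torch.Tensor, b: torch.Tensor,
                scale_a: torch.Tensor, scale_b: torch.Tensor,
                output_scale: float) -> torch.Tensor:
        """Execute quantized GEMM with automatic tensor management."""
        assert a.is_cuda and b.is_cuda
        assert a.dtype == torch.int8 and b.dtype == torch.int8
        assert scale_a.dtype == torch.float32 and scale_b.dtype == torch.float32

        m, k = a.shape
        k2, n = b.shape
        assert k == k2

        # The kernel receives raw pointers, so every buffer must be contiguous.
        a, b = a.contiguous(), b.contiguous()
        scale_a, scale_b = scale_a.contiguous(), scale_b.contiguous()

        # Allocate the output on the same device as the inputs.
        c = torch.zeros((m, n), dtype=torch.int32, device=a.device)

        # data_ptr() takes no arguments and returns the address as an int.
        status = self.lib.quantized_gemm(
            m, n, k,
            a.data_ptr(), b.data_ptr(), c.data_ptr(),
            scale_a.data_ptr(), scale_b.data_ptr(),
            ctypes.c_float(output_scale)
        )
        if status != 0:  # cudaSuccess is 0
            raise RuntimeError(f"quantized_gemm failed with CUDA error {status}")

        return c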
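
The default `lib_path` in the bindings is relative to the current working directory, so it only resolves when the process starts from the repository root. A deployed package usually locates the shared library relative to the installed module instead. The following is a minimal sketch of that lookup. The environment variable `QK_LIBRARY_PATH` and the function `find_library` are hypothetical names introduced for illustration, and the search order is an assumption.

```python
import os
from pathlib import Path

ENV_VAR = "QK_LIBRARY_PATH"  # hypothetical override, not defined by the library


def find_library(package_dir: Path, name: str = "libquantized_kernels.so") -> Path:
    """Return the first existing candidate path for the shared library."""
    candidates = []

    # 1. An explicit override lets operators point at a custom build.
    override = os.environ.get(ENV_VAR)
    if override:
        candidates.append(Path(override))

    # 2. Installed layout: the library sits next to the Python module.
    candidates.append(package_dir / name)

    # 3. Source-tree layout: python/ and lib/ are sibling directories.
    candidates.append(package_dir.parent / "lib" / name)

    for path in candidates:
        if path.is_file():
            return path

    searched = ", ".join(str(p) for p in candidates)
    raise FileNotFoundError(f"{name} not found; searched: {searched}")
```

Inside the bindings module this could be used as `ctypes.CDLL(str(find_library(Path(__file__).resolve().parent)))`, so the import fails early with a message listing every path that was tried.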

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
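
Recording the environment by hand is easy to forget, so it can help to capture it programmatically alongside the observed output. The sketch below uses only the Python standard library. The function names and the choice of `torch` as the package to report are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(packages=("torch",)) -> dict:
    """Collect the details worth saving next to a verification run."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
    }


if __name__ == "__main__":
    # Print as JSON so the record can be pasted into notes or a log file.
    print(json.dumps(environment_record(), indent=2))
```

GPU model, driver version, and CUDA runtime version are not covered here and would need to be added from whichever tooling is available on the machine.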

EXERCISE

Create a deployable Python package with your quantized kernels, including automated tests and documentation generation.
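
For the automated tests, a slow pure-Python reference can act as an oracle for the compiled `quantize` and `dequantize` functions. The sketch below covers only the `Symmetric` mode with a single per-tensor scale. It assumes an int8 range of [-127, 127] and round-half-to-even, and the library's actual rounding and clamping behavior may differ, so treat any mismatch as something to investigate rather than a definite bug.

```python
from typing import List, Sequence

QMAX = 127  # assumed symmetric int8 range [-127, 127]; -128 is left unused


def quantize_symmetric(values: Sequence[float], scale: float) -> List[int]:
    """Map floats to int8 codes using one per-tensor scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    quantized = []
    for x in values:
        q = round(x / scale)  # Python's round() sends exact halves to even
        quantized.append(max(-QMAX, min(QMAX, q)))  # clamp out-of-range values
    return quantized


def dequantize_symmetric(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate floats from int8 codes."""
    return [q * scale for q in quantized]
```

For values inside the representable range, the round-trip error is at most half the scale, which gives the test suite a concrete tolerance to assert against.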
