RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom Quantization and Kernels

17. Integration with Runtimes

Chapter 17 of 18 · 15 min
KEY INSIGHT

Runtime integration requires careful memory management. Use preallocated buffers and avoid tensor copies in hot paths.

Custom kernels reach production through each runtime's extension interface: custom operator libraries in ONNX Runtime, and C++/CUDA extensions in PyTorch. Both are covered below.
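
To make the key insight concrete, here is a minimal sketch of a preallocated output buffer using NumPy. The class name `GemmWorkspace` and its methods are hypothetical and used only for illustration. The sketch assumes the output shape is known before the hot path starts.

```python
import numpy as np


class GemmWorkspace:
    """Owns a reusable output buffer so the hot path allocates nothing."""

    def __init__(self, m: int, n: int):
        # Allocated once, outside the hot path.
        self.out = np.empty((m, n), dtype=np.int32)

    def run(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # out= writes the product into the existing buffer instead of
        # allocating a new array on every call.
        np.matmul(a, b, out=self.out)
        return self.out
```

Every call returns the same array object, so a caller that needs to keep a result across calls has to copy it explicitly. That is the trade-off of reusing buffers.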

ONNX Runtime Integration

ONNX Runtime uses execution providers and custom operators for extensibility:

import ctypes

import onnxruntime as ort
from onnxruntime.capi import _ld_preload  # Preload CUDA libraries

# Sketch of the op/kernel split. ONNX Runtime's actual custom-op interface
# (OrtCustomOp) is a C/C++ API; these Python classes only mirror its shape.
class QuantizedOp:
    def __init__(self, kernel_library_path):
        self.kernel_lib = kernel_library_path

    def create_kernel(self, session_options, provider_options):
        return QuantizedKernel(self.kernel_lib)

class QuantizedKernel:
    def __init__(self, lib_path):
        # Load the shared library that exports launch_quantized_gemm
        self.lib = ctypes.CDLL(lib_path)

    def compute(self, args, outputs):
        # args: list of numpy arrays (must be C-contiguous)
        # outputs: list of numpy arrays (preallocated)
        a_int8, b_int8, scale_a, scale_b, out_scale = args[:5]
        c_int32 = outputs[0]

        # Launch kernel through the loaded library
        self.lib.launch_quantized_gemm(
            a_int8.ctypes.data_as(ctypes.POINTER(ctypes.c_int8)),
            b_int8.ctypes.data_as(ctypes.POINTER(ctypes.c_int8)),
            c_int32.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
            scale_a.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            scale_b.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            ctypes.c_float(out_scale.item()),  # scalar, passed by value
        )

# Register the compiled custom-op library with the session
ort_session_options = ort.SessionOptions()
ort_session_options.register_custom_ops_library("lib/quantized_ops.so")

# Usage
session = ort.InferenceSession("model.onnx", sess_options=ort_session_options,
                               providers=['CUDAExecutionProvider'])
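
Passing raw pointers through `ctypes` skips every check NumPy would normally do, so a wrong dtype or a non-contiguous view reads the wrong memory without raising. The helper below is a minimal sketch of a guard that could run before the kernel launch. The name `as_int8_pointer` is hypothetical and not part of ONNX Runtime or NumPy.

```python
import ctypes

import numpy as np


def as_int8_pointer(arr: np.ndarray):
    """Return a ctypes int8 pointer to arr, or raise if the layout is unsafe."""
    if arr.dtype != np.int8:
        raise TypeError(f"expected int8, got {arr.dtype}")
    if not arr.flags["C_CONTIGUOUS"]:
        # A transposed or sliced view has strides the C kernel knows nothing about.
        raise ValueError("array must be C-contiguous; call np.ascontiguousarray first")
    return arr.ctypes.data_as(ctypes.POINTER(ctypes.c_int8))
```

Raising here, rather than silently calling `np.ascontiguousarray`, keeps hidden copies out of the hot path. The caller decides where the copy happens.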

TorchScript Integration

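import torch
from torch.utils.cpp_extension import load_inline

quantized_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

// Launcher defined in the CUDA source passed to load_inline below;
// the signature is inferred from the call in quantized_gemm.
void quantized_gemm_kernel(
    const int8_t* a, const int8_t* b, int32_t* c,
    const float* scale_a, const float* scale_b,
    float output_scale);

torch::Tensor quantized_gemm(
    torch::Tensor a, torch::Tensor b,
    torch::Tensor scale_a, torch::Tensor scale_b,
    float output_scale) {

    // int32 accumulator on the same device as the inputs
    auto c = torch::zeros({a.size(0), b.size(1)},
                          a.options().dtype(torch::kInt32));

    quantized_gemm_kernel(
        a.data_ptr<int8_t>(),
        b.data_ptr<int8_t>(),
        c.data_ptr<int32_t>(),
        scale_a.data_ptr<float>(),
        scale_b.data_ptr<float>(),
        output_scale
    );

    return c;
}
"""

# cuda_kernel_source must be a string holding the CUDA code that defines
# quantized_gemm_kernel (not shown in this chapter).
quantized_module = load_inline(
    name='quantized_ops',
    cpp_sources=quantized_source,
    cuda_sources=cuda_kernel_source,
    functions=['quantized_gemm'],
    verbose=True
)

# Wrap in an nn.Module. Note: load_inline binds quantized_gemm through
# pybind11, which torch.jit.script cannot compile. To script the model,
# register the op with TORCH_LIBRARY and call it through torch.ops instead.
class QuantizedGEMMWrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.module = quantized_module

    def forward(self, a, b, scale_a, scale_b, output_scale):
        return self.module.quantized_gemm(a, b, scale_a, scale_b, output_scale)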
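
Whichever runtime hosts the kernel, its int32 output can be checked against a slow CPU reference. The sketch below uses NumPy. The function names are hypothetical, and the dequantization step assumes a per-tensor convention of real value = quantized value × scale. The chapter does not state which convention the kernels use, so adjust it to match yours.

```python
import numpy as np


def reference_quantized_gemm(a_int8: np.ndarray, b_int8: np.ndarray) -> np.ndarray:
    """int8 x int8 -> int32 accumulator, computed on the CPU."""
    # Widen before multiplying so products and sums cannot overflow int8.
    return a_int8.astype(np.int32) @ b_int8.astype(np.int32)


def dequantize(c_int32: np.ndarray, scale_a: float, scale_b: float) -> np.ndarray:
    # Assumed per-tensor convention: real = quantized * scale.
    return c_int32.astype(np.float32) * np.float32(scale_a * scale_b)
```

Integer accumulation is exact, so `np.array_equal(kernel_output, reference_quantized_gemm(a, b))` should hold bit for bit. Only the dequantized float values need a tolerance.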

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
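
One way to capture those details consistently is a small script that runs next to the exercise. This is a sketch using only the standard library. The default package list is an assumption about what a reader of this chapter has installed, and missing packages are recorded as `None` rather than raising.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages=("onnxruntime", "onnxruntime-gpu", "torch", "numpy")):
    """Collect interpreter, platform, and package versions into one dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    print(json.dumps(describe_environment(), indent=2))
```

Saving the JSON output beside the observed results makes it easier to explain later why two runs differ.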

EXERCISE

Integrate your custom quantized kernels into ONNX Runtime and benchmark against standard operators on a transformer model.
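
For the benchmarking half of the exercise, a minimal timing harness might look like the sketch below. The names `benchmark` and `summarize` are hypothetical. The harness assumes `fn` blocks until its result is ready, so asynchronously launched GPU work needs an explicit synchronization inside `fn`.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=20):
    """Return per-call latencies in milliseconds for fn()."""
    for _ in range(warmup):
        fn()  # absorb lazy initialization and cold caches
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples):
    """Reduce raw samples to the numbers worth reporting."""
    return {
        "runs": len(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }
```

A call such as `summarize(benchmark(lambda: session.run(None, feeds)))`, where `feeds` is your own input dict, can then be run once with the custom-op model and once with the standard-operator model. The median is less sensitive to stray slow runs than the mean.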
