RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom Quantization and Kernels

05. GGUF Format Deep Dive

Chapter 5 of 18 · 20 min
KEY INSIGHT

GGUF's explicit type system and self-describing tensor structure enable inference engines to correctly interpret any supported quantization format without external configuration files.

GGUF is the binary container format used by llama.cpp and other GGML-based inference engines, and it has become the de facto standard for distributing local large language models. A single file holds the quantized weights together with the metadata an engine needs to run them.

The file is laid out sequentially: a fixed header, a list of metadata key-value pairs, a list of tensor descriptors, and finally the aligned tensor data. Every metadata value and every tensor carries an explicit type tag, so a reader never has to guess how to interpret the bytes that follow. The header also carries a version number, which lets readers detect format revisions and reject files they do not support.
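As a minimal sketch of the header described above, the function below reads the fixed-size prefix of a file, assuming the version 3 layout (32-bit magic and version, then 64-bit tensor and key-value counts, all little-endian). The name `read_gguf_header` is hypothetical and is not part of any library.

```python
import io
import struct

GGUF_MAGIC = 0x46554747  # the bytes "GGUF" read as a little-endian uint32


def read_gguf_header(stream):
    """Return (version, tensor_count, kv_count) from a binary stream.

    Assumes the version 3 layout: uint32 magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count.
    """
    fixed = stream.read(24)  # 4 + 4 + 8 + 8 bytes
    if len(fixed) < 24:
        raise ValueError("stream too short to hold a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack('<IIQQ', fixed)
    if magic != GGUF_MAGIC:
        raise ValueError("magic number mismatch: not a GGUF file")
    return version, tensor_count, kv_count


# Usage: parse a hand-built header instead of a real model file.
header = struct.pack('<IIQQ', GGUF_MAGIC, 3, 2, 5)
print(read_gguf_header(io.BytesIO(header)))  # (3, 2, 5)
```

Checking the magic number before anything else gives a cheap early failure when the file is not GGUF at all.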

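import struct
import numpy as np

class GGUFWriter:
    """Write models in GGUF format (version 3, little-endian)."""
    
    MAGIC = 0x46554747  # the bytes "GGUF" read as a little-endian uint32
    VERSION = 3
    ALIGNMENT = 32  # default; a file can override it via "general.alignment"
    
    # Tensor type IDs. These follow the ggml_type enum; IDs 4 and 5 are unused.
    DTYPE_F32 = 0
    DTYPE_F16 = 1
    DTYPE_Q4_0 = 2
    DTYPE_Q4_1 = 3
    DTYPE_Q5_0 = 6
    DTYPE_Q5_1 = 7
    DTYPE_Q8_0 = 8
    DTYPE_Q8_1 = 9
    DTYPE_Q2_K = 10
    DTYPE_Q3_K = 11
    DTYPE_Q4_K = 12
    DTYPE_Q5_K = 13
    DTYPE_Q6_K = 14
    DTYPE_Q8_K = 15
    DTYPE_I8 = 24
    DTYPE_I16 = 25
    DTYPE_I32 = 26
    DTYPE_BF16 = 30
    
    def __init__(self, path):
        self.path = path
        self.metadata = {}
        self.tensors = []
    
    def add_key_value(self, key, value_type, value):
        """Add a key-value metadata pair."""
        self.metadata[key] = (value_type, value)
    
    def add_tensor(self, name, data, tensor_type, shape=None):
        """Add a tensor whose bytes are already encoded as tensor_type.

        For quantized data, pass the logical shape explicitly; otherwise
        the shape of the NumPy array is used.
        """
        self.tensors.append({
            'name': name,
            'data': data,
            'type': tensor_type,
            'shape': tuple(shape if shape is not None else data.shape),
        })
    
    def _pad(self, f):
        """Zero-pad the file up to the next alignment boundary."""
        f.write(b'\x00' * (-f.tell() % self.ALIGNMENT))
    
    def write(self):
        """Write GGUF file to disk."""
        with open(self.path, 'wb') as f:
            # Header: magic, version, then 64-bit counts
            f.write(struct.pack('<I', self.MAGIC))
            f.write(struct.pack('<I', self.VERSION))
            f.write(struct.pack('<Q', len(self.tensors)))
            f.write(struct.pack('<Q', len(self.metadata)))
            
            # Metadata section
            for key, (value_type, value) in self.metadata.items():
                self._write_string(f, key)
                f.write(struct.pack('<I', value_type))
                self._write_value(f, value_type, value)
            
            # Tensor descriptors: name, rank, dimensions, type, data offset
            offset = 0  # relative to the start of the tensor data section
            for tensor in self.tensors:
                self._write_string(f, tensor['name'])
                f.write(struct.pack('<I', len(tensor['shape'])))
                # ggml order: fastest-varying dimension first (NumPy shape reversed)
                for dim in reversed(tensor['shape']):
                    f.write(struct.pack('<Q', dim))
                f.write(struct.pack('<I', tensor['type']))
                f.write(struct.pack('<Q', offset))
                nbytes = tensor['data'].nbytes
                offset += nbytes + (-nbytes % self.ALIGNMENT)
            
            # Tensor data: section start and every tensor are aligned
            self._pad(f)
            for tensor in self.tensors:
                f.write(tensor['data'].tobytes())  # raw bytes, no dtype cast
                self._pad(f)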

GGUF distinguishes between metadata and tensor storage, separating configuration from weights. Metadata includes architecture type, quantization parameters, vocabulary information, and model dimensions. Tensor data stores weights with explicit shapes and quantization types, making the format self-describing.
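Because each tensor records both its shape and its type, a reader can work out how many bytes the tensor occupies without any side information. The sketch below illustrates that arithmetic for a handful of types. The block sizes and per-block byte counts in `TYPE_LAYOUT` reflect my understanding of the ggml block structs and should be checked against the ggml version in use; `tensor_nbytes` and `align_up` are hypothetical helper names.

```python
import math

# type name -> (weights per block, bytes per block); assumed values, verify against ggml
TYPE_LAYOUT = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q8_0": (32, 34),    # 32 int8 weights + one float16 scale
    "Q4_0": (32, 18),    # 32 4-bit weights + one float16 scale
    "Q4_K": (256, 144),
    "Q6_K": (256, 210),
}


def tensor_nbytes(shape, type_name):
    """Bytes needed to store a tensor of the given NumPy-style shape and type."""
    block_size, block_bytes = TYPE_LAYOUT[type_name]
    if shape[-1] % block_size != 0:
        raise ValueError("row length must be a multiple of the block size")
    return math.prod(shape) // block_size * block_bytes


def align_up(offset, alignment=32):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


# Usage: a 4096 x 4096 weight matrix in three encodings.
for name in ("F16", "Q8_0", "Q4_0"):
    print(name, tensor_nbytes((4096, 4096), name))
# F16 33554432
# Q8_0 17825792
# Q4_0 9437184
```

The same numbers also show the storage cost of the per-block scale: Q4_0 spends 18 bytes on 32 weights, which is 4.5 bits per weight, not 4.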

Each tensor declares its quantization type explicitly, enabling inference engines to select appropriate dequantization kernels. This explicit typing eliminates ambiguities that plagued earlier formats and simplifies the implementation of custom quantization schemes.
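Type-driven kernel selection can be sketched as a lookup table from tensor type to dequantization function. The example below assumes the Q8_0 layout of one float16 scale followed by 32 int8 weights per block. It uses type names as keys to stay independent of any numeric enum, and every function name here is hypothetical. The small encoder exists only so the round trip can be demonstrated.

```python
import numpy as np

# Assumed Q8_0 block layout: float16 scale, then 32 int8 weights (34 bytes)
Q8_0_BLOCK = np.dtype([('d', '<f2'), ('qs', 'i1', (32,))])


def quantize_q8_0(weights):
    """Encode float32 weights (length divisible by 32) as Q8_0-style bytes."""
    blocks = np.asarray(weights, dtype=np.float32).reshape(-1, 32)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    inv = np.divide(1.0, d, out=np.zeros_like(d), where=d > 0)
    out = np.zeros(len(blocks), dtype=Q8_0_BLOCK)
    out['d'] = d.reshape(-1).astype(np.float16)
    out['qs'] = np.round(blocks * inv).astype(np.int8)
    return out.tobytes()


def dequantize_q8_0(raw):
    """Decode Q8_0-style bytes: each weight is scale * int8 value."""
    blocks = np.frombuffer(raw, dtype=Q8_0_BLOCK)
    scales = blocks['d'].astype(np.float32)[:, None]
    return (scales * blocks['qs'].astype(np.float32)).reshape(-1)


def dequantize_f16(raw):
    return np.frombuffer(raw, dtype='<f2').astype(np.float32)


def dequantize_f32(raw):
    return np.frombuffer(raw, dtype='<f4').copy()


# The tensor's declared type selects the kernel; no guessing from file size.
DEQUANTIZERS = {
    "F32": dequantize_f32,
    "F16": dequantize_f16,
    "Q8_0": dequantize_q8_0,
}


def dequantize(type_name, raw):
    if type_name not in DEQUANTIZERS:
        raise ValueError(f"no dequantization kernel for type {type_name}")
    return DEQUANTIZERS[type_name](raw)


# Usage: round-trip 64 random weights through the Q8_0-style encoding.
rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
restored = dequantize("Q8_0", quantize_q8_0(w))
print(float(np.abs(w - restored).max()))  # small reconstruction error
```

Supporting a custom scheme then amounts to registering one more entry in the table, provided the file declares a type the reader recognizes.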

    def _write_string(self, file, s):
        """Write a UTF-8 string prefixed with its byte length as a uint64."""
        encoded = s.encode('utf-8')
        file.write(struct.pack('<Q', len(encoded)))
        file.write(encoded)
    
    def _write_value(self, file, value_type, value):
        """Write a scalar or string metadata value (arrays are not handled)."""
        # IDs follow the GGUF metadata value type enum.
        if value_type == 8:  # string
            self._write_string(file, value)
        elif value_type == 4:  # uint32
            file.write(struct.pack('<I', value))
        elif value_type == 5:  # int32
            file.write(struct.pack('<i', value))
        elif value_type == 6:  # float32
            file.write(struct.pack('<f', value))
        elif value_type == 7:  # bool
            file.write(struct.pack('<?', value))
        elif value_type == 10:  # uint64
            file.write(struct.pack('<Q', value))
        else:
            raise ValueError(f"unsupported metadata value type: {value_type}")

Quantization parameters can exist at several levels of granularity. In the built-in GGML types, scales are stored inline with the weights at block granularity: Q8_0, for example, packs 32 int8 values together with one float16 scale. Coarser per-tensor or per-channel scales are not part of those tensor types, so a scheme that needs them has to carry them alongside the weights, for instance as additional metadata keys or auxiliary tensors. The key-value section is open-ended, which is what lets custom quantization schemes extend the core specification.
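To make the granularity trade-off concrete, the sketch below applies the same symmetric int8 quantization with one scale per tensor, one per row (channel), and one per 32-weight block, then compares the reconstruction error. The helper names and the synthetic weight matrix are illustrative assumptions, not part of the GGUF specification.

```python
import numpy as np


def quantize_int8(weights, axis=None):
    """Symmetric int8 quantization.

    axis=None uses one scale for the whole array; axis=1 uses one
    scale per row of a 2-D array.
    """
    amax = np.abs(weights).max(axis=axis, keepdims=True)
    scale = np.where(amax > 0, amax / 127.0, 1.0)
    q = np.round(weights / scale).astype(np.int8)
    return q, scale


def rms_error(weights, q, scale):
    """Root-mean-square difference between weights and their reconstruction."""
    return float(np.sqrt(np.mean((weights - q * scale) ** 2)))


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)
w[0] *= 50.0  # one channel with much larger weights than the rest

q_t, s_t = quantize_int8(w)               # per-tensor: a single scale
q_c, s_c = quantize_int8(w, axis=1)       # per-channel: one scale per row
blocks = w.reshape(-1, 32)
q_b, s_b = quantize_int8(blocks, axis=1)  # per-block: one scale per 32 weights

print("per-tensor :", rms_error(w, q_t, s_t))
print("per-channel:", rms_error(w, q_c, s_c))
print("per-block  :", rms_error(blocks, q_b, s_b))
```

With a single outlier channel, the per-tensor scale is dictated by that channel and the remaining rows lose most of their resolution. Finer granularity avoids this at the cost of storing more scales.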

EXERCISE

Implement a GGUF metadata parser that reads the header, extracts all key-value pairs, and lists tensor names with their shapes and quantization types. Verify compatibility with a real quantized model file if available.
