GGUF Format Deep Dive — Custom Quantization and Kernels (Chapter 5)

GGUF (Generic Graph Unified Format) serves as the standard container format for local large language models, providing a thorough specification for storing quantized models with all necessary metadata for inference engines.

The format organizes data into key-value pairs and tensors stored in a hierarchical structure. Type safety is guaranteed through explicit type tags, preventing the misinterpretation of data by inference engines. The specification supports versioning that enables backward compatibility as the format evolves.

import struct
import numpy as np

class GGUFWriter:
    """Write models in GGUF format."""
    
    MAGIC = 0x46554747  # "GGUF" in little-endian
    SUPPORTED_VERSIONS = [(3, 0)]
    
    # Type enumeration
    DTYPE_UINT8 = 0
    DTYPE_INT8 = 1
    DTYPE_INT16 = 2
    DTYPE_INT32 = 3
    DTYPE_F32 = 4
    DTYPE_F16 = 5
    DTYPE_BF16 = 6
    DTYPE_Q4_0 = 7
    DTYPE_Q4_1 = 8
    DTYPE_Q5_0 = 9
    DTYPE_Q5_1 = 10
    DTYPE_Q8_0 = 11
    DTYPE_Q2_K = 12
    DTYPE_Q3_K = 13
    DTYPE_Q4_K = 14
    DTYPE_Q5_K = 15
    DTYPE_Q6_K = 16
    DTYPE_Q8_1 = 17
    
    def __init__(self, path):
        self.path = path
        self.metadata = {}
        self.tensors = []
    
    def add_key_value(self, key, value_type, value):
        """Add a key-value metadata pair."""
        self.metadata[key] = (value_type, value)
    
    def add_tensor(self, name, data, tensor_type):
        """Add a tensor with its quantization type."""
        self.tensors.append({
            'name': name,
            'data': data,
            'type': tensor_type,
            'shape': np.array(data.shape),
            'n_elements': np.prod(data.shape),
            'offloads': []
        })
    
    def write(self):
        """Write GGUF file to disk."""
        with open(self.path, 'wb') as f:
            # Magic number and version
            f.write(struct.pack('<I', self.MAGIC))
            f.write(struct.pack('<I', 3))  # version 3
            f.write(struct.pack('<I', 3))  # tensor count
            f.write(struct.pack('<I', len(self.metadata)))
            
            # Metadata section
            for key, (value_type, value) in self.metadata.items():
                self._write_string(f, key)
                f.write(struct.pack('<I', value_type))
                self._write_value(f, value_type, value)
            
            # Tensor data (padded to 32-byte alignment)
            tensor_data_start = f.tell()
            
            for tensor in self.tensors:
                data_bytes = tensor['data'].astype(np.int8).tobytes()
                padded_size = ((len(data_bytes) + 31) // 32) * 32
                f.write(data_bytes)
                f.write(b'\x00' * (padded_size - len(data_bytes)))

GGUF distinguishes between metadata and tensor storage, separating configuration from weights. Metadata includes architecture type, quantization parameters, vocabulary information, and model dimensions. Tensor data stores weights with explicit shapes and quantization types, making the format self-describing.

Each tensor declares its quantization type explicitly, enabling inference engines to select appropriate dequantization kernels. This explicit typing eliminates ambiguities that plagued earlier formats and simplifies the implementation of custom quantization schemes.

    def _write_string(self, file, s):
        """Write a length-prefixed string."""
        encoded = s.encode('utf-8')
        file.write(struct.pack('<Q', len(encoded)))
        file.write(encoded)
    
    def _write_value(self, file, value_type, value):
        """Write a typed value."""
        if value_type == 4:  # string
            self._write_string(file, value)
        elif value_type == 8:  # uint32
            file.write(struct.pack('<I', value))
        elif value_type == 5:  # int32
            file.write(struct.pack('<i', value))
        elif value_type == 10:  # float32
            file.write(struct.pack('<f', value))
        elif value_type == 11:  # bool
            file.write(struct.pack('<?', value))
        elif value_type == 6:  # uint64
            file.write(struct.pack('<Q', value))

Quantization parameters exist at multiple granularity levels. Per-tensor scales apply to entire weight matrices if the model uses uniform quantization. Per-channel scales attach to dimension-specific metadata, following the model architecture's channel definitions. The format accommodates arbitrary quantization schemes through metadata extending the core specification.