05. GGUF Format Deep Dive
GGUF (commonly expanded as GPT-Generated Unified Format) is the container format used by llama.cpp and other ggml-based inference engines for local large language models. A single file stores the model's tensors, typically quantized, together with the metadata an engine needs to run inference.
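The format is a flat sequence of sections: a fixed header, a list of typed key-value metadata pairs, a list of tensor descriptors, and an aligned block of tensor data. Hierarchy exists only by convention, through dotted key names such as general.architecture. Every metadata value is preceded by an explicit type tag, so a reader always knows how many bytes to consume and how to interpret them. The header also carries a version number, which lets readers recognize older files and reject versions they do not understand as the format evolves.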
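As a quick illustration of the header, here is a minimal sketch of a reader, assuming the version 3 layout: the four bytes "GGUF", a uint32 version, then a uint64 tensor count and a uint64 metadata pair count, all little-endian. The names read_gguf_header and GGUF_MAGIC are made up for this example and do not come from any library.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(stream):
    """Return (version, tensor_count, metadata_kv_count) from a binary stream."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # uint32 version followed by two uint64 counts, little-endian, no padding
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return version, tensor_count, kv_count


# Build a header in memory: version 3, 2 tensors, 5 metadata pairs.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(io.BytesIO(sample)))  # (3, 2, 5)
```

To inspect a real file, the same function can be given a handle from open(path, "rb").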
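import struct
import numpy as np


class GGUFWriter:
    """Write models in GGUF format."""

    MAGIC = 0x46554747  # the bytes b"GGUF" read as a little-endian uint32
    SUPPORTED_VERSIONS = [3]  # the version field is a single uint32

    # Tensor type enumeration (values follow ggml's ggml_type enum;
    # ggml defines no unsigned 8-bit tensor type)
    DTYPE_F32 = 0
    DTYPE_F16 = 1
    DTYPE_Q4_0 = 2
    DTYPE_Q4_1 = 3
    DTYPE_Q5_0 = 6
    DTYPE_Q5_1 = 7
    DTYPE_Q8_0 = 8
    DTYPE_Q8_1 = 9
    DTYPE_Q2_K = 10
    DTYPE_Q3_K = 11
    DTYPE_Q4_K = 12
    DTYPE_Q5_K = 13
    DTYPE_Q6_K = 14
    DTYPE_INT8 = 24
    DTYPE_INT16 = 25
    DTYPE_INT32 = 26
    DTYPE_BF16 = 30

    def __init__(self, path):
        self.path = path
        self.metadata = {}
        self.tensors = []

    def add_key_value(self, key, value_type, value):
        """Add a key-value metadata pair."""
        self.metadata[key] = (value_type, value)

    def add_tensor(self, name, data, tensor_type):
        """Add a tensor with its quantization type."""
        self.tensors.append({
            'name': name,
            'data': data,
            'type': tensor_type,
            'shape': np.array(data.shape),
            'n_elements': int(np.prod(data.shape)),
            'offloads': []
        })

    def write(self):
        """Write GGUF file to disk."""
        with open(self.path, 'wb') as f:
            # Header: magic, version, tensor count, metadata pair count
            f.write(struct.pack('<I', self.MAGIC))
            f.write(struct.pack('<I', 3))  # version 3
            f.write(struct.pack('<Q', len(self.tensors)))   # uint64
            f.write(struct.pack('<Q', len(self.metadata)))  # uint64
            # Metadata section
            for key, (value_type, value) in self.metadata.items():
                self._write_string(f, key)
                f.write(struct.pack('<I', value_type))
                self._write_value(f, value_type, value)
            # Tensor info section: name, rank, dimensions, type, data offset
            offset = 0
            for tensor in self.tensors:
                # Bytes are written as-is, so the array must already be
                # encoded in the layout that tensor['type'] declares.
                tensor['bytes'] = tensor['data'].tobytes()
                self._write_string(f, tensor['name'])
                f.write(struct.pack('<I', len(tensor['shape'])))
                # GGUF lists dimensions innermost-first (reverse of NumPy)
                for dim in tensor['shape'][::-1]:
                    f.write(struct.pack('<Q', int(dim)))
                f.write(struct.pack('<I', tensor['type']))
                # Offset is relative to the start of the tensor data section
                f.write(struct.pack('<Q', offset))
                offset += ((len(tensor['bytes']) + 31) // 32) * 32
            # Pad so the tensor data section starts on a 32-byte boundary
            f.write(b'\x00' * (-f.tell() % 32))
            # Tensor data (each tensor padded to 32-byte alignment)
            for tensor in self.tensors:
                data_bytes = tensor['bytes']
                padded_size = ((len(data_bytes) + 31) // 32) * 32
                f.write(data_bytes)
                f.write(b'\x00' * (padded_size - len(data_bytes)))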
GGUF distinguishes between metadata and tensor storage, separating configuration from weights. Metadata includes architecture type, quantization parameters, vocabulary information, and model dimensions. Tensor data stores weights with explicit shapes and quantization types, making the format self-describing.
Each tensor declares its quantization type explicitly, enabling inference engines to select appropriate dequantization kernels. This explicit typing eliminates ambiguities that plagued earlier formats and simplifies the implementation of custom quantization schemes.
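Because the type is explicit, a reader can compute how many bytes a tensor occupies from its descriptor alone. The sketch below assumes the ggml type ids and block layouts for four common types (for example, a Q4_0 block packs 32 weights into 18 bytes: one 16-bit scale plus 16 bytes of 4-bit values). The names BLOCK_LAYOUT and tensor_nbytes are hypothetical, and the shape is taken in NumPy order with the row length last.

```python
# ggml type id -> (name, weights per block, bytes per block)
BLOCK_LAYOUT = {
    0: ("F32", 1, 4),
    1: ("F16", 1, 2),
    2: ("Q4_0", 32, 18),  # fp16 scale + 32 x 4-bit values
    8: ("Q8_0", 32, 34),  # fp16 scale + 32 x int8 values
}


def tensor_nbytes(type_id, shape):
    """Bytes needed to store a tensor of the given type and shape."""
    name, block_size, block_bytes = BLOCK_LAYOUT[type_id]
    # Blocks never straddle rows, so the row length must be a whole
    # number of blocks.
    if shape[-1] % block_size != 0:
        raise ValueError(f"row length {shape[-1]} is not a multiple of "
                         f"{block_size} for {name}")
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements // block_size * block_bytes


shape = (4096, 4096)
print(tensor_nbytes(0, shape))  # F32:  67108864
print(tensor_nbytes(8, shape))  # Q8_0: 17825792
print(tensor_nbytes(2, shape))  # Q4_0: 9437184
```

The same lookup is what lets an engine pick a dequantization kernel: the type id selects both the block layout and the routine that decodes it.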
    # GGUFWriter methods, continued

    def _write_string(self, file, s):
        """Write a length-prefixed string."""
        encoded = s.encode('utf-8')
        file.write(struct.pack('<Q', len(encoded)))  # uint64 byte length
        file.write(encoded)

    def _write_value(self, file, value_type, value):
        """Write a typed value (ids follow the GGUF metadata value types)."""
        if value_type == 8:  # string
            self._write_string(file, value)
        elif value_type == 4:  # uint32
            file.write(struct.pack('<I', value))
        elif value_type == 5:  # int32
            file.write(struct.pack('<i', value))
        elif value_type == 6:  # float32
            file.write(struct.pack('<f', value))
        elif value_type == 7:  # bool
            file.write(struct.pack('<?', value))
        elif value_type == 10:  # uint64
            file.write(struct.pack('<Q', value))
        else:
            raise ValueError(f"unsupported metadata value type: {value_type}")
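The reading side mirrors the writer: read the key, read the uint32 type tag, then dispatch on the tag. The following is a minimal sketch, assuming the GGUF metadata value type ids (uint32 = 4, int32 = 5, float32 = 6, bool = 7, string = 8, array = 9, uint64 = 10, and so on). The helper names are invented for this example.

```python
import io
import struct

# GGUF metadata value type id -> struct format for fixed-size scalars
SCALAR_FORMATS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
TYPE_STRING = 8
TYPE_ARRAY = 9


def read_string(stream):
    """Read a uint64 length prefix followed by UTF-8 bytes."""
    (length,) = struct.unpack("<Q", stream.read(8))
    return stream.read(length).decode("utf-8")


def read_value(stream, value_type):
    """Read one value whose type tag has already been consumed."""
    if value_type == TYPE_STRING:
        return read_string(stream)
    if value_type == TYPE_ARRAY:
        # Arrays store a uint32 element type, a uint64 count, then the items
        (elem_type,) = struct.unpack("<I", stream.read(4))
        (count,) = struct.unpack("<Q", stream.read(8))
        return [read_value(stream, elem_type) for _ in range(count)]
    fmt = SCALAR_FORMATS[value_type]
    (value,) = struct.unpack(fmt, stream.read(struct.calcsize(fmt)))
    return value


def read_key_value(stream):
    """Read one metadata pair: key string, uint32 type tag, value."""
    key = read_string(stream)
    (value_type,) = struct.unpack("<I", stream.read(4))
    return key, read_value(stream, value_type)


# Encode one string-valued pair by hand, then read it back.
key, text = b"general.architecture", b"llama"
raw = (struct.pack("<Q", len(key)) + key + struct.pack("<I", TYPE_STRING)
       + struct.pack("<Q", len(text)) + text)
print(read_key_value(io.BytesIO(raw)))  # ('general.architecture', 'llama')
```

An unknown type tag raises KeyError here, which is a reasonable default for a sketch: a reader cannot skip a value whose size it does not know.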
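Quantization parameters can exist at several levels of granularity. For the built-in ggml quantization types, the scales are not metadata at all: they are stored inside the tensor data, one per block of weights. Q8_0, for example, groups weights into blocks of 32 and stores a 16-bit float scale alongside each block's 32 int8 values. File-level information, such as which quantization recipe produced the file, belongs in the key-value metadata. A scheme that needs per-tensor or per-channel scales can carry them in additional metadata keys or auxiliary tensors, though an inference engine must know about that scheme to use them.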
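To make block-level granularity concrete, here is a small NumPy sketch of a Q8_0-style round trip: 32 weights per block, one fp16 scale per block, int8 values. It illustrates the idea only; the rounding details and byte packing of the reference ggml implementation may differ, and the function names are made up for this example.

```python
import numpy as np

BLOCK = 32  # weights per block in a Q8_0-style layout


def quantize_q8_0(weights):
    """Return (fp16 scales, int8 quants) for a 1-D array of floats.

    The array length must be a multiple of BLOCK.
    """
    blocks = np.asarray(weights, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = amax / 127.0  # map the largest magnitude to +/-127
    safe = np.where(scales == 0, 1.0, scales)  # avoid 0/0 in all-zero blocks
    quants = np.round(blocks / safe).astype(np.int8)
    return scales.astype(np.float16), quants


def dequantize_q8_0(scales, quants):
    """Reconstruct float32 weights from per-block scales and int8 values."""
    return (scales.astype(np.float32) * quants.astype(np.float32)).reshape(-1)


weights = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
scales, quants = quantize_q8_0(weights)
restored = dequantize_q8_0(scales, quants)
print(scales.shape, quants.shape)          # (2, 1) (2, 32)
print(scales.nbytes + quants.nbytes)       # 68 bytes, i.e. 34 per block
print(float(np.abs(restored - weights).max()) < 0.01)  # True
```

Each block costs 34 bytes for 32 weights, about 8.5 bits per weight, which is where the storage overhead of per-block scales shows up.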
Implement a GGUF metadata parser that reads the header, extracts all key-value pairs, and lists tensor names with their shapes and quantization types. Verify compatibility with a real quantized model file if available.