GGUF
GGUF is the file format used by llama.cpp and its ecosystem (Ollama, KoboldCpp, LM Studio). A single file contains the quantized model weights, tokenizer, and metadata — no separate config files needed. Introduced in August 2023, it replaced the older GGML format.
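Per the GGUF specification, every file opens with a fixed little-endian header: the 4-byte magic `GGUF`, a uint32 format version, a uint64 tensor count, and a uint64 metadata key-value count. A minimal sketch of parsing that header (here against a synthetic in-memory header rather than a real model file):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header for demonstration: version 3, 2 tensors, 5 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))
```

The metadata key-value section that follows the header is what lets a single `.gguf` file carry the tokenizer and model hyperparameters alongside the weights.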
Quantization variants appear as filename suffixes: Q4_K_M, Q5_K_M, Q8_0, F16, etc. The K-quants (Q4_K_M, Q5_K_M) are mixed-precision — different tensors get different bit widths based on sensitivity. Q4_K_M, despite the name, averages roughly 4.83 bits per weight because some of the most sensitive tensors are kept at 6-bit precision.
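The average bits-per-weight figure translates directly into approximate file size. A rough back-of-the-envelope sketch (actual files run slightly larger because of metadata, the tokenizer, and per-block quantization scales):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate weight storage: parameters * bits per weight, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at Q4_K_M's ~4.83 bits/weight lands around 4.2 GB on disk.
print(round(gguf_size_gb(7e9, 4.83), 2))  # → 4.23
```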
GGUF is single-file, mmap-friendly (the OS pages model weights as needed instead of loading everything upfront), and runs on every platform llama.cpp supports — including phones and Raspberry Pis. The file extension is universal: model.Q4_K_M.gguf.
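The mmap behavior described above can be sketched in a few lines: mapping a file gives a byte view the OS pages in lazily, so touching the first bytes does not read the whole file from disk. The file below is a hypothetical stand-in, not a real model:

```python
import mmap
import os
import tempfile

# Create a dummy stand-in for a .gguf file: magic bytes plus filler payload.
path = os.path.join(tempfile.mkdtemp(), "model.Q4_K_M.gguf")
with open(path, "wb") as f:
    f.write(b"GGUF" + bytes(1024))

# Memory-map it read-only; pages are faulted in only as they are accessed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])  # → b'GGUF' — only this region is actually read
    mm.close()
```

This is why llama.cpp can start serving a multi-gigabyte model almost immediately: weights are paged in on first use rather than loaded upfront.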