How to quantize a model for llama.cpp
Prerequisites
A llama.cpp build that includes the llama-quantize binary, and a base model already converted to FP16 GGUF format.
What this does
Converts an FP16 (16-bit) GGUF model into a lower-bit quantized GGUF file, reducing file size and memory footprint at the cost of some accuracy. The resulting file is small enough to run efficiently on consumer hardware.
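To make the size reduction concrete, here is a rough back-of-the-envelope sketch. The helper name estimate_gguf_gb is hypothetical, and the 4.85 bits-per-weight figure for Q4_K_M is an approximation assumed for illustration. Real files also carry metadata and keep some tensors at higher precision, so actual sizes will differ somewhat.

```shell
#!/bin/sh
# Rough size estimate: parameters x bits-per-weight / 8 bits-per-byte.
# The result is in decimal gigabytes (10^9 bytes).

# estimate_gguf_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT  (hypothetical helper)
estimate_gguf_gb() {
    LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gguf_gb 7 16     # FP16 source at 16 bits per weight: prints 14.0
estimate_gguf_gb 7 4.85   # Q4_K_M at an assumed ~4.85 bits per weight: prints 4.2
```

Under these assumptions a 7B model shrinks from about 14 GB to a little over 4 GB, a reduction of roughly 3.3 times.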
Steps
- Identify the quantization type to use. Run the help command to list the supported types.
    ./llama-quantize --help
  Expected output: a list of available quantization types such as Q4_K_M, Q5_K_S, and Q8_0.
- Run the quantize binary on the FP16 model file. Specify the source file, the destination file, and the target quantization type.
    ./llama-quantize /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M
  Expected output: per-tensor log lines, followed by a summary comparing the original and quantized model sizes.
- Confirm that the output file was written and is smaller than the source. At Q4_K_M the new file should be roughly 3 to 4 times smaller than the FP16 source.
    ls -lh /path/to/model-Q4_K_M.gguf
  Expected output: a file size significantly smaller than the FP16 original, typically around 4 to 4.5 GB for a 7B model at Q4_K_M.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
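The record-keeping step can be sketched as a small wrapper that logs the exact command and its output. This is a minimal sketch, not part of llama.cpp: run_and_record and the log file name are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# run_and_record LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Appends a UTC timestamp, the exact command line, and the combined
# stdout/stderr of the command to LOGFILE, then returns the command's status.
run_and_record() {
    log=$1
    shift
    {
        printf '=== %s ===\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        printf 'command: %s\n' "$*"
        printf 'output:\n'
    } >> "$log"
    "$@" >> "$log" 2>&1
    rc=$?
    printf 'exit status: %s\n' "$rc" >> "$log"
    return "$rc"
}

# Example, using the placeholder paths from the steps above:
# run_and_record quantize-run.log ./llama-quantize \
#     /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M
```

To capture the version as well, the same wrapper can record the checkout's commit, for example by passing it git -C /path/to/llama.cpp rev-parse --short HEAD, assuming llama.cpp was built from a git checkout at that (placeholder) path.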
Verification
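./llama-quantize --dry-run /path/to/model-Q4_K_M.gguf
# Expected: dry-run output confirming quantization type and layer stats without error
# If this build's ./llama-quantize --help does not list --dry-run, load the file instead,
# for example: ./llama-cli -m /path/to/model-Q4_K_M.gguf -p "Hello" -n 16
# Expected: the model loads and generates a few tokens without error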
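As a lighter check that does not depend on any llama.cpp binary, the sketch below inspects the files directly. It relies on the fact that GGUF files begin with the four ASCII bytes GGUF. The helper names is_gguf and size_ratio are hypothetical. A passing magic check only shows that the header starts correctly, not that the whole file is intact.

```shell
#!/bin/sh
# is_gguf FILE: succeeds if FILE starts with the 4-byte magic "GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# size_ratio SOURCE QUANTIZED: prints how many times smaller QUANTIZED is
# than SOURCE, to one decimal place. Fails if QUANTIZED is empty.
size_ratio() {
    src=$(wc -c < "$1")
    out=$(wc -c < "$2")
    LC_ALL=C awk -v s="$src" -v o="$out" \
        'BEGIN { if (o + 0 == 0) exit 1; printf "%.1f\n", s / o }'
}

# Example, using the placeholder paths from the steps above:
# is_gguf /path/to/model-Q4_K_M.gguf && echo "header looks like GGUF"
# size_ratio /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf
```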
Common failures
- Unsupported quantization type selected: the model architecture does not support the chosen type. Use a less aggressive type such as Q8_0.
- Out of disk space during conversion: the conversion can require free space equal to the size of the input file. Free sufficient space or work from a directory with adequate storage.
- Model not in GGUF format: convert the raw model files to GGUF first using convert_hf_to_gguf.py (convert.py in older checkouts) in the llama.cpp repository before quantization.
- Quantized model produces garbled output: aggressive quantization (below Q4) may degrade output quality. Retry with Q5_K_S or Q4_K_M.
- Quantization hangs at a specific layer: interrupt and restart with a less aggressive quantization type.
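Two of the failures above can be caught before starting a long run. The following is a minimal pre-flight sketch, assuming a POSIX shell with df, awk, and a grep that supports -w. The helper names type_listed and has_free_space are hypothetical.

```shell
#!/bin/sh
# type_listed TYPE: reads help text on stdin and succeeds if TYPE appears
# as a whole word. Intended use:
#   ./llama-quantize --help 2>&1 | type_listed Q4_K_M
type_listed() {
    grep -qw -- "$1"
}

# has_free_space DIR BYTES: succeeds if the filesystem holding DIR reports
# at least BYTES available. Uses the POSIX "df -Pk" layout, where the
# fourth column of the second line is the available space in KiB.
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    LC_ALL=C awk -v a="$avail_kb" -v n="$2" 'BEGIN { exit !(a * 1024 >= n) }'
}

# Example: require as much free space as the input file occupies.
# need=$(wc -c < /path/to/model-fp16.gguf)
# has_free_space /path/to "$need" || echo "not enough disk space"
```

Matching on a whole word keeps a shorter name from matching a longer one that contains it, since underscores count as word characters.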