How to quantize a model for llama.cpp

Advanced · 30 min · By Fredoline Eruo

Target environment

  • Ubuntu 24.04 · Ollama 0.4.x
  • Windows 11 · Ollama 0.4.x
  • macOS 15 · Ollama 0.4.x

PREREQUISITES

A llama.cpp build that includes the llama-quantize binary, and a base model in FP16 GGUF format.

What this does

Converts an unquantized FP16 model into a quantized GGUF file, reducing file size and memory footprint at the cost of some accuracy. The resulting file runs efficiently on consumer hardware.
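
The size reduction can be estimated before running anything. The sketch below multiplies the parameter count by the average bits per weight and divides by 8 to get bytes. The function name estimate_gb is hypothetical, and the 4.85 bits-per-weight figure for Q4_K_M is a commonly quoted approximation, not a value taken from this guide; real files vary by architecture.

```shell
# estimate_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints an approximate file size in GB (1 GB = 10^9 bytes).
# Hypothetical helper; ignores metadata and tensors kept at higher precision.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 16      # FP16 source for a 7B model: prints 14.0
estimate_gb 7 4.85    # Q4_K_M, assuming ~4.85 bits per weight: prints 4.2
```

Treat the result as a planning number for disk and memory, then confirm against the real file size after quantizing.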

Steps

  1. Identify the quantization type to use. Run the help command to list the supported types.

    ./llama-quantize --help
    

    Expected output: List of available quantization types such as Q4_K_M, Q5_K_S, Q8_0.

  2. Run the quantize binary on the FP16 model file. Specify the source file, destination file, and the target quantization type.

    ./llama-quantize /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M
    

    Expected output: One log line per tensor showing its original and quantized size, followed by a summary comparing the original model size with the quantized size.

  3. Verify that the output file exists and is smaller than the source. At Q4_K_M the new file should be roughly 3 to 4x smaller than the FP16 source; more aggressive types shrink it further.

    ls -lh /path/to/model-Q4_K_M.gguf
    

    Expected output: File size well below the FP16 original, typically around 4 to 5 GB for a 7B model at Q4_K_M.

  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
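
The size check and the run record from the steps above could be scripted roughly as follows. This is a minimal sketch, not part of llama.cpp: size_ratio and record_run are hypothetical helper names, the log layout is an assumption, and the version string is supplied by the caller because this guide does not say how to query it.

```shell
# size_ratio SRC DST: prints how many times smaller DST is than SRC.
size_ratio() (
  src_bytes=$(wc -c < "$1") || exit 1
  dst_bytes=$(wc -c < "$2") || exit 1
  awk -v s="$src_bytes" -v d="$dst_bytes" \
    'BEGIN { if (d + 0 == 0) exit 1; printf "%.1f\n", s / d }'
)

# record_run LOG VERSION SRC DST TYPE: appends one reproducibility record to LOG.
# VERSION is whatever identifies the build, supplied by the caller.
record_run() (
  {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'version: %s\n' "$2"
    printf 'command: ./llama-quantize %s %s %s\n' "$3" "$4" "$5"
    printf 'size ratio: %sx\n' "$(size_ratio "$3" "$4")"
    printf '\n'
  } >> "$1"
)
```

A call might look like record_run quantize.log "$(git rev-parse --short HEAD)" /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M, assuming llama.cpp was built from a git checkout and the command is run inside it.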

Verification

./llama-cli -m /path/to/model-Q4_K_M.gguf -p "Hello" -n 16
# Expected: the model loads without error, the load log reports the quantized file type, and the generated text is coherent
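
A cheap structural check is also possible without loading the model: GGUF files begin with the four ASCII bytes GGUF. The sketch below tests only that magic value, so it catches a wrong or truncated-at-start file but says nothing about the tensors. The function name is_gguf is hypothetical.

```shell
# is_gguf FILE: succeeds if FILE starts with the 4-byte GGUF magic.
# Hypothetical helper; does not validate the header fields or tensor data.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example use (path is a placeholder):
#   is_gguf /path/to/model-Q4_K_M.gguf && echo "looks like GGUF"
```

The same check works on the FP16 input, which helps tell a raw checkpoint apart from a converted file before quantization is attempted.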

Common failures

  • Unsupported quantization type selected: The model architecture does not support the chosen type. Use a less aggressive type such as Q8_0.
  • Out of disk space during conversion: The destination needs enough free space for the complete quantized file alongside the FP16 source. Free sufficient space or write the output to a volume with adequate storage.
  • Model not in GGUF format: Convert the raw model files to GGUF first, using convert_hf_to_gguf.py (named convert.py in older checkouts) from the llama.cpp repository, before quantization.
  • Quantized model produces garbled output: Aggressive quantization (below Q4) may degrade output quality. Retry with Q5_K_S or Q4_K_M.
  • Quantization hangs at a specific layer: Interrupt and restart with a less aggressive quantization type.
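
The disk-space failure can be caught before starting. The following is a minimal pre-flight sketch using df in its portable output mode; free_bytes and has_space are hypothetical helper names, and the required size is something you estimate yourself for the chosen quantization type.

```shell
# free_bytes DIR: prints the bytes available on the filesystem holding DIR.
free_bytes() {
  df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# has_space DIR NEEDED_BYTES: succeeds if DIR has at least NEEDED_BYTES free.
has_space() {
  awk -v free="$(free_bytes "$1")" -v need="$2" \
    'BEGIN { exit !(free + 0 >= need + 0) }'
}

# Example use (path and size are placeholders):
#   has_space /path/to 5000000000 || echo "not enough space for the output file"
```

Leaving some headroom above the estimated output size is a reasonable precaution, since the estimate ignores metadata and any tensors kept at higher precision.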
