What this does

Converts an unquantized FP16 GGUF model into a quantized GGUF file, reducing file size and memory footprint at the cost of some accuracy. The resulting file runs efficiently on consumer hardware.

Steps

Choose the target quantization type. Run the help command to list the types your build supports.
```
./llama-quantize --help
```
Expected output: List of available quantization types such as Q4_K_M, Q5_K_S, Q8_0.
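The help text is long, so it can be handy to reduce it to the type names alone. The following is a minimal sketch, not part of llama.cpp: `list_quant_types` is a hypothetical helper, and the pattern assumes type names look like `Q4_K_M`, `Q8_0`, `IQ4_XS`, `F16`, or `BF16`. The exact help layout may differ between builds.

```shell
# Hypothetical helper: read llama-quantize help text on stdin and print the
# quantization type names it mentions, one per line, without duplicates.
# Assumption: names look like Q4_K_M, Q8_0, IQ4_XS, F16, BF16, or F32.
list_quant_types() {
    grep -owE '(I?Q[0-9]+_[A-Z0-9_]+|B?F16|F32)' | LC_ALL=C sort -u
}
```

Usage would be along the lines of `./llama-quantize --help 2>&1 | list_quant_types`.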
Run the quantize binary on the FP16 model file. Specify the source file, destination file, and the target quantization type.
```
./llama-quantize /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M
```
Expected output: Per-tensor log lines showing each tensor's conversion, followed by a summary comparing the original and quantized model sizes.
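A small wrapper can catch argument mistakes before a long conversion starts. This is only a sketch under stated assumptions: `quantize_model` and the `QUANTIZE_BIN` variable are hypothetical names introduced here, and the checks (source exists, destination differs and is not overwritten) are one reasonable policy rather than anything llama.cpp requires.

```shell
# Hypothetical wrapper: validate the arguments, then call the quantize binary.
# QUANTIZE_BIN is an assumed variable; it defaults to ./llama-quantize.
# Usage: quantize_model SOURCE DESTINATION TYPE
quantize_model() {
    src=$1; dst=$2; qtype=$3
    bin=${QUANTIZE_BIN:-./llama-quantize}
    if [ ! -f "$src" ]; then
        echo "error: source file not found: $src" >&2
        return 1
    fi
    if [ "$src" = "$dst" ]; then
        echo "error: destination must differ from source" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "error: refusing to overwrite existing file: $dst" >&2
        return 1
    fi
    # Same argument order as the command above: source, destination, type.
    "$bin" "$src" "$dst" "$qtype"
}
```

For example, `quantize_model /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M` runs the same command as the step above once the checks pass.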
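Verify that the output file exists and is smaller than the source. At Q4_K_M the new file should be roughly 3x to 3.5x smaller than the FP16 source; higher-precision types such as Q8_0 shrink it by only about half.
```
ls -lh /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf
```
Expected output: Both file sizes, with the quantized file significantly smaller than the FP16 original, typically around 4 GB for a 7B model at Q4_K_M.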
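The size check can also be done numerically. The sketch below uses two hypothetical helpers: `size_ratio` divides the source size by the quantized size, and `estimate_bytes` applies the rule of thumb parameters × bits per weight / 8. The figure of about 4.85 bits per weight for Q4_K_M is an approximation assumed here, not a value taken from this guide.

```shell
# Hypothetical helper: print how many times smaller the quantized file is.
# Usage: size_ratio SOURCE_FILE QUANTIZED_FILE
size_ratio() {
    src_bytes=$(wc -c < "$1" | tr -d ' ')
    dst_bytes=$(wc -c < "$2" | tr -d ' ')
    # Avoid dividing by zero if the output file is empty.
    [ "$dst_bytes" -gt 0 ] || return 1
    awk -v s="$src_bytes" -v d="$dst_bytes" 'BEGIN { printf "%.2f\n", s / d }'
}

# Hypothetical helper: rough file size in bytes from the parameter count and
# an assumed bits-per-weight figure. Usage: estimate_bytes PARAMS BITS
estimate_bytes() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}
```

With those assumptions, `estimate_bytes 7000000000 4.85` gives about 4.2 GB, and 16 bits against 4.85 bits corresponds to a ratio of roughly 3.3.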

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
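One way to capture that evidence is to run each command through a logging helper. This is a minimal sketch: `record_run` is a hypothetical function, and the log layout is an arbitrary choice made for illustration.

```shell
# Hypothetical helper: run a command and append the time, the exact command
# line, its combined output, and its exit status to a log file.
# Usage: record_run LOGFILE COMMAND [ARGS...]
record_run() {
    log=$1; shift
    {
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "command: $*"
        echo "output:"
        "$@" 2>&1
        # $? here is the exit status of the command that just ran.
        echo "exit status: $?"
    } >> "$log"
}
```

For example, `record_run run.log ./llama-quantize /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M` logs the conversion. If the binary was built from a git checkout (an assumption), `record_run run.log git -C /path/to/llama.cpp rev-parse --short HEAD` would add the source revision to the same log.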

Verification

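```
./llama-quantize --dry-run /path/to/model-Q4_K_M.gguf
```
Expected output: Dry-run report confirming the quantization type and layer statistics without error. The --dry-run flag may be missing from some llama.cpp builds; confirm it appears in the --help output before relying on it.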
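A check that does not depend on the quantize binary is to inspect the file header directly. GGUF files begin with the four ASCII bytes `GGUF`, followed by a 32-bit format version. The sketch below uses two hypothetical helpers, `is_gguf` and `gguf_version`; the version reader assumes a little-endian host and a little-endian file. It confirms only that the header looks right, not that every tensor is intact.

```shell
# Hypothetical helper: succeed if the file starts with the magic bytes "GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Hypothetical helper: print the GGUF format version, the unsigned 32-bit
# integer stored right after the magic.
# Assumption: the host and the file are both little-endian.
gguf_version() {
    od -An -tu4 -j4 -N4 "$1" | tr -d ' \n'
    echo
}
```

For example, `is_gguf /path/to/model-Q4_K_M.gguf && gguf_version /path/to/model-Q4_K_M.gguf` prints the version only when the magic matches.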

Common failures

Unsupported quantization type selected: The model architecture does not support the chosen type. Use a less aggressive, widely supported type such as Q8_0.
Out of disk space during conversion: The destination needs room for the entire output file; free space equal to the input file size is a safe upper bound. Free sufficient space or write to a directory with adequate storage.
Model not in GGUF format: Convert raw model files to GGUF first using the conversion script in the llama.cpp repository (convert_hf_to_gguf.py in current checkouts, convert.py in older ones) before quantization.
Quantized model produces garbled output: Aggressive quantization (below Q4, for example Q2_K or Q3_K_S) may degrade output quality. Retry with Q4_K_M or Q5_K_S.
Quantization hangs at a specific layer: Interrupt the run and restart with a less aggressive quantization type.
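The disk-space failure in particular can be caught before the run starts. The following sketch assumes a POSIX `df` and uses the input file size as a conservative threshold for the destination directory; `has_space_for` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if DIR has at least as much free space as
# FILE is large. Usage: has_space_for FILE DIR
has_space_for() {
    # File size in KiB, rounded up.
    need_kb=$(( ($(wc -c < "$1") + 1023) / 1024 ))
    # Column 4 of POSIX-format df output is the available space in KiB.
    free_kb=$(df -Pk "$2" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge "$need_kb" ]
}
```

A guarded run might then look like `has_space_for /path/to/model-fp16.gguf /path/to && ./llama-quantize /path/to/model-fp16.gguf /path/to/model-Q4_K_M.gguf Q4_K_M`.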

How to quantize a model for llama.cpp
