19. GGUF Conversion

Chapter 19 of 24 · 20 min

KEY INSIGHT

GGUF (commonly expanded as GPT-Generated Unified Format) is the single-file model format used by llama.cpp. It stores model weights, optionally quantized, together with tokenizer data and metadata, and is designed for efficient inference. Converting a fine-tuned adapter to GGUF allows deployment on CPUs and low-memory systems while keeping most of the specialized behavior learned during fine-tuning; how much is kept depends on the quantization level chosen. The conversion pipeline merges the adapter into the base model, exports the merged model to GGUF, then quantizes it to the target precision. The resulting file can be memory-mapped, so the runtime can load weights without copying them.

```bash
# Build the llama.cpp tools
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
make -j$(nproc)
```

```python
import shutil
import subprocess
import sys
from pathlib import Path

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


def convert_to_gguf(base_model_path, adapter_path, output_path, quantization="Q4_K_M"):
    """
    Convert a LoRA fine-tuned model to GGUF format.

    Steps:
    1. Merge adapter into base model
    2. Export to GGUF format
    3. Quantize to target precision
    """
    merged_path = Path("./tmp_merged")
    fp16_path = Path(f"{output_path}.fp16.gguf")

    # Step 1: Merge the adapter weights into the base model
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
    adapter_model = PeftModel.from_pretrained(base_model, adapter_path)
    merged_model = adapter_model.merge_and_unload()
    merged_model.save_pretrained(merged_path)
    # The converter reads the tokenizer files from the same directory
    AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)

    # Step 2: Export to GGUF using llama.cpp
    # (recent llama.cpp checkouts name this script convert_hf_to_gguf.py)
    cmd = [
        sys.executable, "./llama.cpp/convert.py",
        str(merged_path),
        "--outfile", str(fp16_path),
        "--outtype", "f16",
    ]
    subprocess.run(cmd, check=True)

    # Step 3: Quantize
    # (recent llama.cpp builds name this binary llama-quantize)
    quantize_cmd = [
        "./llama.cpp/build/bin/quantize",
        str(fp16_path),
        f"{output_path}.{quantization}.gguf",
        quantization,
    ]
    subprocess.run(quantize_cmd, check=True)

    # Cleanup: remove the merged checkpoint and the intermediate FP16 file
    shutil.rmtree(merged_path)
    fp16_path.unlink()
```

**Quantization levels** trade accuracy for size. The figures below are approximate and vary with the model and the evaluation used:

| Format | Size (7B) | Accuracy (relative to FP16) |
|--------|-----------|-----------------------------|
| FP16   | ~14GB     | Baseline                    |
| Q5_K_M | ~4.5GB    | ~98%                        |
| Q4_K_M | ~3.8GB    | ~97%                        |
| Q3_K_M | ~3.0GB    | ~95%                        |
| Q2_K   | ~2.7GB    | ~93%                        |

```bash
# Manual conversion workflow
# (recent llama.cpp versions rename these to convert_hf_to_gguf.py and llama-quantize)
python llama.cpp/convert.py ./merged_model --outfile model_fp16.gguf --outtype f16
./llama.cpp/build/bin/quantize model_fp16.gguf model_q4_k_m.gguf Q4_K_M
```
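The sizes in the quantization table can be sanity-checked with simple arithmetic: file size times 8, divided by the parameter count, gives the effective number of bits stored per weight. The sketch below is only an illustration; the helper name `bits_per_weight` is hypothetical, the sizes are the approximate table values, and 1 GB is assumed to mean 10^9 bytes.

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Effective bits stored per parameter, treating 1 GB as 1e9 bytes."""
    # (size_gb * 1e9 bytes * 8 bits) / (params_billions * 1e9 weights)
    return file_size_gb * 8 / params_billions


# Approximate 7B file sizes copied from the quantization table
table_sizes_gb = {
    "FP16": 14.0,
    "Q5_K_M": 4.5,
    "Q4_K_M": 3.8,
    "Q3_K_M": 3.0,
    "Q2_K": 2.7,
}

for name, size_gb in table_sizes_gb.items():
    print(f"{name:7s} {bits_per_weight(size_gb, 7.0):5.2f} bits/weight")
```

The FP16 row works out to 16 bits per weight, as expected. The quantized rows land somewhat above their nominal bit width, which is plausible because the file also carries metadata and the mixed K-quant schemes keep some tensors at higher precision.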

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
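One way to capture the details this checkpoint asks for is to collect them into a single dictionary next to the model file. The sketch below uses only the standard library; the function names (`read_gguf_header`, `package_version`, `run_record`) are hypothetical, and the header check assumes a little-endian GGUF file, which begins with the four bytes `GGUF` followed by a 32-bit version number.

```python
import platform
import struct
import sys
from importlib import metadata
from pathlib import Path


def read_gguf_header(path):
    """Return (magic_ok, version) read from the first 8 bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8:
        return False, None
    # Assumes little-endian layout: 4 magic bytes, then a uint32 version
    version = struct.unpack("<I", head[4:])[0]
    return head[:4] == b"GGUF", version


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def run_record(model_path, observed_output=""):
    """Collect runtime, package, and model-file details for one run."""
    model_path = Path(model_path)
    magic_ok, gguf_version = read_gguf_header(model_path)
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "llama_cpp_python": package_version("llama-cpp-python"),
        "model_path": str(model_path.resolve()),
        "model_size_bytes": model_path.stat().st_size,
        "gguf_magic_ok": magic_ok,
        "gguf_version": gguf_version,
        "observed_output": observed_output,
    }
```

Calling `run_record("model_q4_k_m.gguf", observed_output=...)` and saving the result with `json.dumps(..., indent=2)` gives a small, diffable record of each experiment. A `gguf_magic_ok` value of `False` is a quick signal that the conversion step did not produce a valid file.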

EXERCISE: Benchmark Quantization Impact

Compare inference results across quantization levels:

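```python
import time
from pathlib import Path

from llama_cpp import Llama


def benchmark_quantization(model_paths, test_prompt, max_tokens=100):
    """Measure latency and output length across quantizations."""
    results = {}

    for path in map(Path, model_paths):
        model = Llama(model_path=str(path), n_ctx=2048, verbose=False)

        # The completion dict follows the OpenAI response shape and has no
        # timing field, so measure wall-clock time around the call instead
        start = time.perf_counter()
        output = model(test_prompt, max_tokens=max_tokens)
        latency = time.perf_counter() - start

        results[path.name] = {
            "latency": latency,  # seconds, prompt processing included
            "tokens": output["usage"]["completion_tokens"],
        }

    return results
```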
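To compare the runs, it helps to turn raw latency and token counts into throughput and a ratio against one baseline file. The helper below is a minimal sketch that assumes the results dictionary has the shape returned above (a `latency` in seconds and a `tokens` count per file); the name `summarize` and the sample numbers are made up for illustration and are not measurements.

```python
def summarize(results, baseline_name):
    """Add tokens/sec and speed relative to a baseline entry."""
    base = results[baseline_name]
    base_tps = base["tokens"] / base["latency"]

    summary = {}
    for name, row in results.items():
        tps = row["tokens"] / row["latency"]
        summary[name] = {
            "tokens_per_sec": tps,
            "speedup": tps / base_tps,  # 1.0 means same speed as baseline
        }
    return summary


# Made-up numbers purely to show the input shape; not real measurements
example = {
    "model_fp16.gguf": {"latency": 10.0, "tokens": 100},
    "model_q4_k_m.gguf": {"latency": 4.0, "tokens": 100},
}

for name, row in summarize(example, "model_fp16.gguf").items():
    print(f"{name:20s} {row['tokens_per_sec']:6.1f} tok/s  x{row['speedup']:.2f}")
```

Throughput alone says nothing about output quality, so pair these numbers with a side-by-side read of the generated text for each quantization level.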