06. QLoRA: Quantized LoRA

Chapter 6 of 24 · 20 min

QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning of large models on consumer hardware by combining two techniques: 4-bit quantization of the frozen base model and LoRA adaptation trained in higher precision. This combination achieves fine-tuning quality comparable to full 16-bit fine-tuning while dramatically reducing memory requirements.

The key innovation in QLoRA involves not just quantization but a specific quantization format called 4-bit NormalFloat (NF4) designed for normally distributed weights. Additionally, QLoRA uses double quantizationâ€”quantizing the quantization constants themselvesâ€”to further reduce memory overhead.

When fine-tuning with QLoRA, the base model weights are loaded in 4-bit NF4 format and kept frozen. LoRA adapters are stored in bf16 format and updated during training. At each forward pass, weights are dequantized on-the-fly for computation, re-quantized afterward, and then combined with the LoRA adaptation.

Memory savings come primarily from the base model quantization. A 7B parameter model in bf16 requires approximately 14GB just for weights. The same model in 4-bit NF4 requires roughly 3.5GB. This enables loading 7B models on consumer GPUs with 8-12GB of VRAM, previously impossible with full-precision approaches.

The quantization introduces a small quality degradation in the base model representation, but the LoRA adaptation can compensate for this degradation through training. The combination achieves results comparable to full-precision fine-tuning in many benchmarks, particularly when sufficient training data is available.

Implementation through libraries like bitsandbytes and PEFT provides turnkey support for QLoRA configurations. The primary configuration choices involve quantization bits (4-bit is standard), block size for quantization granularity, and compute dtype for the LoRA operations.

EXERCISE

Configure a QLoRA training setup for a 7B model. Calculate expected memory requirements and verify GPU compatibility.

# qlora_memory_calc.py
import torch

def calculate_qlora_memory(
    model_size_b: float,
    quantization_bits: int = 4,
    rank: int = 8,
    num_layers: int = 32,
    batch_size: int = 4,
    seq_length: int = 512,
    precision: str = "bf16"
) -> dict:
    """Calculate memory requirements for QLoRA training."""
    
    # Base model in quantized format
    base_bytes_per_param = quantization_bits / 8
    base_model_bytes = model_size_b * 1e9 * base_bytes_per_param
    
    # LoRA parameters (higher precision)
    lora_bytes_per_param = 2 if precision == "bf16" else 4
    lora_params = 2 * rank * 4096 * num_layers * 2  # Q and V only
    lora_bytes = lora_params * lora_bytes_per_param
    
    # Optimizer states for LoRA only (32-bit)
    optimizer_bytes = lora_params * 4
    
    # Gradient states (same precision as compute)
    gradient_bytes = lora_params * (2 if precision == "bf16" else 4)
    
    # Activations (rough estimate)
    # Depends heavily on batch size and sequence length
    activations_per_sample = num_layers * 4096 * seq_length * 4  # 4 bytes
    activation_bytes = batch_size * activations_per_sample
    
    # KV cache during training
    kv_cache_bytes = 2 * batch_size * num_layers * 64 * seq_length * 512 * 2
    
    return {
        "base_model_gb": base_model_bytes / 1e9,
        "lora_params_mb": lora_bytes / 1e6,
        "optimizer_mb": optimizer_bytes / 1e6,
        "gradients_mb": gradient_bytes / 1e6,
        "activations_gb": activation_bytes / 1e9,
        "kv_cache_gb": kv_cache_bytes / 1e9,
        "total_estimated_gb": (base_model_bytes + lora_bytes + 
                               optimizer_bytes + gradient_bytes + 
                               activation_bytes + kv_cache_bytes) / 1e9
    }

# 7B model example
memory = calculate_qlora_memory(
    model_size_b=7.0,
    quantization_bits=4,
    rank=8,
    num_layers=32,
    batch_size=4,
    seq_length=512
)

print("Memory Breakdown for 7B QLoRA Training:")
print(f"  Base model (4-bit):  {memory['base_model_gb']:.2f} GB")
print(f"  LoRA parameters:     {memory['lora_params_mb']:.2f} MB")
print(f"  Optimizer states:    {memory['optimizer_mb']:.2f} MB")
print(f"  Gradients:           {memory['gradients_mb']:.2f} MB")
print(f"  Activations:         {memory['activations_gb']:.2f} GB")
print(f"  KV Cache:            {memory['kv_cache_gb']:.2f} GB")
print(f"  Total estimated:    {memory['total_estimated_gb']:.2f} GB")

# Check GPU compatibility
def check_gpu_compatibility(required_gb: float, safety_margin: float = 1.2):
    """Check if GPU can handle the memory requirement."""
    if torch.cuda.is_available():
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        with_padding = required_gb * safety_margin
        return {
            "gpu_total_gb": gpu_memory,
            "required_with_margin_gb": with_padding,
            "compatible": gpu_memory >= with_padding
        }
    return {"gpu_available": False}

print(f"\nGPU Check: {check_gpu_compatibility(memory['total_estimated_gb'])}")