06. Quantization Quality Tradeoffs

Chapter 6 of 18 · 15 min

Measuring quantization quality requires more than reporting memory savings. Perplexity, the exponential of the average per-token negative log-likelihood on held-out text, provides a standardized metric (lower is better). However, perplexity alone doesn't capture task-specific performance.
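As a minimal sketch of the metric itself, the function below computes perplexity from a list of per-token natural-log probabilities. The function name and the toy input are illustrative assumptions, not part of any library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities (illustrative helper)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is approximately 4.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

Read this way, a perplexity of N roughly means the model is as uncertain as if it were choosing uniformly among N tokens at each step.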

Standard evaluation datasets:

  • WikiText-2 / WikiText-103: Language-modeling perplexity benchmarks
  • C-Eval: Chinese multiple-choice evaluation
  • HumanEval: Python code completion
  • MMLU: 57-subject multiple-choice test

Create a benchmark suite reflecting actual use cases. A code generation model should be evaluated on code tasks, not just perplexity.

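import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_perplexity(model_name, quant_path=None, max_length=1024, stride=512):
    """Sliding-window perplexity on the WikiText-2 test split."""
    from datasets import load_dataset

    # Load the quantized checkpoint if one is given, otherwise the baseline.
    model = AutoModelForCausalLM.from_pretrained(
        quant_path or model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    encodings = tokenizer("\n\n".join(data["text"]), return_tensors="pt")

    seq_len = encodings.input_ids.size(1)
    nll_sum = 0.0
    n_tokens = 0
    prev_end = 0

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end  # tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)

        # Mask the overlapping context with -100 so each token is scored once.
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            # Without labels the model returns no loss (outputs.loss is None).
            outputs = model(input_ids, labels=target_ids)

        # The model shifts labels by one position internally, so count the
        # scored tokens on the shifted labels.
        n_scored = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.float().item() * n_scored
        n_tokens += n_scored

        prev_end = end_loc
        if end_loc == seq_len:
            break

    # Perplexity is exp of the token-weighted mean negative log-likelihood.
    return math.exp(nll_sum / n_tokens)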

Expected quality degradation at different bit widths:

Format        Bits  Relative perplexity (vs. FP16)  Acceptable?
FP16          16    1.00 (baseline)                 Yes
GPTQ          8     1.02-1.05                       Yes
GGUF Q8       8     1.03-1.06                       Yes
GPTQ          4     1.05-1.10                       Usually
AWQ           4     1.04-1.08                       Usually
GGUF Q4_K_M   4     1.05-1.12                       Usually
GGUF Q3_K_M   3     1.10-1.20                       Marginal
GGUF Q2_K     2     1.15-1.30+                      Problematic
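To place your own measurements on this scale, divide the quantized model's perplexity by the FP16 baseline's. The sketch below does that and maps the ratio to the labels used in the table. The cutoff values are assumptions loosely read off the upper ends of the ranges above, not fixed standards, and both function names are hypothetical.

```python
def relative_perplexity(quant_ppl, baseline_ppl):
    """Ratio of quantized perplexity to the FP16 baseline (1.00 = no change)."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return quant_ppl / baseline_ppl

def quality_band(rel_ppl):
    """Map a relative perplexity to a label; thresholds are illustrative assumptions."""
    if rel_ppl <= 1.06:
        return "Yes"
    if rel_ppl <= 1.12:
        return "Usually"
    if rel_ppl <= 1.20:
        return "Marginal"
    return "Problematic"

# Placeholder perplexities for illustration only, not measured results.
rel = relative_perplexity(quant_ppl=6.3, baseline_ppl=6.0)
print(round(rel, 3), quality_band(rel))
```

Measure both perplexities with the same dataset, tokenizer, context length, and stride; otherwise the ratio is not meaningful.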

Beyond perplexity, degradation varies by task. Instruction following tends to degrade more than plain text completion under aggressive quantization, and math capability often suffers severely. Logical reasoning holds up reasonably well by comparison. These patterns vary by model, so measure them rather than assume them.
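One way to make this comparison concrete is to score the baseline and the quantized model on the same tasks and rank the tasks by relative loss. The sketch below assumes higher-is-better scores (accuracy, pass rate) keyed by task name. The helper names and the example scores are hypothetical placeholders, not measurements.

```python
def relative_drop(baseline_scores, quantized_scores):
    """Fractional score loss per task; positive values mean the quantized model is worse."""
    drops = {}
    for task, base in baseline_scores.items():
        if base <= 0:
            raise ValueError(f"baseline score for {task!r} must be positive")
        drops[task] = (base - quantized_scores[task]) / base
    return drops

def worst_task(baseline_scores, quantized_scores):
    """Name of the task with the largest relative drop."""
    drops = relative_drop(baseline_scores, quantized_scores)
    return max(drops, key=drops.get)

# Placeholder scores for illustration only.
baseline = {"completion": 0.80, "instructions": 0.70, "math": 0.40}
quantized = {"completion": 0.78, "instructions": 0.63, "math": 0.30}

for task, drop in sorted(relative_drop(baseline, quantized).items()):
    print(f"{task}: {drop:.1%} relative drop")
print("most affected:", worst_task(baseline, quantized))
```

Relative rather than absolute drops keep tasks with low baseline scores from looking artificially robust.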

Critical factors affecting quantization quality:

Calibration data alignment: Out-of-domain calibration samples produce worse results. A code model calibrated on Wikipedia text will tend to lose more quality on code tasks than the same model calibrated on code.
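A minimal sketch of drawing calibration samples from your own domain corpus is shown below. The helper name, the default of 128 samples, and the minimum-length filter are assumptions for illustration. How the selected texts are passed to a quantizer depends on the tool you use.

```python
import random

def pick_calibration_texts(corpus, n_samples=128, min_chars=200, seed=0):
    """Select a reproducible random subset of in-domain texts for calibration."""
    # Drop near-empty snippets that carry little signal about activations.
    candidates = [text for text in corpus if len(text.strip()) >= min_chars]
    if len(candidates) < n_samples:
        raise ValueError(
            f"only {len(candidates)} usable texts, need {n_samples}"
        )
    rng = random.Random(seed)  # fixed seed so the calibration set is repeatable
    return rng.sample(candidates, n_samples)

# Toy corpus standing in for real in-domain data (for example, source files).
toy_corpus = [f"def f{i}(x):\n    return x + {i}\n" * 3 for i in range(20)]
calibration = pick_calibration_texts(toy_corpus, n_samples=8, min_chars=20)
print(len(calibration))
```

Fixing the seed matters when comparing quantization settings, since a different calibration set can shift results on its own.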

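Model architecture: Some models quantize better than others. Models trained with quantization-aware training hold up better than models that are only quantized after training, and sensitivity also varies across model families and sizes, so measure each model rather than assuming results transfer.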

Group size: Smaller groups (64 vs. 128) preserve more quality but slightly increase model size, because each group stores its own scale and zero point. For most use cases, group_size=128 offers a favorable quality-per-memory tradeoff.
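The size cost of a smaller group can be estimated with simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one zero point with the same bit width as the weights. Real formats differ in how they pack this metadata, so treat the numbers as a rough estimate.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Average storage per weight, including per-group scale and zero point.

    Assumes one scale and one zero point per group (an illustrative layout).
    """
    if zero_bits is None:
        zero_bits = bits  # assumption: zero point uses the weight bit width
    return bits + (scale_bits + zero_bits) / group_size

g128 = effective_bits_per_weight(bits=4, group_size=128)
g64 = effective_bits_per_weight(bits=4, group_size=64)
print(g128)  # 4.15625
print(g64)   # 4.3125
print(f"size increase from g128 to g64: {g64 / g128 - 1:.1%}")
```

Under these assumptions, halving the group size from 128 to 64 grows a 4-bit model by a few percent.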

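Descending activation ordering: GPTQ's desc_act option quantizes weight columns in order of decreasing activation magnitude, so the most influential columns are handled first. It typically improves perplexity by a small amount but may slow inference 10-20% on some architectures.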
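The ordering idea behind desc_act can be illustrated in a few lines. The sketch below ranks input channels by the sum of their squared calibration activations, largest first. It is a simplified illustration of the ordering only, not GPTQ's implementation, and the function name and sample values are hypothetical.

```python
def activation_order(activations):
    """Column indices sorted by descending sum of squared activations.

    `activations` is a list of calibration samples, each a list with one
    value per input channel (column).
    """
    n_cols = len(activations[0])
    energy = [sum(row[j] ** 2 for row in activations) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: energy[j], reverse=True)

# Two toy calibration samples with three input channels each.
samples = [
    [0.1, 2.0, -1.0],
    [0.2, -3.0, 0.5],
]
print(activation_order(samples))  # [1, 2, 0]
```

Because the resulting order is a permutation of the columns, inference kernels must account for it, which is one reason the option can cost speed.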

EXERCISE

Evaluate a 4-bit quantized model on both perplexity and a task relevant to your use case (code generation, instruction following, etc.). Compare results. Document which capability suffers most from quantization.