Quantization Quality Tradeoffs — Model Optimization for Local Inference (Chapter 6)

Measuring quantization quality requires more than reporting memory savings. Perplexity, the exponentiated average negative log-likelihood a model assigns to held-out text (lower is better), provides a standardized metric. However, perplexity alone doesn't capture task-specific performance.
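
To make the metric concrete, the sketch below computes perplexity from a list of per-token log-probabilities. The function name perplexity_from_logprobs and the sample values are illustrative assumptions, not part of any library.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(-mean(log p(token | context))), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is approximately 4.
uniform_quarter = [math.log(0.25)] * 8
print(perplexity_from_logprobs(uniform_quarter))
```

Reading perplexity as an "effective number of equally likely choices" makes ratios between a baseline and a quantized model easier to interpret.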

Standard evaluation datasets:

WikiText-2 / WikiText-103: Language-modeling perplexity benchmarks
C-Eval: Chinese multiple-choice evaluation
HumanEval: Python code completion
MMLU: 57-subject multiple-choice test

Create a benchmark suite reflecting actual use cases. A code generation model should be evaluated on code tasks, not just perplexity.
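
One way to organize such a suite is sketched below, assuming each task can be reduced to a score between 0 and 1 and weighted by how much of the real workload it represents. BenchmarkTask, exact_match_task, run_suite, the weights, and toy_generate are all hypothetical names and values chosen for illustration; a real suite would call the model under test instead of the stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Generate = Callable[[str], str]  # prompt -> model output

@dataclass
class BenchmarkTask:
    name: str
    weight: float                       # assumed share of real-world usage
    run: Callable[[Generate], float]    # returns a score in [0, 1]

def exact_match_task(name: str, weight: float,
                     cases: List[Tuple[str, str]]) -> BenchmarkTask:
    """Build a task scored by exact string match on (prompt, expected) pairs."""
    def run(generate: Generate) -> float:
        hits = sum(generate(prompt).strip() == expected
                   for prompt, expected in cases)
        return hits / len(cases)
    return BenchmarkTask(name, weight, run)

def run_suite(generate: Generate, tasks: List[BenchmarkTask]) -> Dict[str, float]:
    """Score every task, then add a usage-weighted average under 'weighted'."""
    scores = {t.name: t.run(generate) for t in tasks}
    total_weight = sum(t.weight for t in tasks)
    scores["weighted"] = sum(t.weight * scores[t.name] for t in tasks) / total_weight
    return scores

# Stand-in for a real model call (hypothetical).
def toy_generate(prompt: str) -> str:
    answers = {"2+2=": "4", "def add(a, b): return": "a + b"}
    return answers.get(prompt, "")

suite = [
    exact_match_task("code", 0.7, [("def add(a, b): return", "a + b")]),
    exact_match_task("arithmetic", 0.3, [("2+2=", "4"), ("3+3=", "6")]),
]
results = run_suite(toy_generate, suite)
print(results)
```

Running the same suite against the FP16 baseline and each quantized variant gives numbers that are directly comparable for the intended workload.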

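import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_perplexity(model_name, quant_path=None):
    """Sliding-window perplexity on the WikiText-2 test split."""
    model = AutoModelForCausalLM.from_pretrained(
        quant_path or model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    encodings = tokenizer("\n\n".join(data["text"]), return_tensors="pt")
    
    max_length = 1024  # tokens per forward pass
    stride = 512       # new tokens scored per window
    
    seq_len = encodings.input_ids.size(1)
    nll_sum = 0.0   # total negative log-likelihood over scored tokens
    n_scored = 0    # number of tokens that contributed to the loss
    prev_end = 0
    
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end  # tokens not scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
        
        # Overlapping tokens serve as context only; -100 excludes them from the loss.
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100
        
        with torch.no_grad():
            # Without labels the model returns loss=None.
            outputs = model(input_ids, labels=target_ids)
        
        # outputs.loss is a mean over scored tokens. Labels are shifted by one
        # internally, so the first position of each window has no prediction.
        n_valid = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.float().item() * n_valid
        n_scored += n_valid
        
        prev_end = end_loc
        if end_loc == seq_len:
            break
    
    return math.exp(nll_sum / n_scored)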

Expected quality degradation at different bit widths:

Format	Bits	Relative Perplexity	Acceptable?
FP16	16	1.00 (baseline)	Yes
GPTQ	8	1.02-1.05	Yes
GGUF Q8_0	8	1.03-1.06	Yes
GPTQ	4	1.05-1.10	Usually
AWQ	4	1.04-1.08	Usually
GGUF Q4_K_M	4	1.05-1.12	Usually
GGUF Q3_K_M	3	1.10-1.20	Marginal
GGUF Q2_K	2	1.15-1.30+	Problematic
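
The "Relative Perplexity" column is the quantized model's perplexity divided by the FP16 baseline's on the same data. A minimal sketch of that calculation follows; the cut-offs in rate_relative_perplexity are assumptions read loosely off the table above, and the perplexity values in the example are made up for illustration, not measurements.

```python
def relative_perplexity(quant_ppl, baseline_ppl):
    """Ratio of quantized to baseline perplexity; 1.0 means no degradation."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return quant_ppl / baseline_ppl

def rate_relative_perplexity(rel):
    """Map a ratio to a coarse label (thresholds are illustrative assumptions)."""
    if rel <= 1.06:
        return "Yes"
    if rel <= 1.12:
        return "Usually"
    if rel <= 1.20:
        return "Marginal"
    return "Problematic"

# Hypothetical numbers: baseline FP16 perplexity 5.50, 4-bit model 6.05.
rel = relative_perplexity(6.05, 5.50)
print(f"{rel:.2f}", rate_relative_perplexity(rel))
```

Both perplexities must come from the same dataset, tokenizer, and window settings, otherwise the ratio is not meaningful.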

Beyond perplexity, degradation varies by task. Instruction following tends to degrade more than plain text completion under aggressive quantization, and math capability often suffers most severely, while logical reasoning tends to hold up reasonably well.
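
Task-level differences like these are easier to spot when each task's quantized score is expressed as a fraction of its baseline score. The sketch below assumes per-task accuracies are already available as dictionaries; the function names, the 0.95 threshold, and the example scores are hypothetical.

```python
def task_retention(baseline_scores, quant_scores):
    """Fraction of the baseline score each task retains after quantization."""
    return {
        task: quant_scores[task] / base
        for task, base in baseline_scores.items()
        if base > 0  # skip tasks the baseline cannot do at all
    }

def flag_regressions(retention, min_retained=0.95):
    """Names of tasks that kept less than min_retained of their baseline score."""
    return sorted(task for task, kept in retention.items() if kept < min_retained)

# Made-up accuracies for illustration only.
baseline = {"completion": 0.80, "instructions": 0.70, "math": 0.40}
quantized = {"completion": 0.79, "instructions": 0.63, "math": 0.28}

retention = task_retention(baseline, quantized)
print(flag_regressions(retention))
```

A single averaged score would hide the fact that some tasks regress far more than others, which is why the report is kept per task.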

Critical factors affecting quantization quality:

Calibration data alignment: Out-of-domain calibration samples produce worse results. For example, a code model calibrated on Wikipedia text will typically lose more accuracy on code tasks than the same model calibrated on code.
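
A minimal sketch of assembling an in-domain calibration set is shown below. build_calibration_set, its parameters, and the synthetic code_snippets corpus are assumptions for illustration; in practice the texts would be drawn from the data the model will actually serve.

```python
import random

def build_calibration_set(domain_texts, n_samples=128, min_chars=200, seed=0):
    """Sample calibration texts from in-domain data, skipping very short ones."""
    usable = [t for t in domain_texts if len(t) >= min_chars]
    if len(usable) < n_samples:
        raise ValueError(
            f"only {len(usable)} usable texts, need {n_samples}"
        )
    rng = random.Random(seed)  # fixed seed keeps quantization runs reproducible
    return rng.sample(usable, n_samples)

# Synthetic stand-in for a corpus of code files (hypothetical).
code_snippets = [
    f"def f{i}(x):\n    return x + {i}\n" * 10 for i in range(200)
]
calibration = build_calibration_set(code_snippets, n_samples=128)
print(len(calibration))

# The resulting list of strings is the kind of input that quantization
# tooling accepts as custom calibration data, e.g. the `dataset` argument of
# transformers.GPTQConfig (check the version you use before relying on this).
```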

Model architecture: Some architectures quantize better than others. Models trained with quantization-aware training generally hold up better than models quantized only after training, and sensitivity to post-training quantization varies between model families and generations.

Group size: Smaller groups (64 vs 128) preserve quality but slightly increase model size. The quality-per-memory tradeoff is favorable for most use cases at group_size=128.
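
The size cost of a smaller group can be estimated with simple arithmetic. The sketch below assumes each group stores one FP16 scale plus one zero point at the same width as the weights; real formats pack this metadata differently, so treat the numbers as rough estimates. The function name effective_bits_per_weight is hypothetical.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Average storage per weight, including per-group scale and zero point."""
    if zero_bits is None:
        zero_bits = bits  # assumption: zero point stored at weight precision
    return bits + (scale_bits + zero_bits) / group_size

for g in (128, 64, 32):
    print(g, effective_bits_per_weight(4, g))
```

Under these assumptions, halving the group size from 128 to 64 adds about 0.16 bits per weight to a 4-bit model, roughly a 4% size increase.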

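Descending activation ordering: GPTQ's desc_act option (often called act-order) quantizes weight columns in order of decreasing activation magnitude, so the most influential columns are handled first. It typically improves perplexity by 1-3 points but may slow inference 10-20% on some architectures.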
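
The reordering itself is just a sort. The sketch below shows the idea in plain Python, assuming the importance of each column is given by the diagonal of the Hessian estimate built from calibration activations, as in the GPTQ reference code. The function names and the example values are illustrative, and the actual quantization step is omitted.

```python
def act_order_permutation(hessian_diag):
    """Column indices sorted so the largest diagonal entries come first."""
    return sorted(range(len(hessian_diag)),
                  key=lambda i: hessian_diag[i],
                  reverse=True)

def invert_permutation(perm):
    """Permutation that restores the original column order after quantizing."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse

# Made-up per-column importance values.
diag = [0.2, 3.1, 0.9, 1.5]
perm = act_order_permutation(diag)
inv = invert_permutation(perm)

reordered = [diag[i] for i in perm]       # processing order
restored = [reordered[i] for i in inv]    # back to original layout
print(perm, inv, restored == diag)
```

Because columns from the same quantization group are no longer adjacent after reordering, kernels need an extra index lookup per column, which is one plausible source of the inference slowdown.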