06. Quantization Quality Tradeoffs
Measuring quantization quality requires more than reporting memory savings. Perplexity, the exponentiated average negative log-likelihood per token (lower means the model is less uncertain when predicting text), provides a standardized metric. However, perplexity alone doesn't capture task-specific performance.
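To make the metric concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities. The function name `perplexity` and the sample inputs are illustrative assumptions, not part of any library.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative input: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_guess = [math.log(1 / 4)] * 10
print(f"perplexity: {perplexity(uniform_guess):.3f}")
```

A perplexity of about 4 here reads as "the model is choosing among roughly four equally likely tokens at each step", which is why small relative changes in perplexity can still be meaningful.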
Standard evaluation datasets:
- WikiText-2/WikiText-103: Language-modeling perplexity benchmarks
- C-Eval: Chinese multiple-choice evaluation
- HumanEval: Python code completion
- MMLU: 57-subject multiple-choice test
Create a benchmark suite reflecting actual use cases. A code generation model should be evaluated on code tasks, not just perplexity.
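
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer


    def evaluate_perplexity(model_name, quant_path=None, max_length=1024, stride=512):
        """Sliding-window perplexity on the WikiText-2 test split."""
        model = AutoModelForCausalLM.from_pretrained(
            quant_path or model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model.eval()
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
        encodings = tokenizer("\n\n".join(data["text"]), return_tensors="pt")
        seq_len = encodings.input_ids.size(1)

        nll_sum = 0.0   # total negative log-likelihood over scored tokens
        n_tokens = 0    # number of scored tokens
        prev_end = 0
        for begin_loc in range(0, seq_len, stride):
            end_loc = min(begin_loc + max_length, seq_len)
            trg_len = end_loc - prev_end  # tokens not scored by an earlier window
            input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100  # overlap is context only, not scored

            with torch.no_grad():
                # Without labels the model returns no loss
                outputs = model(input_ids, labels=target_ids)

            # The loss is a mean over labels left after the internal one-token shift
            num_scored = (target_ids[:, 1:] != -100).sum().item()
            nll_sum += outputs.loss.item() * num_scored
            n_tokens += num_scored

            prev_end = end_loc
            if end_loc == seq_len:
                break

        return float(torch.exp(torch.tensor(nll_sum / n_tokens)))
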
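Perplexity covers only the language-modeling side of a benchmark suite. For code tasks such as HumanEval, results are commonly reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c pass; the function name and the per-problem counts are hypothetical, and generating and executing the completions is assumed to happen elsewhere.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c passing: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: any draw of k samples contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem pass counts out of n=10 samples each, for illustration only.
fp16_passes = [10, 7, 4, 0]
int4_passes = [9, 5, 2, 0]

for label, passes in (("fp16", fp16_passes), ("4-bit", int4_passes)):
    score = sum(pass_at_k(10, c, 1) for c in passes) / len(passes)
    print(f"{label}: pass@1 = {score:.3f}")
```

Running the same problems and sampling settings against the baseline and the quantized model keeps the comparison about quantization rather than about decoding noise.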
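Expected quality degradation at different bit widths, expressed as perplexity relative to the FP16 baseline (exact values vary with the model and calibration data):
| Format | Nominal bits | Relative perplexity (quantized / FP16) | Acceptable? |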
|---|---|---|---|
| FP16 | 16 | 1.00 (baseline) | Yes |
| GPTQ | 8 | 1.02-1.05 | Yes |
| GGUF Q8_0 | 8 | 1.03-1.06 | Yes |
| GPTQ | 4 | 1.05-1.10 | Usually |
| AWQ | 4 | 1.04-1.08 | Usually |
| GGUF Q4_K_M | 4 | 1.05-1.12 | Usually |
| GGUF Q3_K_M | 3 | 1.10-1.20 | Marginal |
| GGUF Q2_K | 2 | 1.15-1.30+ | Problematic |
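The relative perplexity column is a ratio of quantized to baseline perplexity measured on the same data. A minimal sketch of that calculation follows; the function names are hypothetical, and the cutoffs are assumptions read loosely off the upper ends of the ranges in the table, not fixed standards.

```python
def relative_perplexity(baseline_ppl: float, quantized_ppl: float) -> float:
    """Ratio of quantized to baseline perplexity (1.00 means no degradation)."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return quantized_ppl / baseline_ppl


def rate(ratio: float) -> str:
    """Map a ratio to a rough label. Cutoffs are illustrative assumptions."""
    if ratio <= 1.06:
        return "Yes"
    if ratio <= 1.12:
        return "Usually"
    if ratio <= 1.20:
        return "Marginal"
    return "Problematic"


# Hypothetical measurements, for illustration only.
baseline, quantized = 5.50, 5.94
ratio = relative_perplexity(baseline, quantized)
print(f"relative perplexity: {ratio:.2f} -> {rate(ratio)}")
```

Both perplexities need to come from the same dataset, tokenizer, context length, and stride, otherwise the ratio mixes quantization effects with evaluation differences.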
Beyond perplexity, degradation varies by task. At aggressive bit widths (the 2- and 3-bit rows above), instruction following tends to degrade more than plain text completion, and math capability tends to suffer the most, while general logical reasoning holds up comparatively well.
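One way to see which capability suffers most is to compute the fractional score loss per task and rank the results. This is a sketch only: the function names are hypothetical, and the scores are made-up accuracy-style numbers (higher is better), not measurements.

```python
def relative_drops(baseline: dict, quantized: dict) -> dict:
    """Fractional score loss per task: (baseline - quantized) / baseline."""
    drops = {}
    for task, base_score in baseline.items():
        if base_score <= 0:
            raise ValueError(f"baseline score for {task!r} must be positive")
        drops[task] = (base_score - quantized[task]) / base_score
    return drops


def most_degraded(drops: dict) -> str:
    """Task with the largest fractional loss."""
    return max(drops, key=drops.get)


# Hypothetical scores, for illustration only.
fp16_scores = {"completion": 0.80, "instruction_following": 0.70, "math": 0.40}
q3_scores = {"completion": 0.78, "instruction_following": 0.63, "math": 0.30}

drops = relative_drops(fp16_scores, q3_scores)
for task, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:>22}: {drop:6.1%}")
print("most degraded:", most_degraded(drops))
```

Using a relative rather than absolute drop keeps tasks with low baseline scores, such as math, from looking artificially stable.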
Critical factors affecting quantization quality:
- Calibration data alignment: Out-of-domain calibration samples produce worse results. A code model calibrated on Wikipedia text will tend to lose more accuracy on code tasks than the same model calibrated on code.
- Model architecture and training: Some models quantize better than others. Models trained with quantization-aware training hold up better than models quantized purely after training, and sensitivity also differs between model families, so measure each model rather than assuming results transfer.
- Group size: Each group stores its own scale and zero point, so smaller groups (64 vs 128) preserve quality but slightly increase model size. The quality-per-memory tradeoff is favorable for most use cases at group_size=128.
- Descending activation order: GPTQ's desc_act option quantizes weight columns in order of decreasing activation magnitude. It typically lowers perplexity, with the size of the gain depending on the model, but may slow inference 10-20% with some kernels.
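The size cost of a smaller group can be estimated with simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one zero point with the same bit width as the weights; real formats differ in detail (extra index tensors, different metadata layouts), so treat the numbers as rough estimates. The function names and the 7-billion-parameter example are illustrative.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Weight bits plus per-group metadata amortized over the group.

    Assumes one scale and one zero point per group (an illustrative layout).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if zero_bits is None:
        zero_bits = bits
    return bits + (scale_bits + zero_bits) / group_size


def weights_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for group_size in (128, 64):
    bpw = effective_bits_per_weight(4, group_size)
    size = weights_size_gb(7e9, bpw)
    print(f"group_size={group_size}: {bpw:.4f} bits/weight, ~{size:.2f} GB")
```

Under these assumptions, halving the group size from 128 to 64 costs about 0.16 extra bits per weight, which is the "slight increase" that has to be weighed against the quality gain.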
Evaluate a 4-bit quantized model on both perplexity and a task relevant to your use case (code generation, instruction following, etc.). Compare the results with the FP16 baseline and document which capability suffers most from quantization.