GPTQ Quantization — Model Optimization for Local Inference (Chapter 3)

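GPTQ, introduced in the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", addresses a fundamental problem in quantization: naive round-to-nearest approaches degrade model quality because they quantize each weight in isolation and ignore how the rounding errors combine in a layer's output. GPTQ instead solves a layer-wise reconstruction problem. It quantizes a weight matrix one column at a time and adjusts the columns not yet quantized to compensate, minimizing the squared (L2) error between the outputs of the quantized layer and those of the full-precision layer on calibration inputs.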

The algorithm processes each linear layer's weight matrix independently, one layer at a time. For each matrix, it uses activations collected from a small calibration dataset to choose the quantized values that best reconstruct the layer's outputs. This post-training optimization requires the original fp16 model and representative samples, typically 128-512 examples drawn from data similar to the model's training distribution.
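
The column-by-column idea can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the AutoGPTQ implementation: it uses one min/max grid per output row, a dense inverse Hessian with 1% dampening, and no grouping, blocking, or activation ordering. The function names (make_grid, quantize, rtn, gptq_sketch) are hypothetical.

```python
import numpy as np

def make_grid(W, bits=4):
    """Per-row scale and zero-point for a uniform asymmetric grid."""
    maxq = 2 ** bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0  # guard against all-zero rows
    zero = np.round(-wmin / scale)
    return scale, zero, maxq

def quantize(w, scale, zero, maxq):
    """Round to the nearest grid level and map back to floating point."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)

def rtn(W, bits=4):
    """Baseline: round every weight to nearest, independently."""
    scale, zero, maxq = make_grid(W, bits)
    return quantize(W, scale[:, None], zero[:, None], maxq)

def gptq_sketch(W, X, bits=4, damp=0.01):
    """Quantize W (out x in) column by column, using calibration inputs X (in x samples)."""
    W = W.astype(np.float64)  # astype returns a copy, so the caller's W is untouched
    scale, zero, maxq = make_grid(W, bits)
    H = 2.0 * (X @ X.T)  # Hessian of the squared output error
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # dampening for stability
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2  # remove round-off asymmetry
    U = np.linalg.cholesky(Hinv).T  # upper factor: Hinv = U.T @ U
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = quantize(W[:, i], scale, zero, maxq)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # Push this column's rounding error onto the columns not yet quantized.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
X = rng.normal(size=(32, 32)) @ rng.normal(size=(32, 256))  # correlated inputs
Q_rtn = rtn(W)
Q_gptq = gptq_sketch(W, X)
print("RTN output error: ", np.linalg.norm(W @ X - Q_rtn @ X))
print("GPTQ output error:", np.linalg.norm(W @ X - Q_gptq @ X))
```

Both methods use the same 16-level grid per row; the only difference is that the sketch compensates for each column's rounding error before quantizing the next column. On correlated inputs this typically yields a lower output error than round-to-nearest, though the sketch makes no guarantee for every input.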

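import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure quantization
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,  # one scale/zero-point per 128 weights
    desc_act=True,   # activation order for better quality
)

# Load model in fp16; from_pretrained takes the quantize config as its second argument
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    torch_dtype=torch.float16,
)

# Prepare calibration data; quantize() expects tokenized examples
# (input_ids and attention_mask), not raw strings
def get_calibration_samples(n_samples=128):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    samples = []
    for item in data:
        if len(samples) >= n_samples:
            break
        if len(item["text"]) > 50:  # skip blank lines and short headings
            # keep the first 512 characters of each passage
            samples.append(tokenizer(item["text"][:512]))
    return samples

# Quantize, then save the quantized weights and the tokenizer together
model.quantize(get_calibration_samples())
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")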

Critical parameters:

bits: Controls precision. 4-bit offers the best accuracy-per-memory tradeoff. 3-bit requires careful tuning and often underperforms. 2-bit is generally unusable for large models.

group_size: Defines how many consecutive weights share one set of quantization parameters (a scale and a zero-point). Smaller groups (64, 128) preserve quality at the cost of slightly larger models, because each group stores its own parameters. Power-of-two values enable efficient kernels.

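desc_act: Quantizes weight columns in order of decreasing activation magnitude, so the columns that matter most are handled first. This activation-aware ordering usually improves perplexity, with the size of the gain depending on the model and bit width, but it may hurt latency on some architectures and kernels.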
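
The size cost of a smaller group_size can be estimated with simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one zero-point packed at the same width as the weights; these storage widths are assumptions for illustration, and real checkpoints carry extra tensors (for example unquantized embeddings, and a group index when desc_act is enabled). The function names are hypothetical.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Weight bits plus the per-group scale/zero-point overhead, amortized per weight."""
    zero_bits = bits if zero_bits is None else zero_bits
    return bits + (scale_bits + zero_bits) / group_size

def quantized_weight_gb(n_params, bits, group_size):
    """Approximate size in GB (10^9 bytes) of the quantized linear weights only."""
    return n_params * effective_bits_per_weight(bits, group_size) / 8 / 1e9

for g in (32, 64, 128):
    bpw = effective_bits_per_weight(4, g)
    size = quantized_weight_gb(7_000_000_000, 4, g)
    print(f"group_size={g}: {bpw:.4f} bits/weight, ~{size:.2f} GB for 7B parameters")
```

Under these assumptions, halving the group size doubles the per-weight overhead, which is why the growth in model size stays small at 64 or 128 but becomes more noticeable at 32.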

The calibration dataset significantly impacts results. Using prompts from the wrong domain (e.g., code samples for a general-purpose model) produces worse quantization. For instruction-tuned models, use instruction-response pairs rather than raw web text.
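
For an instruction-tuned model, calibration texts can be assembled from instruction-response pairs before tokenization. The sketch below is one possible approach: the prompt template is a placeholder assumption and should be replaced with the format the model was actually tuned on, and build_calibration_texts is a hypothetical helper, not part of any library.

```python
DEFAULT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def build_calibration_texts(pairs, n_samples=128, min_chars=50, template=DEFAULT_TEMPLATE):
    """Format (instruction, response) pairs into calibration strings.

    Pairs whose formatted text is shorter than min_chars are skipped,
    and at most n_samples texts are returned.
    """
    texts = []
    for instruction, response in pairs:
        if len(texts) >= n_samples:
            break
        text = template.format(
            instruction=instruction.strip(),
            response=response.strip(),
        )
        if len(text) >= min_chars:
            texts.append(text)
    return texts

demo_pairs = [
    ("Summarize the benefits of 4-bit quantization.",
     "It cuts weight memory roughly fourfold relative to fp16."),
    ("Hi", "Hello"),  # too short once formatted; filtered out below
]
demo_texts = build_calibration_texts(demo_pairs, min_chars=80)
print(len(demo_texts))
print(demo_texts[0])
```

The resulting strings would then be tokenized in the same way as any other calibration text.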

Peak memory during quantization can reach 2-3x the size of the fp16 model, far more than the final quantized model needs. Budget accordingly when planning the conversion pipeline.

# Expected memory during quantization of 70B model
# ~140GB for fp16 model + ~200GB workspace
# Total: ~340GB VRAM or system RAM
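
The same budget can be worked out for other model sizes. This sketch reads the 2-3x figure as a multiple of the fp16 model size, an interpretation assumed here because it brackets the ~340GB total above; the function names are hypothetical, and actual peak usage depends on the tooling and on how layers are offloaded.

```python
def fp16_size_gb(n_params):
    """fp16 stores 2 bytes per parameter; result in GB (10^9 bytes)."""
    return 2 * n_params / 1e9

def peak_quantization_memory_gb(n_params, overhead_factor):
    """Rough peak memory, as a multiple of the fp16 model size."""
    return fp16_size_gb(n_params) * overhead_factor

n_params_70b = 70_000_000_000
base_gb = fp16_size_gb(n_params_70b)
low_gb = peak_quantization_memory_gb(n_params_70b, 2.0)
high_gb = peak_quantization_memory_gb(n_params_70b, 3.0)
print(f"fp16 model: {base_gb:.0f} GB, estimated peak: {low_gb:.0f}-{high_gb:.0f} GB")
```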