03. GPTQ Quantization
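GPTQ, a post-training quantization method for generative pre-trained transformers, addresses a fundamental problem in quantization: naive round-to-nearest approaches degrade model quality because they quantize each weight in isolation and ignore how the resulting error affects the layer's output. GPTQ instead solves a layer-wise reconstruction problem. It quantizes each weight matrix one column at a time, adjusting the not-yet-quantized weights to compensate, so as to minimize the squared (L2) error between the layer's outputs under quantized and full-precision weights.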
The algorithm processes each weight matrix independently. Working through a matrix column by column, it chooses quantized values that best reconstruct the layer's outputs on a small calibration dataset. Because this optimization happens after training, it requires only the original fp16 model and representative samples, typically 128-512 examples drawn from data resembling the model's training distribution.
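The difference between weight error and output error can be seen in a toy example. The sketch below is not the GPTQ algorithm itself (which uses second-order information to choose the compensation); it only assumes a single row of two weights, a uniform grid with step 0.1, and three hand-picked calibration inputs, all of which are illustrative values rather than anything from a real model.

```python
def round_to_nearest(weights, step):
    """Snap each weight independently to the nearest multiple of `step`."""
    return [round(w / step) * step for w in weights]

def weight_error(w, w_q):
    """Squared L2 distance between original and quantized weights."""
    return sum((a - b) ** 2 for a, b in zip(w, w_q))

def output_error(w, w_q, calib_inputs):
    """Squared error of the dot product w.x, summed over calibration inputs."""
    total = 0.0
    for x in calib_inputs:
        diff = sum((a - b) * xi for a, b, xi in zip(w, w_q, x))
        total += diff ** 2
    return total

# Toy values chosen for illustration only.
w = [0.24, 0.24]
step = 0.1
calib_inputs = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]  # two correlated features

w_rtn = round_to_nearest(w, step)  # both weights round down to 0.2
w_comp = [0.3, 0.2]                # first weight rounded up so the two errors offset

for name, w_q in [("round-to-nearest", w_rtn), ("compensated", w_comp)]:
    print(name, weight_error(w, w_q), output_error(w, w_q, calib_inputs))
```

The compensated choice sits farther from the original weights, yet reproduces the calibration outputs more closely, which is the quantity GPTQ targets.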
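import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

# Configure quantization
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,  # one set of quantization parameters per 128 weights
    desc_act=True,   # activation order for better quality
)

# Load model in fp16; AutoGPTQ takes the quantization config at load time
# and handles device placement itself during quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data
def get_calibration_samples(n_samples=128):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    samples = []
    for item in data:
        if len(samples) >= n_samples:
            break
        if len(item["text"]) > 50:  # skip blank lines and short headings
            # quantize() expects tokenized examples (input_ids, attention_mask),
            # not raw strings
            samples.append(tokenizer(item["text"][:512]))
    return samples

# Quantize and save the quantized weights alongside the tokenizer
model.quantize(get_calibration_samples())
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")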
Critical parameters:
bits: Controls precision. 4-bit offers the best accuracy-per-memory tradeoff. 3-bit requires careful tuning and often underperforms. 2-bit is generally unusable for large models.
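group_size: Defines how many consecutive weights share one set of quantization parameters (a scale and a zero point). Smaller groups (64, 128) preserve quality at the cost of slightly larger models, because more parameters must be stored. Power-of-two values enable efficient kernels.
desc_act: Processes weight columns in order of decreasing activation magnitude during quantization, so the columns with the largest effect on the output are quantized first. This activation-aware ordering typically improves quality by roughly 1-2 perplexity points, depending on model and bit width, but may increase inference latency on some architectures.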
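The storage cost of group_size can be estimated with simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one zero point at the same width as the weights; these are assumptions for illustration, and real checkpoints differ in detail (unquantized embeddings, packing, and any act-order index are ignored). The function names are hypothetical helpers, not part of any library.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Weight bits plus the per-group overhead amortized over the group.

    Assumes one scale (scale_bits) and one zero point (zero_bits) per group.
    """
    if zero_bits is None:
        zero_bits = bits  # assumption: zero point stored at weight precision
    return bits + (scale_bits + zero_bits) / group_size

def quantized_size_gb(n_params, bits, group_size):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * effective_bits_per_weight(bits, group_size) / 8 / 1e9

n_params = 7e9  # nominal parameter count of a 7B model
fp16_gb = n_params * 2 / 1e9
for group_size in (128, 64):
    size = quantized_size_gb(n_params, bits=4, group_size=group_size)
    print(f"group_size={group_size}: {size:.2f} GB, {fp16_gb / size:.2f}x smaller than fp16")
```

Under these assumptions, 4-bit weights cost about 4.16 bits each at group_size=128 and about 4.31 bits at group_size=64, which is why halving the group size makes the model only slightly larger.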
The calibration dataset significantly impacts results. Calibrating on data from the wrong domain (e.g., code samples for a general-purpose model) yields a quantized model that performs worse on its intended inputs. For instruction-tuned models, use instruction-response pairs rather than raw web text.
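For an instruction-tuned model, calibration strings might be assembled along the following lines. This is a minimal sketch: the prompt template, the helper name, and the example pair are all hypothetical, and in practice the template should match the one the model was fine-tuned with. The resulting strings would still need to be tokenized before being used for calibration.

```python
# Hypothetical prompt template; substitute the one your model was tuned with.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def build_calibration_texts(pairs, template=TEMPLATE, max_samples=128):
    """Render instruction-response pairs into calibration strings."""
    texts = []
    for pair in pairs:
        if len(texts) >= max_samples:
            break
        texts.append(
            template.format(
                instruction=pair["instruction"],
                response=pair["response"],
            )
        )
    return texts

# Illustrative pair, not taken from any real dataset.
pairs = [
    {
        "instruction": "Summarize GPTQ in one sentence.",
        "response": "GPTQ quantizes weights after training using calibration data.",
    },
]
texts = build_calibration_texts(pairs)
print(texts[0])
```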
Peak memory during quantization can reach 2-3x the size of the fp16 model, far more than the final quantized model occupies. Budget accordingly when planning the conversion pipeline.
# Expected memory during quantization of 70B model
# ~140GB for fp16 model + ~200GB workspace
# Total: ~340GB VRAM or system RAM
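The arithmetic behind that estimate can be written out as a small helper. The fp16 footprint follows directly from two bytes per parameter; the workspace figure is taken as an input because it depends on the tooling and settings, so the 200 GB used here is simply the value quoted above, not something the code derives. Function names are hypothetical.

```python
def fp16_size_gb(n_params):
    """fp16 stores each parameter in 2 bytes (1 GB = 1e9 bytes)."""
    return n_params * 2 / 1e9

def peak_budget_gb(n_params, workspace_gb):
    """fp16 weights plus an externally estimated quantization workspace."""
    return fp16_size_gb(n_params) + workspace_gb

n_params = 70e9
model_gb = fp16_size_gb(n_params)
total_gb = peak_budget_gb(n_params, workspace_gb=200)  # workspace figure from the text
print(f"fp16 model: {model_gb:.0f} GB, peak budget: {total_gb:.0f} GB "
      f"({total_gb / model_gb:.1f}x the fp16 size)")
```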
Quantize a 7B model twice: once with bits=4, group_size=128 and once with bits=4, group_size=64. Measure the perplexity difference between the two on a held-out dataset, and calculate the actual memory reduction relative to fp16 for each configuration.
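For the perplexity comparison, recall that perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you have already collected the natural-log probability the model assigned to each held-out token (how you obtain these depends on your evaluation setup and is not shown):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over all evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Sanity check: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_logprobs = [math.log(0.25)] * 8
print(perplexity(uniform_logprobs))
```

Compute this over the same held-out tokens for both quantized models so that the difference reflects only the group_size setting.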