RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Model Optimization for Local Inference

03. GPTQ Quantization

Chapter 3 of 18 · 15 min
KEY INSIGHT

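GPTQ's column-by-column optimization preserves model capability far better than naive round-to-nearest quantization because it uses calibration data to measure how much each weight affects the layer's output and adjusts the remaining weights to absorb each rounding error.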

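GPTQ (introduced in the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers") addresses a fundamental problem in quantization: naive approaches such as round-to-nearest degrade model quality because they round every weight independently and ignore how the errors combine. GPTQ instead solves a layer-wise reconstruction problem. It quantizes a weight matrix one column at a time and adjusts the not-yet-quantized columns to compensate, minimizing the squared (L2) error between the layer's outputs under quantized and full-precision weights.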

The algorithm processes each weight matrix independently. Working through the matrix column by column, it chooses quantized values that best reconstruct the layer's outputs on a small calibration dataset. This post-hoc optimization requires the original fp16 model and representative samples, typically 128-512 examples from the model's training distribution.
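The column-by-column procedure described above can be sketched in NumPy. This is a simplified illustration, not the AutoGPTQ implementation: it uses one quantization grid per row (no groups), a plain matrix inverse instead of the numerically more robust factorization that production code uses, and synthetic correlated inputs in place of real calibration activations. The function names (`make_grid`, `quantize_column`, `rtn_quantize`, `gptq_quantize`) and the toy dimensions are assumptions made for this example.

```python
import numpy as np

def make_grid(W, bits=4):
    """Per-row asymmetric uniform grid (scale, zero point) from the original weights."""
    maxq = 2**bits - 1
    w_min, w_max = W.min(axis=1), W.max(axis=1)
    scale = (w_max - w_min) / maxq
    zero = np.round(-w_min / scale)
    return scale, zero, maxq

def quantize_column(w, scale, zero, maxq):
    """Round one column (one entry per row) to the nearest grid point."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)

def rtn_quantize(W, bits=4):
    """Baseline: round every weight to nearest, with no error compensation."""
    scale, zero, maxq = make_grid(W, bits)
    cols = [quantize_column(W[:, i], scale, zero, maxq) for i in range(W.shape[1])]
    return np.stack(cols, axis=1)

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Quantize W column by column, pushing each rounding error onto later columns.

    W: (rows, cols) weights. X: (cols, n_samples) calibration inputs to the layer.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    scale, zero, maxq = make_grid(W, bits)
    H = 2.0 * X @ X.T                                # Hessian of the squared output error
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # damping keeps H well conditioned
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(cols):
        Q[:, i] = quantize_column(W[:, i], scale, zero, maxq)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Adjust the not-yet-quantized columns to compensate for this column's error.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian before moving on.
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q

# Toy data: input features share a common component, so they are correlated.
rng = np.random.default_rng(0)
rows, cols, n = 32, 16, 256
X = rng.normal(size=(1, n)) + 0.3 * rng.normal(size=(cols, n))
W = rng.normal(size=(rows, cols))

def output_error(Q):
    """Squared error between full-precision and quantized layer outputs."""
    return float(np.sum((W @ X - Q @ X) ** 2))

Q_rtn = rtn_quantize(W)
Q_gptq = gptq_quantize(W, X)
rtn_err, gptq_err = output_error(Q_rtn), output_error(Q_gptq)
print(f"round-to-nearest output error: {rtn_err:.2f}")
print(f"GPTQ-style output error:       {gptq_err:.2f}")
```

Both methods use the same 4-bit grid; the only difference is the compensation step. The gain depends on how correlated the inputs are: with uncorrelated inputs there is little for the remaining columns to absorb.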

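import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

# Configure quantization (AutoGPTQ needs the config when the model is loaded)
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,  # each group of 128 weights shares one scale and zero point
    desc_act=True,   # activation order for better quality
)

# Load the unquantized model in fp16
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    torch_dtype=torch.float16,
)

# Prepare calibration data. quantize() expects tokenized examples:
# a list of dicts with "input_ids" and "attention_mask".
def get_calibration_samples(n_samples=128, max_length=512):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    samples = []
    for item in data:
        if len(samples) >= n_samples:
            break
        if len(item["text"]) > 50:  # skip blank lines and short headings
            enc = tokenizer(item["text"], truncation=True, max_length=max_length)
            samples.append(
                {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
            )
    return samples

# Quantize, then save the quantized weights and the tokenizer together
model.quantize(get_calibration_samples())
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")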

Critical parameters:

bits: Controls precision. 4-bit offers the best accuracy-per-memory tradeoff. 3-bit requires careful tuning and often underperforms. 2-bit is generally unusable for large models.

group_size: Defines how many consecutive weights share one set of quantization parameters (a scale and a zero point). Smaller groups (64, 128) preserve quality at the cost of slightly larger models, because more parameters must be stored. Power-of-two values enable efficient kernels.

desc_act: Processes weight columns in order of decreasing activation magnitude during quantization ("act-order"). This activation-aware ordering can improve quality by up to 1-2 perplexity points, but it may hurt latency on some architectures.
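To make the bits and group_size tradeoff concrete, the storage overhead can be estimated with a few lines of arithmetic. This is a rough sketch under stated assumptions: each group stores one fp16 scale (16 bits) and one zero point packed at the same width as the weights, and unquantized tensors (embeddings, output head, norms) are ignored. The function names are hypothetical, and real checkpoints will differ somewhat.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16):
    """Average storage per weight, including per-group scale and zero point.

    Assumption: one scale of `scale_bits` bits and one zero point of `bits`
    bits are stored for every `group_size` weights.
    """
    return bits + (scale_bits + bits) / group_size

def quantized_size_gb(n_params, bits, group_size):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * effective_bits_per_weight(bits, group_size) / 8 / 1e9

n_params = 7e9                      # a 7B-parameter model
fp16_gb = n_params * 2 / 1e9        # 2 bytes per weight in fp16
for group_size in (64, 128):
    bpw = effective_bits_per_weight(4, group_size)
    size = quantized_size_gb(n_params, 4, group_size)
    print(f"group_size={group_size}: {bpw:.4f} bits/weight, "
          f"~{size:.2f} GB vs {fp16_gb:.0f} GB fp16")
```

Under these assumptions, halving the group size from 128 to 64 doubles the per-group overhead but adds only a fraction of a bit per weight.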

The calibration dataset significantly impacts results. Using prompts from the wrong domain (e.g., code samples for a general-purpose model) produces worse quantization. For instruction-tuned models, use instruction-response pairs rather than raw web text.
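For an instruction-tuned model, the calibration texts can be built from instruction-response pairs along these lines. This is a minimal sketch: the helper name `build_calibration_texts` and the Alpaca-style template are assumptions for illustration. In practice the template should match the prompt format the model was fine-tuned with, and the resulting strings still need to be tokenized before quantization.

```python
# Hypothetical prompt template; replace it with the model's own chat format.
DEFAULT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def build_calibration_texts(pairs, template=DEFAULT_TEMPLATE,
                            max_samples=128, min_chars=50):
    """Format (instruction, response) pairs into calibration strings.

    Very short samples are skipped, and collection stops at max_samples.
    """
    texts = []
    for instruction, response in pairs:
        text = template.format(
            instruction=instruction.strip(),
            response=response.strip(),
        )
        if len(text) < min_chars:
            continue
        texts.append(text)
        if len(texts) >= max_samples:
            break
    return texts

example_pairs = [
    ("Summarize the benefits of 4-bit quantization.",
     "It cuts weight memory to roughly a quarter of fp16 with a small quality loss."),
    ("Hi", "Yo"),  # too short to be a useful calibration sample
    ("Explain what group_size controls.",
     "It sets how many weights share one scale and zero point."),
]
calibration_texts = build_calibration_texts(example_pairs)
print(len(calibration_texts), "calibration samples")
```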

Peak memory during quantization is far larger than the final quantized model and can reach 2-3x the size of the fp16 model. Budget accordingly when planning the conversion pipeline.

# Expected memory during quantization of 70B model
# ~140GB for fp16 model + ~200GB workspace
# Total: ~340GB VRAM or system RAM
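The estimate in the comment block above can be reproduced with a small helper, which makes it easy to rerun for other model sizes. This is a sketch only: the function name is hypothetical, and the workspace figure is taken from the estimate above as an input rather than derived, since it depends on the tooling and settings.

```python
def quantization_memory_gb(n_params, workspace_gb, bytes_per_param=2):
    """Return (fp16 model size, total peak memory) in GB (1 GB = 1e9 bytes).

    workspace_gb is an input assumption, not something this helper computes.
    """
    model_gb = n_params * bytes_per_param / 1e9
    return model_gb, model_gb + workspace_gb

model_gb, total_gb = quantization_memory_gb(70e9, workspace_gb=200)
print(f"fp16 model: ~{model_gb:.0f} GB")
print(f"total:      ~{total_gb:.0f} GB ({total_gb / model_gb:.1f}x the fp16 model)")
```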
EXERCISE

Quantize a 7B model twice: once with bits=4, group_size=128 and once with bits=4, group_size=64. Measure the perplexity difference on a held-out dataset. Calculate the actual memory reduction for each configuration.
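For the perplexity comparison, the metric itself is simple once per-token losses are available: perplexity is the exponential of the mean negative log-likelihood per token. The helper below is a minimal sketch. It assumes the per-token negative log-likelihoods (natural log) for the held-out text have already been collected from each model, and the loss values in the usage example are made up purely to show the call.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Illustrative values only, not measurements from any real model.
nlls_g128 = [1.70, 1.82, 1.75, 1.79]
nlls_g64 = [1.69, 1.80, 1.74, 1.78]
delta = perplexity(nlls_g128) - perplexity(nlls_g64)
print(f"perplexity difference (g128 - g64): {delta:.4f}")
```

Both configurations must be scored on exactly the same tokens for the difference to be meaningful.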
