01. Why Optimize?
Deploying large language models locally exposes an uncomfortable reality: how good a model is and whether it runs at a usable speed on your hardware are different questions. A 70B-parameter model loaded in fp16 needs roughly 140GB of GPU memory for its weights alone (70 billion parameters at 2 bytes each). Consumer GPUs typically offer 24GB or less, and even an 80GB data-center card falls short. Without optimization, that model simply does not run.
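The 140GB figure is parameter count multiplied by bytes per parameter. A minimal sketch of that arithmetic follows; it counts weights only (the KV cache, activations, and framework overhead come on top), treats 1GB as 10^9 bytes, and the function name `weight_memory_gb` is illustrative rather than taken from any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Compare a 70B-parameter model across common weight precisions.
    for bits in (32, 16, 8, 4):
        print(f"70B at {bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Swapping in your own parameter count and bit width gives a quick lower bound on the memory a model needs before you try to load it.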
The optimization landscape splits into two primary concerns: memory reduction and computation acceleration. Quantization attacks the memory problem by reducing weight precision from 32-bit or 16-bit floats to 4, 3, or even 2 bits. Speculative decoding accelerates autoregressive generation by letting a small draft model cheaply propose several tokens ahead, which the full model then verifies together in a single pass. The expensive model still checks every token, but it runs far fewer sequential steps.
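To make the quantization idea concrete, here is a toy sketch of symmetric round-to-nearest quantization to 4 bits with a single scale for the whole tensor. It is an illustration under simplifying assumptions, not the algorithm used by any particular library: production schemes are more elaborate and typically use per-group scales. The function names are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Choose the scale so the largest magnitude lands on code 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    original = [0.7, -0.3, 0.0, 0.12]
    codes, scale = quantize_int4(original)
    restored = dequantize(codes, scale)
    for w, c, r in zip(original, codes, restored):
        print(f"{w:+.3f} -> code {c:+d} -> {r:+.3f}")
```

Each weight is now stored as a 4-bit code plus a shared scale, which is where the memory saving comes from; the price is a rounding error of at most half a scale step per weight.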
The financial case is equally compelling. Cloud GPU instances at $2-3 per hour add up quickly. A development workflow that needs 20 hours of inference per week costs $40-60 weekly, or roughly $170-260 per month. That same workload on optimized local hardware costs only electricity, typically under $10 per month at this level of usage.
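The comparison is easy to rerun with your own numbers. The sketch below assumes an average month of 52/12 weeks; the 350W draw and $0.15 per kWh used in the example are illustrative assumptions, not figures from this chapter, and hardware purchase cost is deliberately left out.

```python
WEEKS_PER_MONTH = 52 / 12  # about 4.33


def monthly_cloud_cost(hours_per_week: float, usd_per_hour: float) -> float:
    """Monthly rental cost for a cloud GPU billed by the hour."""
    return hours_per_week * WEEKS_PER_MONTH * usd_per_hour


def monthly_electricity_cost(hours_per_week: float, watts: float, usd_per_kwh: float) -> float:
    """Monthly electricity cost for local hardware drawing `watts` under load."""
    kwh = hours_per_week * WEEKS_PER_MONTH * watts / 1000
    return kwh * usd_per_kwh


if __name__ == "__main__":
    for rate in (2.0, 3.0):
        print(f"cloud at ${rate:.0f}/h: ${monthly_cloud_cost(20, rate):.0f} per month")
    # Assumed values: 350 W under load, $0.15 per kWh.
    print(f"local electricity: ${monthly_electricity_cost(20, 350, 0.15):.2f} per month")
```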
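Consider the practical bottleneck. As context grows, the attention mechanism takes an increasing share of latency: attention over an n-token sequence costs O(n²), so a 4096-token context involves about 16.8 million query-key pairs per head in each layer. During token-by-token generation the GPU must also stream the model weights and the cached keys and values from memory for every new token, which is why optimization techniques that reduce memory bandwidth requirements translate directly into lower latency.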
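A rough back-of-envelope sketch of both effects follows. It assumes single-stream decoding that reads every weight once per token and ignores KV-cache traffic and compute time, so the result is an optimistic ceiling rather than a prediction. The 936 GB/s figure is the published memory bandwidth of an RTX 3090, used here as an example; the function names are illustrative.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's full n x n attention score matrix."""
    return seq_len * seq_len


def decode_tokens_per_second_bound(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ceiling on decode speed if each token requires one full read of the weights."""
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    print(f"score entries at 4096 tokens: {attention_score_entries(4096):,}")
    bandwidth = 936e9  # assumed: RTX 3090 memory bandwidth in bytes per second
    for label, weight_bytes in (("7B fp16", 14e9), ("7B 4-bit", 3.5e9)):
        bound = decode_tokens_per_second_bound(weight_bytes, bandwidth)
        print(f"{label}: at most about {bound:.0f} tokens/s")
```

Under these assumptions, cutting the bytes per weight by a factor of four raises the ceiling by the same factor, which is the sense in which less memory traffic means lower latency.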
Failure modes to anticipate:
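# CUDA out of memory when moving an unoptimized model onto a 24GB GPU
python -c "import transformers; model = transformers.AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1').to('cuda')"
# If the weights load in fp32 (no dtype requested), ~7.2B parameters x 4 bytes is ~29GB, which exceeds the 24GB of an RTX 3090; fp16 would need ~14.5GB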
Understanding where time goes matters more than memorizing solutions. Profile first, optimize second. Tools like nvidia-smi dmon, torch.profiler, and model-specific benchmarking scripts reveal whether latency originates from GPU compute, memory transfer, or attention overhead.
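For logging memory and utilization alongside a run, one option is to poll `nvidia-smi` through its CSV query mode and parse the result. The sketch below assumes `nvidia-smi` is on the PATH and that both fields report numeric values (some configurations print `[N/A]` instead, which this parser does not handle). The helper names are hypothetical.

```python
import subprocess

# Query mode prints one CSV line per GPU; nounits drops the "MiB" and "%" suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Turn lines such as '10240, 87' into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        mem, util = (field.strip() for field in line.split(","))
        stats.append({"memory_used_mib": int(mem), "utilization_pct": int(util)})
    return stats


def sample_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed reading for every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `sample_gpu_stats()` in a loop from a second terminal while the model generates gives a simple trace of peak memory and utilization.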
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Run nvidia-smi during inference with an unoptimized model. Note peak memory usage and GPU utilization. Compare generation speed at different sequence lengths (64, 256, 1024 tokens).
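A small timing harness makes the sequence-length comparison repeatable. In this sketch `fake_generate` is a hypothetical stand-in that does throwaway CPU work; replace it with a call into your own model. When timing real GPU inference with PyTorch, synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock, because CUDA kernels launch asynchronously.

```python
import time


def tokens_per_second(generate, n_tokens: int, repeats: int = 3) -> float:
    """Best-of-N wall-clock throughput for a callable generate(n_tokens)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate(n_tokens)
        best = min(best, time.perf_counter() - start)
    return n_tokens / best


def fake_generate(n_tokens: int) -> int:
    # Hypothetical stand-in for a model call: fixed dummy work per token.
    total = 0
    for _ in range(n_tokens):
        total += sum(range(200))
    return total


if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(f"{n:>5} tokens: {tokens_per_second(fake_generate, n):,.0f} tokens/s")
```

Record the numbers for each length next to the package versions and hardware noted above so later optimized runs can be compared against the same baseline.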