Mixed Precision
Mixed precision is a technique that uses different numerical precisions (e.g., FP16 and FP32) for different parts of a model during inference or training. In local AI, it reduces VRAM usage and increases throughput by storing weights and activations in lower precision (like FP16 or INT8) while keeping critical operations (e.g., accumulation) in higher precision (FP32) to maintain accuracy. Operators encounter it when selecting quantization levels or enabling automatic mixed precision (AMP) in frameworks like PyTorch or llama.cpp, where it can double tokens/sec on GPUs with Tensor Cores.
Deeper dive
Mixed precision exploits hardware support for lower-precision arithmetic (e.g., NVIDIA Tensor Cores for FP16, INT8, INT4) to accelerate matrix multiplications and reduce memory bandwidth. During inference, weights are stored in FP16 or INT8, cutting VRAM usage roughly in half compared to FP32. Activations may also be computed in lower precision, but loss scaling or partial FP32 accumulation prevents underflow/overflow. In training, mixed precision is standard: forward/backward passes use FP16, while master weights and optimizer states remain in FP32. Frameworks like PyTorch offer torch.cuda.amp for automatic mixed precision. For local AI, mixed precision is effectively what quantization achieves (e.g., Q4_K_M in llama.cpp), though quantization uses integer formats rather than floating-point. Operators benefit from faster inference and larger models fitting in VRAM, at the cost of slight accuracy degradation (often negligible).
Practical example
An RTX 3090 with 24 GB VRAM can load Llama 3.1 70B at Q4_K_M (~40 GB) only via system-RAM offload, yielding ~2 tok/s. Using mixed precision with FP16 weights (no quantization) would require ~140 GB — impossible. But for smaller models like Llama 3.1 8B, FP16 uses ~16 GB, fitting on the 3090. Enabling mixed precision (FP16 compute, FP32 accumulation) via llama.cpp's --memory-f32 flag or PyTorch AMP can increase throughput from ~30 to ~50 tok/s on a 3090.
Workflow example
In llama.cpp, mixed precision is controlled by the --memory-f32 and --type flags. For example, ./main -m model.gguf --memory-f32 0 uses FP16 for KV cache, reducing VRAM usage. In Hugging Face Transformers, operators enable mixed precision inference with model.half() or by loading with torch_dtype=torch.float16. In training, PyTorch AMP is activated via with torch.cuda.amp.autocast(): and scaler.scale(loss).backward(). LM Studio and Ollama abstract this, but the underlying runtime (llama.cpp) applies mixed precision automatically based on the GGUF quantization type.
Reviewed by Fredoline Eruo. See our editorial policy.