Hardware & infrastructure

FP16

FP16 (16-bit floating point) is a number format that uses 16 bits per weight or activation, balancing precision and memory. In local AI, FP16 is the standard precision for loading models in GPU VRAM because it halves memory usage compared to FP32 (32-bit) while retaining enough accuracy for inference. Operators encounter FP16 when choosing model precision: a 7B parameter model in FP16 occupies ~14 GB, fitting on a 16 GB GPU but not on an 8 GB one. Quantization to lower bit widths (e.g., 4-bit) further reduces memory at the cost of some quality.

Deeper dive

FP16 follows the IEEE 754 standard with 1 sign bit, 5 exponent bits, and 10 mantissa bits. Its dynamic range (~65,504 max value) is sufficient for neural network weights, but training can suffer from gradient underflow. For inference, FP16 is the default in many runtimes (llama.cpp, vLLM) because GPUs have dedicated FP16 tensor cores that double throughput compared to FP32. Operators should note that FP16 models require roughly 2 bytes per parameter, so a 13B model needs ~26 GB VRAM. When VRAM is tight, quantized formats (Q4_K_M, Q5_1) are preferred, but FP16 remains the reference for quality comparisons.

Practical example

A 7B parameter Llama 3 model in FP16 consumes about 14 GB of VRAM. On an RTX 4090 (24 GB), this leaves 10 GB for context and overhead, allowing a 32K token context window. On an RTX 3060 (12 GB), the same model would exceed VRAM, forcing system-RAM offload or quantization to 4-bit (5 GB).

Workflow example

In llama.cpp, you can specify FP16 precision with --memory-f32 0 or by loading a .gguf file that uses FP16 tensors. In Ollama, pulling a model like llama3.1:8b downloads a Q4_K_M quantized file by default; to get FP16, you'd need a custom Modelfile or a separate FP16 GGUF. In Hugging Face Transformers, model.half() converts a model to FP16 before loading onto a GPU.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

Best GPU for local AI →

When it doesn't work