Hardware & infrastructure

FP32

FP32 (32-bit floating point) is a numerical format that uses 32 bits to represent each model weight, offering high precision at the cost of large memory usage. In local AI, FP32 is the traditional format for training and serves as the reference for model accuracy. For inference on consumer hardware, however, FP32 is rarely used because it requires 4 bytes per parameter: a 7B model needs ~28 GB of VRAM for weights alone, exceeding most consumer GPUs. Operators typically run models at lower precision (e.g., FP16, INT8, or 4-bit quantization) to fit into available VRAM, accepting a usually minor accuracy loss in exchange for a smaller footprint and faster inference.
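The bytes-per-parameter arithmetic can be sketched in a few lines of Python. This is an illustrative estimate only: the function name and the format table are made up for this example, the figures cover weights alone (no KV cache or activations), and the 4-bit entry ignores the per-block metadata that real quantized files carry.

```python
# Approximate bytes per parameter for common weight formats.
# Assumption: the 4-bit figure ignores per-block scale metadata,
# so real quantized files come out somewhat larger.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    # Compare the footprint of a 7B-parameter model across formats.
    for fmt in BYTES_PER_PARAM:
        print(f"7B in {fmt}: {weight_memory_gb(7e9, fmt):.1f} GB")
```

Halving the bit width halves the weight footprint, which is why the choice of format decides which model sizes a given GPU can hold.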

Deeper dive

FP32 follows the IEEE 754 single-precision standard, with 1 sign bit, 8 exponent bits, and 23 mantissa bits. Normal values span magnitudes from ~1.2e-38 to ~3.4e38 (subnormals extend down to ~1.4e-45), with about 7 decimal digits of precision. In deep learning, FP32 is the traditional default for training because its dynamic range and precision help avoid gradient underflow. For inference, most runtimes (llama.cpp, Ollama, vLLM) load models in FP16, BF16, or quantized formats rather than FP32. Some frameworks (e.g., MLX on Apple Silicon) natively use FP16 or BF16. The operator-relevant point: running a model in FP32 on a 24 GB RTX 4090 limits you to fewer than ~6B parameters (24 GB ÷ 4 bytes, before KV cache and activations), whereas 4-bit quantization at roughly 0.5 bytes per parameter fits a model in the ~30B class in the same VRAM. A 70B model at 4-bit needs about 35 GB or more for weights alone, so it does not fit on a single 24 GB card.
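To make the bit layout concrete, here is a minimal Python sketch that splits a value into its FP32 sign, exponent, and mantissa fields using only the standard library. The helper names are hypothetical; the sketch relies on Python's struct module to round a 64-bit float to single precision.

```python
import struct


def fp32_fields(value: float) -> tuple:
    """Round value to FP32 and return (sign, biased_exponent, mantissa)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits (the leading 1 is implicit)
    return sign, exponent, mantissa


def round_to_fp32(value: float) -> float:
    """Return the nearest FP32-representable value as a Python float."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


if __name__ == "__main__":
    print(fp32_fields(1.0))    # sign 0, exponent 127 (bias), mantissa 0
    print(round_to_fp32(0.1))  # differs from 0.1 after about 7 digits
```

The 23-bit mantissa is what limits FP32 to about 7 decimal digits: integers above 2^24 can no longer all be represented exactly.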

Practical example

A 7B parameter model in FP32 requires 7e9 × 4 bytes = 28 GB of VRAM for its weights. An RTX 4090 has 24 GB, so it cannot hold the model in FP32 without offloading part of it to system RAM, which can cut throughput from the ~100 tokens/sec range to ~5. Quantized to 4-bit (Q4_K_M), the same model uses ~5 GB, fits entirely in VRAM, and runs at ~80 tok/s. These throughput figures are rough and vary with hardware, context length, and runtime.
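The fit check above can be expressed as a small Python sketch. The function and its overhead parameter are hypothetical, and the 0.6 bytes per parameter used for Q4_K_M is an assumed average (4-bit weights plus block metadata), not an exact figure.

```python
def fits_in_vram(num_params, bytes_per_param, vram_gb, overhead_gb=0.0):
    """Return (fits, needed_gb) for a model's weights plus a fixed overhead.

    overhead_gb is a placeholder for KV cache, activations, and runtime
    buffers; the right value depends on context length and runtime.
    """
    needed_gb = num_params * bytes_per_param / 1e9 + overhead_gb
    return needed_gb <= vram_gb, needed_gb


if __name__ == "__main__":
    # FP32: 4 bytes per parameter on a 24 GB card.
    print(fits_in_vram(7e9, 4.0, 24.0))
    # Q4_K_M: assumed ~0.6 bytes per parameter, plus 1 GB of assumed overhead.
    print(fits_in_vram(7e9, 0.6, 24.0, overhead_gb=1.0))
```

Under these assumptions the FP32 case needs 28 GB and fails the check, while the quantized case lands near 5 GB and passes with room to spare.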

Workflow example

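When you run llama-cli -m model.gguf with a Q4_K_M quantized model, the runtime loads weights stored mostly as block-wise 4-bit integers with per-block scale factors. If you instead download an FP32 checkpoint from Hugging Face and load it with transformers at full precision, you'll likely hit an out-of-memory error on most consumer GPUs for models of around 7B parameters and up. With llama.cpp, FP32 checkpoints are handled before inference: the conversion script (convert.py in older releases, convert_hf_to_gguf.py in newer ones) writes a GGUF file, usually in F16, and the separate llama-quantize tool then produces the Q4_K_M file. Operators who download pre-quantized GGUF files therefore rarely interact with FP32 directly during inference.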
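To give a feel for what "weights stored as 4-bit integers" means, here is a deliberately simplified Python sketch of block quantization with one shared scale per block. It is not llama.cpp's actual Q4_K_M scheme, which uses a more elaborate block structure; the function names and the symmetric [-8, 7] mapping are assumptions made for illustration.

```python
def quantize_block(weights):
    """Map a block of floats to signed 4-bit integers in [-8, 7].

    Simplified symmetric scheme: one float scale shared by the whole block.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the 4-bit signed range.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate float weights from the scale and 4-bit codes."""
    return [scale * v for v in q]


if __name__ == "__main__":
    block = [0.7, -0.2, 0.05, 0.0]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    for original, approx in zip(block, restored):
        print(f"{original:+.3f} -> {approx:+.3f}")
```

Each weight is recovered only to within half a quantization step, which is the source of the small accuracy loss that quantized models trade for their much smaller memory footprint.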

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work