FP8
FP8 (Floating Point 8) is an 8-bit floating-point number format used in AI inference and training to reduce memory and computation requirements. It stores numbers with a sign bit, exponent bits, and mantissa bits, offering a trade-off between range and precision. Two common variants exist: E4M3 (4 exponent bits, 3 mantissa bits) for weights and activations, and E5M2 (5 exponent bits, 2 mantissa bits) for gradients. FP8 enables faster matrix multiplications on compatible hardware (e.g., NVIDIA H100, RTX 4090 with limited support) by halving memory bandwidth compared to FP16, allowing larger models or batch sizes within the same VRAM budget.
Deeper dive
FP8 is part of a trend toward lower-precision arithmetic in neural networks, following FP16 and BF16. The format is standardized by NVIDIA and others, with hardware support starting from Hopper (H100) and Ada Lovelace (RTX 40 series) GPUs. In practice, FP8 inference requires careful quantization to avoid accuracy loss; models are typically calibrated using a small dataset to determine scaling factors. FP8 can be used for both weights and activations, but many operators still rely on FP16 or INT8 due to broader hardware compatibility. For local AI, FP8 is most relevant when running large models (e.g., 70B) on high-end consumer GPUs like the RTX 4090, where VRAM is limited to 24 GB. Using FP8 can reduce memory usage by ~50% compared to FP16, potentially fitting a model that would otherwise require offloading. However, not all inference engines support FP8; llama.cpp added experimental FP8 support in 2024, while vLLM supports it for H100. Operators should verify their hardware and software stack before relying on FP8.
Practical example
An RTX 4090 has 24 GB VRAM. Running Llama 3.1 70B at FP16 requires ~140 GB, impossible without offloading. At FP8, the same model requires ~70 GB, still too large. But for a 13B model, FP16 uses ~26 GB (exceeds VRAM), while FP8 uses ~13 GB, fitting comfortably. This allows the operator to run the 13B model entirely on GPU without offloading, achieving higher tokens/sec.
Workflow example
In llama.cpp, you can enable FP8 by compiling with -DLLAMA_FP8=ON and then using a quantized model in FP8 format (e.g., llama.cpp/models/llama-13b-fp8.gguf). When loading, the runtime will allocate half the VRAM compared to FP16. In LM Studio, FP8 support may appear as a quantization option in the model loader. Operators should check the model card for FP8 availability and benchmark tokens/sec against FP16 or Q4_0 quantizations to evaluate the trade-off.
Reviewed by Fredoline Eruo. See our editorial policy.