RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co. Independently operated.
Hardware & infrastructure

FP16

FP16 (16-bit floating point, also called half precision) is a number format that stores each weight or activation in 16 bits, trading some precision for memory. In local AI, FP16 is the usual unquantized precision for holding model weights in GPU VRAM because it halves memory use compared to FP32 (32-bit) while retaining enough accuracy for inference. Operators encounter FP16 when choosing model precision: the weights of a 7B-parameter model in FP16 occupy ~14 GB, which fits on a 16 GB GPU with little room left for context, but not on an 8 GB one. Quantization to lower bit widths (e.g., 4-bit) reduces memory further at the cost of some quality.
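The memory figures above come from simple arithmetic: parameter count times bits per weight. A minimal sketch of that calculation follows; the function name is illustrative, and it counts weights only, ignoring context (KV cache) and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Compare the same 7B-parameter model at three precisions.
for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"7B @ {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantized files also store per-block scale data, so they come out somewhat larger than the flat 4-bit figure suggests.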

Deeper dive

FP16 is the IEEE 754 binary16 format: 1 sign bit, 5 exponent bits, and 10 mantissa bits. Its largest finite value is 65,504 and its smallest positive (subnormal) value is roughly 6 × 10^-8. That range is sufficient for neural network weights, but in training, small gradients can underflow to zero. For inference, FP16 is a common default in runtimes such as llama.cpp and vLLM: GPU tensor cores run FP16 matrix math substantially faster than FP32, and 2-byte weights halve the memory traffic per generated token. Operators should note that FP16 models require 2 bytes per parameter for weights alone, so a 13B model needs ~26 GB of VRAM before context and overhead. When VRAM is tight, quantized formats (Q4_K_M, Q5_1) are preferred, but FP16 remains the reference for quality comparisons.
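The bit layout and range limits can be checked with Python's standard `struct` module, whose `"e"` format code is IEEE 754 half precision. This is a small sketch for illustration; the helper names are hypothetical and not taken from any runtime.

```python
import struct


def fp16_fields(bits: int) -> tuple[int, int, int]:
    """Split a 16-bit pattern into (sign, exponent, mantissa) fields."""
    return (bits >> 15) & 0x1, (bits >> 10) & 0x1F, bits & 0x3FF


def fp16_value(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(fp16_fields(0x7BFF), fp16_value(0x7BFF))  # largest finite FP16 value
print(fp16_value(0x0001))                        # smallest positive subnormal
print(round_to_fp16(1e-8))                       # below the FP16 range: becomes 0.0
print(round_to_fp16(0.1))                        # shows the limited mantissa precision
```

The third line is the underflow case: a value as small as 1e-8, plausible for a gradient, cannot be represented in FP16 and rounds to zero.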

Practical example

Llama 3.1 8B in FP16 consumes about 16 GB of VRAM for weights alone. On an RTX 4090 (24 GB), this leaves roughly 8 GB for context and overhead. With an FP16 KV cache of about 128 KiB per token, that is enough for a 32K-token context window (~4 GB of cache). On an RTX 3060 (12 GB), the same model would exceed VRAM, forcing system-RAM offload or quantization to 4-bit (~5 GB).
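A rough way to run this kind of budget check is to add weights, KV cache, and a fixed overhead, then compare against VRAM. The sketch below assumes a Llama 3.1 8B-style shape (32 layers, 8 KV heads, head dimension 128, 8e9 parameters) and a 1 GB overhead allowance. These values are assumptions for illustration, not measurements, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def fits(vram_gb: float, n_params: float, context_tokens: int,
         per_token_bytes: int, overhead_gb: float = 1.0) -> tuple[bool, float]:
    """Return (fits, needed_gb) for FP16 weights plus an FP16 KV cache."""
    weights_gb = n_params * 2 / 1e9              # 2 bytes per FP16 parameter
    kv_gb = context_tokens * per_token_bytes / 1e9
    needed_gb = weights_gb + kv_gb + overhead_gb  # overhead_gb is an assumed allowance
    return needed_gb <= vram_gb, needed_gb


# Assumed architecture values (Llama 3.1 8B-style).
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

for name, vram in [("RTX 4090", 24), ("RTX 3060", 12)]:
    ok, needed = fits(vram, n_params=8e9, context_tokens=32 * 1024,
                      per_token_bytes=per_token)
    print(f"{name}: need ~{needed:.1f} GB of {vram} GB -> {'fits' if ok else 'does not fit'}")
```

Actual usage varies by runtime, batch size, and allocator behavior, so treat the result as a first-pass estimate.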

Workflow example

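In llama.cpp, weight precision is determined by the GGUF file you load, so running in FP16 means loading a .gguf file that uses FP16 tensors (for example, one produced by convert_hf_to_gguf.py with --outtype f16). The KV cache type is set separately with --cache-type-k and --cache-type-v and defaults to f16. In Ollama, pulling a model like llama3.1:8b downloads a Q4_K_M quantized file by default; to get FP16, you'd need an explicit FP16 tag where the library publishes one, or a custom Modelfile pointing at a separate FP16 GGUF. In Hugging Face Transformers, passing torch_dtype=torch.float16 to from_pretrained loads a model in FP16, and model.half() casts an already-loaded model to FP16 before it is moved to a GPU.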

Reviewed by Fredoline Eruo. See our editorial policy.
