RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

FP32

FP32 (32-bit floating point) is a numerical format that stores each value, such as a model weight, in 32 bits (4 bytes), offering high precision at the cost of large memory usage. In local AI, FP32 is the traditional default for training and serves as the reference for model accuracy. For inference on consumer hardware, however, FP32 is rarely used: at 4 bytes per parameter, a 7B model needs ~28 GB of VRAM for the weights alone, which exceeds most consumer GPUs. Operators typically convert models to lower precision (e.g., FP16) or quantize them (e.g., INT8 or 4-bit) to fit into available VRAM, accepting a minor accuracy loss in exchange for a smaller footprint and faster inference.

Deeper dive

FP32 follows the IEEE 754 single-precision standard, with 1 sign bit, 8 exponent bits, and 23 mantissa bits. It can represent positive magnitudes from ~1.4e-45 (the smallest subnormal; the smallest normal value is ~1.2e-38) up to ~3.4e38, with about 7 decimal digits of precision. In deep learning, FP32 is the traditional default for training because its dynamic range and precision make gradient underflow unlikely. For inference, most runtimes (llama.cpp, Ollama, vLLM) load weights in FP16, BF16, or a quantized format rather than FP32. Some frameworks (e.g., MLX on Apple Silicon) natively use FP16 or BF16. The operator-relevant point: running a model in FP32 on a 24 GB RTX 4090 limits you to at most a ~6B parameter model (24 GB ÷ 4 bytes per parameter, before any KV cache or activation overhead), whereas 4-bit quantization fits roughly a 30B-class model in the same VRAM. A 70B model at 4 bits still needs ~35 GB for weights alone, so it does not fit in 24 GB without offloading.
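The bit layout described above can be inspected directly. The following is a minimal sketch using only Python's standard `struct` module; the helper names `fp32_fields` and `to_fp32` are illustrative and not part of any runtime mentioned on this page. It splits a value into its sign, exponent, and mantissa fields and shows where the roughly 7-digit precision limit comes from.

```python
import struct

def fp32_fields(x: float):
    """Round x to FP32 and return (sign, biased_exponent, mantissa) as integers."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits (the leading 1 is implicit for normal values)
    return sign, exponent, mantissa

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(fp32_fields(1.0))       # (0, 127, 0)
print(fp32_fields(-0.15625))  # (1, 124, 2097152)
# 2**24 + 1 needs 25 significant bits, so FP32 cannot store it exactly.
print(to_fp32(16777217.0))    # 16777216.0
```

With 23 stored mantissa bits plus the implicit leading bit, integers above 2^24 start to lose exactness, which is the same limit as "about 7 decimal digits".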

Practical example

A 7B parameter model in FP32 requires 7e9 × 4 bytes = 28 GB of VRAM for its weights. An RTX 4090 has 24 GB, so it cannot hold the model in FP32 without offloading part of it to system RAM, which can cut throughput to roughly ~5 tok/s. Quantized to 4-bit (Q4_K_M), the same model uses ~5 GB, fits entirely in VRAM, and runs at ~80 tok/s. Treat these throughput figures as rough; actual numbers depend on the runtime, context length, and system.
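The arithmetic above generalizes to any parameter count and bit width. Here is a rough sketch, assuming decimal gigabytes (1 GB = 1e9 bytes) and counting weights only; KV cache, activations, and runtime overhead are deliberately ignored. The function names are hypothetical helpers for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def max_params_billion(vram_gb: float, bits_per_weight: float) -> float:
    """Upper bound on parameter count (in billions) whose weights fit in vram_gb."""
    return vram_gb * 8 / bits_per_weight

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: {weight_memory_gb(7e9, bits):.1f} GB")

# Theoretical ceiling for a 24 GB card; real-world limits are lower.
print(f"24 GB @ FP32: {max_params_billion(24, 32):.0f}B parameters")
```

Real quantized files come out somewhat larger than the pure 4-bit figure because formats such as Q4_K_M also store per-block scale data and keep some tensors at higher precision.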

Workflow example

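When you run llama-cli -m model.gguf with a Q4_K_M quantized model, the runtime loads weights stored as blocks of low-bit integers (mostly 4-bit) with per-block scale factors. If you instead download an FP32 checkpoint of a 7B-class model from Hugging Face and try to load it at full precision with transformers, you'll hit an out-of-memory error on most consumer GPUs. With llama.cpp, FP32 never reaches the inference step: the conversion script (convert_hf_to_gguf.py in current releases, formerly convert.py) writes a GGUF file, typically in F16, and a separate llama-quantize step produces the Q4_K_M file. Operators therefore rarely interact with FP32 directly during inference.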
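To make the idea of storing weights as small integers concrete, here is a simplified sketch of symmetric 4-bit block quantization in plain Python. It is an illustration only and is not the actual Q4_K_M layout used by llama.cpp, which has a more elaborate block structure; the function names and the sample weights are made up for the example.

```python
def quantize_block_4bit(weights):
    """Map one block of floats to integers in [-8, 7] plus a single float scale."""
    max_abs = max(abs(w) for w in weights)
    # Choose the scale so the largest magnitude maps to 7; avoid dividing by zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    """Reconstruct approximate floats from the stored integers and scale."""
    return [v * scale for v in q]

# Hypothetical FP32 weights for one small block.
block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.0, 0.48, -0.76]
q, scale = quantize_block_4bit(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("quantized ints:", q)
print("scale:", scale)
print("max reconstruction error:", max_err)
```

Each weight now costs 4 bits plus a share of the per-block scale instead of 32 bits, and the reconstruction error is bounded by half the scale. That is the trade the page describes: a small accuracy loss for a much smaller footprint.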

Reviewed by Fredoline Eruo. See our editorial policy.
