RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

FLOPS

FLOPS (floating-point operations per second) measures how many floating-point calculations a processor can perform in one second. In local AI, FLOPS sets the raw compute ceiling for the matrix multiplications behind inference and training. Higher FLOPS generally means faster prompt processing and training, but real-world token generation also depends on memory bandwidth and quantization. Operators encounter FLOPS when comparing GPUs: an RTX 4090 delivers ~82 TFLOPS (FP16), while an RTX 3060 manages ~12 TFLOPS, a gap that shows up in tokens per second wherever compute is the limiting factor.
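The headline figures above can be reproduced from published specs. The sketch below assumes the common convention that each shader core retires one fused multiply-add (2 FLOPs) per clock; the core counts and boost clocks are reference-card values that do not appear in the text above, and the result is the non-tensor-core rate.

```python
def peak_tflops(shader_cores: int, boost_clock_ghz: float,
                flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per core per clock.

    cores x GHz gives billions of operations per second (GFLOPS),
    so dividing by 1000 converts to TFLOPS.
    """
    return shader_cores * boost_clock_ghz * flops_per_core_per_clock / 1000.0


# Assumed reference specs: (shader cores, boost clock in GHz)
rtx_4090 = peak_tflops(16384, 2.52)
rtx_3060 = peak_tflops(3584, 1.777)

print(f"RTX 4090: {rtx_4090:.1f} TFLOPS")
print(f"RTX 3060: {rtx_3060:.1f} TFLOPS")
```

Partner cards with higher boost clocks will come out slightly above these numbers, and sustained clocks under load are usually lower than the advertised boost.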

Deeper dive

FLOPS is a theoretical peak metric, rarely achieved in practice because of memory bottlenecks and kernel overhead. Quantization reduces precision (e.g., FP16 to INT4), which shrinks the weights that must be read from memory and can raise effective throughput where the hardware and kernels run lower-precision operations natively. Operators should compare FLOPS at the same precision: a GPU rated at 100 TFLOPS (FP16) may deliver only 25 TFLOPS at FP32, so mixing precisions across spec sheets makes a card look several times faster or slower than it is. In local AI, FLOPS matters most for batch processing, prompt processing, and training; for single-stream inference, especially with large models and long contexts, memory bandwidth is usually the bottleneck.
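A rough way to see which limit applies is a roofline-style estimate: compare the time to do the arithmetic for one token against the time to read the weights once. The sketch below is a simplification under stated assumptions: about 2 FLOPs per parameter per generated token, every weight read once per token, no KV-cache or activation traffic, and example figures of 35 TFLOPS and 936 GB/s chosen for illustration.

```python
def decode_times(params_billion: float, bytes_per_param: float,
                 peak_tflops: float, bandwidth_gbs: float) -> tuple[float, float]:
    """Return (compute_seconds, memory_seconds) for one generated token.

    Assumptions (not measurements): ~2 FLOPs per parameter per token,
    and all weights streamed from VRAM exactly once per token.
    """
    params = params_billion * 1e9
    compute_s = (2 * params) / (peak_tflops * 1e12)
    memory_s = (params * bytes_per_param) / (bandwidth_gbs * 1e9)
    return compute_s, memory_s


# Illustrative inputs: 8B parameters at ~4 bits (0.5 bytes) per weight
compute_s, memory_s = decode_times(8, 0.5, peak_tflops=35, bandwidth_gbs=936)
bound = "memory bandwidth" if memory_s > compute_s else "compute"

print(f"compute: {compute_s * 1e3:.2f} ms/token")
print(f"memory:  {memory_s * 1e3:.2f} ms/token")
print(f"single-stream decoding is limited by {bound}")
```

With batching, the weights are read once for many tokens while the arithmetic scales with batch size, which is why compute becomes the limit for batch processing and training.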

Practical example

An RTX 3090 has ~35 TFLOPS (FP32) and ~70 TFLOPS (FP16). Running Llama 3.1 8B at Q4 (4-bit) on an RTX 3090 yields ~40 tok/s, while an RTX 3060 (12 TFLOPS FP32) manages ~15 tok/s. The 2.9x FP32 FLOPS difference roughly matches the 2.7x speedup, so FLOPS can serve as a rough proxy for inference speed. Treat it as a proxy only: the RTX 3090 also has about 2.6x the memory bandwidth of the RTX 3060 (936 GB/s vs. 360 GB/s), so this comparison alone cannot separate the two effects.
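The ratios in this example are simple divisions, sketched below so they can be rerun with other cards. The FLOPS and tok/s figures come from the paragraph above; the memory bandwidth figures (936 GB/s and 360 GB/s) are published specs added here as an assumption for comparison.

```python
# Figures from the example above
tflops_fp32 = {"RTX 3090": 35, "RTX 3060": 12}
tokens_per_s = {"RTX 3090": 40, "RTX 3060": 15}

# Published memory bandwidth in GB/s (added for comparison)
bandwidth_gbs = {"RTX 3090": 936, "RTX 3060": 360}


def ratio(table: dict[str, float]) -> float:
    """How many times larger the RTX 3090 value is than the RTX 3060 value."""
    return table["RTX 3090"] / table["RTX 3060"]


flops_ratio = ratio(tflops_fp32)
speed_ratio = ratio(tokens_per_s)
bandwidth_ratio = ratio(bandwidth_gbs)

print(f"FLOPS ratio:     {flops_ratio:.1f}x")
print(f"tok/s ratio:     {speed_ratio:.1f}x")
print(f"bandwidth ratio: {bandwidth_ratio:.1f}x")
```

Because the FLOPS and bandwidth ratios are so close for this pair, a comparison between cards where the two diverge is more informative about which one governs a given workload.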

Workflow example

When selecting a GPU for local AI, operators check FLOPS specs on manufacturer pages or sites like TechPowerUp. llama.cpp does not display FLOPS directly; its load-time log reports memory usage, and its llama-bench tool reports prompt-processing and token-generation throughput. Tools like nvidia-smi report GPU utilization, which gives a coarse hint at whether compute or bandwidth is the bottleneck. For training with Hugging Face Transformers, FLOPS largely determines epoch time when the workload is compute-bound and fits in VRAM: other things being equal, a 10B model on an RTX 4090 (82 TFLOPS FP16) would train roughly 2.7x faster than on an RTX 3080 (30 TFLOPS FP16).
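The training comparison can be sketched with the widely used approximation that training costs about 6 FLOPs per parameter per token. Everything below other than the two TFLOPS figures is an assumption for illustration: the tokens-per-epoch value is hypothetical, the 40% utilization is a placeholder for how much of peak is actually sustained, and the estimate ignores whether a 10B model's training state fits in the card's VRAM at all.

```python
def epoch_hours(params: float, tokens_per_epoch: float,
                peak_tflops: float, utilization: float = 0.4) -> float:
    """Estimated hours per epoch using ~6 FLOPs per parameter per token.

    `utilization` is the assumed fraction of peak FLOPS actually sustained.
    """
    total_flops = 6 * params * tokens_per_epoch
    sustained_flops_per_s = peak_tflops * 1e12 * utilization
    return total_flops / sustained_flops_per_s / 3600


PARAMS = 10e9            # 10B-parameter model, from the text above
TOKENS_PER_EPOCH = 1e9   # hypothetical dataset size

hours_4090 = epoch_hours(PARAMS, TOKENS_PER_EPOCH, peak_tflops=82)
hours_3080 = epoch_hours(PARAMS, TOKENS_PER_EPOCH, peak_tflops=30)
speedup = hours_3080 / hours_4090

print(f"RTX 4090: {hours_4090:.0f} h per epoch")
print(f"RTX 3080: {hours_3080:.0f} h per epoch")
print(f"speedup:  {speedup:.1f}x")
```

At equal utilization the speedup reduces to the ratio of peak FLOPS, so the absolute hours depend heavily on the assumed values while the ratio does not.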

Reviewed by Fredoline Eruo. See our editorial policy.
