Hardware & infrastructure

FLOPS

FLOPS (floating-point operations per second) measures how many floating-point calculations a processor can perform in one second. In local AI, peak FLOPS sets the ceiling on compute speed for the matrix multiplications in inference and training. Higher FLOPS can mean faster token generation, but real-world throughput also depends on memory bandwidth and quantization. Operators encounter FLOPS when comparing GPUs: an RTX 4090 is rated at ~82 TFLOPS (FP16), while an RTX 3060 manages ~12 TFLOPS, a gap that shows up in tokens per second on large models whenever compute is the limiting factor.
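Spec-sheet FLOPS figures are usually derived rather than measured: shader cores times clock speed times operations per cycle. The sketch below reproduces that arithmetic. The core counts and boost clocks are assumptions taken from public spec listings, not from this page, and `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(shader_cores: int, boost_clock_ghz: float,
                flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOPS: cores x clock x FLOPs per core per cycle.

    A fused multiply-add counts as two floating-point operations,
    which is why the default is 2.
    """
    flops = shader_cores * boost_clock_ghz * 1e9 * flops_per_core_per_cycle
    return flops / 1e12


# Assumed spec-sheet values (not stated in the text above).
rtx_4090 = peak_tflops(shader_cores=16384, boost_clock_ghz=2.52)
rtx_3060 = peak_tflops(shader_cores=3584, boost_clock_ghz=1.78)

print(f"RTX 4090: {rtx_4090:.1f} TFLOPS")
print(f"RTX 3060: {rtx_3060:.1f} TFLOPS")
```

Because the result assumes every core issues a fused multiply-add on every cycle at boost clock, it is an upper bound rather than a measurement.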

Deeper dive

FLOPS is a theoretical peak metric, rarely achieved in practice because of memory bottlenecks and kernel overhead. For inference, memory bandwidth often limits speed more than FLOPS, especially with large models and long contexts. Quantization reduces precision (e.g., FP16 to INT4), which shrinks the bytes read per weight and, on hardware with native low-precision units, can also raise arithmetic throughput. Operators should compare FLOPS at the same precision: a GPU rated at 100 TFLOPS (FP16) may be rated at only 25 TFLOPS (FP32), so setting one card's FP16 figure against another card's FP32 figure exaggerates the gap. In local AI, FLOPS matters most for batch processing or training; for single-stream inference, memory bandwidth is usually the bottleneck.
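The bandwidth-versus-compute point can be made concrete with a back-of-the-envelope bound. The sketch below assumes roughly 2 FLOPs per weight per generated token and that every weight is read from VRAM once per token. Both are common approximations that ignore attention and KV-cache traffic. The card (80 TFLOPS, 1000 GB/s) is hypothetical, as is the function name.

```python
def decode_time_bounds(params_billion: float, bytes_per_param: float,
                       peak_tflops: float, bandwidth_gb_s: float):
    """Return (compute_seconds, memory_seconds) lower bounds for one token."""
    params = params_billion * 1e9
    flops_per_token = 2 * params                 # one multiply + one add per weight
    bytes_per_token = params * bytes_per_param   # each weight read once per token
    compute_s = flops_per_token / (peak_tflops * 1e12)
    memory_s = bytes_per_token / (bandwidth_gb_s * 1e9)
    return compute_s, memory_s


# Hypothetical card; 8B model with 4-bit weights (0.5 bytes per parameter).
compute_s, memory_s = decode_time_bounds(
    params_billion=8, bytes_per_param=0.5, peak_tflops=80, bandwidth_gb_s=1000
)
limit = "memory bandwidth" if memory_s > compute_s else "compute"

print(f"compute bound: {compute_s * 1e3:.2f} ms/token")
print(f"memory bound:  {memory_s * 1e3:.2f} ms/token")
print(f"limiting resource: {limit}")
```

Under these assumptions the memory term is about 20 times larger than the compute term for a single stream. Batching several requests reuses each weight read across more tokens, which is why compute matters more for batch processing.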

Practical example

An RTX 3090 is rated at ~35 TFLOPS (FP32) and ~70 TFLOPS (FP16, on tensor cores). Running Llama 3.1 8B at Q4 (INT4) on an RTX 3090 yields ~40 tok/s, while an RTX 3060 (~12 TFLOPS FP32) manages ~15 tok/s. The 2.9x gap in FP32 FLOPS roughly matches the 2.7x speedup, so FLOPS works as a rough proxy for inference speed here. The two cards' memory bandwidth differs by a similar factor, however, so this comparison alone cannot show which resource is the limit.
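The ratios in this example are easy to check. The sketch below uses the FLOPS and tok/s figures quoted above. The memory bandwidth values are assumptions from public spec listings, not from this page, and are included to show how closely the two ratios track each other.

```python
# Figures quoted in the example above.
flops_tflops = {"RTX 3090": 35.0, "RTX 3060": 12.0}   # FP32
tokens_per_s = {"RTX 3090": 40.0, "RTX 3060": 15.0}   # Llama 3.1 8B at Q4

# Assumed spec-sheet memory bandwidth in GB/s (not stated in the text).
bandwidth_gb_s = {"RTX 3090": 936.0, "RTX 3060": 360.0}


def ratio(table: dict) -> float:
    """Ratio of the RTX 3090 value to the RTX 3060 value."""
    return table["RTX 3090"] / table["RTX 3060"]


flops_ratio = ratio(flops_tflops)
speed_ratio = ratio(tokens_per_s)
bw_ratio = ratio(bandwidth_gb_s)

print(f"FLOPS ratio:     {flops_ratio:.1f}x")
print(f"tok/s ratio:     {speed_ratio:.1f}x")
print(f"bandwidth ratio: {bw_ratio:.1f}x")
```

Because the compute and bandwidth ratios land so close together for this pair of cards, a single benchmark cannot separate the two effects. Comparing cards whose FLOPS and bandwidth diverge would be more informative.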

Workflow example

When selecting a GPU for local AI, operators check FLOPS specs on manufacturer pages or sites like TechPowerUp. llama.cpp reports VRAM usage in its model-load log but does not display FLOPS; its llama-bench tool reports measured tokens per second instead. Tools like nvidia-smi report GPU utilization, which shows whether the GPU is kept busy but does not by itself separate a compute limit from a bandwidth limit. For training with Hugging Face Transformers, FLOPS largely sets epoch time: going by peak FP16 ratings, a 10B model on an RTX 4090 (82 TFLOPS FP16) could train up to ~2.7x faster than on an RTX 3080 (30 TFLOPS FP16), assuming the workload fits in VRAM and stays compute-bound.
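For a rough sense of how FLOPS translates into training time, a common rule of thumb puts training cost at about 6 FLOPs per parameter per token. The sketch below applies that rule. The token count and the 40% utilization figure are illustrative assumptions, `training_days` is a hypothetical helper, and the estimate ignores whether the model and optimizer state fit in VRAM.

```python
def training_days(params: float, tokens: float,
                  peak_tflops: float, utilization: float) -> float:
    """Estimated wall-clock days, using ~6 FLOPs per parameter per token."""
    total_flops = 6 * params * tokens
    achieved_flops_per_s = peak_tflops * 1e12 * utilization
    return total_flops / achieved_flops_per_s / 86_400  # seconds per day


PARAMS = 10e9        # 10B model, as in the text
TOKENS = 1e9         # assumed dataset size, for illustration only
UTILIZATION = 0.4    # assumed fraction of peak FLOPS actually achieved

days_4090 = training_days(PARAMS, TOKENS, peak_tflops=82, utilization=UTILIZATION)
days_3080 = training_days(PARAMS, TOKENS, peak_tflops=30, utilization=UTILIZATION)

print(f"RTX 4090: {days_4090:.1f} days")
print(f"RTX 3080: {days_3080:.1f} days")
print(f"speedup:  {days_3080 / days_4090:.2f}x")
```

At equal utilization the speedup reduces to the ratio of the peak ratings, so any real difference in achieved utilization between the two cards shifts the result.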

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work