RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
Evaluation metrics

Precision

Precision in local AI refers to the number of bits used to represent each weight and activation in a neural network. Lower precision (e.g., 4-bit) reduces model size and memory bandwidth requirements, enabling larger models to run on consumer hardware, but can degrade output quality. Common precisions include FP32 (32-bit), FP16 (16-bit), and quantized integer formats like Q4_K_M (4-bit) used in llama.cpp. The operator must balance VRAM constraints against acceptable perplexity loss.
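
To make the quality trade-off concrete, here is a minimal sketch of what lowering precision does to a block of weights. It uses plain symmetric absmax rounding, which is a deliberate simplification and not the K-quant scheme llama.cpp actually uses; the function names and the random stand-in weights are assumptions for illustration only.

```python
import random

def quantize_roundtrip(weights, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then convert back."""
    qmax = 2 ** (bits - 1) - 1                 # largest integer level, e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:                           # all-zero block: nothing to quantize
        return list(weights)
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(original, restored):
    """Average absolute difference between the original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
# Stand-in for one block of model weights (hypothetical data, not from a real model).
weights = [random.gauss(0.0, 0.02) for _ in range(256)]

for bits in (8, 4, 2):
    restored = quantize_roundtrip(weights, bits)
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, restored):.6f}")
```

Fewer bits mean fewer representable levels, so the rounding error per weight grows; that accumulated error is what eventually shows up as perplexity loss.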

Deeper dive

Precision directly impacts VRAM usage and inference speed. The weights of a 7B-parameter model take ~28 GB in FP32 (4 bytes per parameter), while Q4_K_M reduces them to ~4 GB, which fits on a 6 GB GPU with room left for the KV cache. Lower precision also increases tokens per second, because token generation is usually limited by memory bandwidth and less data moves across the memory bus per token. However, aggressive quantization (e.g., 2-bit) can introduce noticeable quality loss. llama.cpp offers a range of quantization levels (Q2, Q3, Q4, Q5, Q6, Q8) that trade size against fidelity. Operators typically choose the highest precision that fits their VRAM budget.
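
The arithmetic behind these figures can be sketched in a few lines. This is a back-of-the-envelope estimate for the weights alone; the ~4.5 bits per weight used for a 4-bit quantization and the 300 GB/s bandwidth are assumed round numbers, and the throughput function is only a rough ceiling that ignores the KV cache, compute limits, and overhead.

```python
def weight_gb(params, bits_per_weight):
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def decode_ceiling_tok_s(model_gb, bandwidth_gb_s):
    """Rough upper bound on tokens/s: each generated token reads every weight once."""
    return bandwidth_gb_s / model_gb

params = 7e9  # 7B-parameter model

print(f"FP32:  {weight_gb(params, 32):.1f} GB")
print(f"FP16:  {weight_gb(params, 16):.1f} GB")
print(f"4-bit: {weight_gb(params, 4.5):.1f} GB (assuming ~4.5 bits per weight)")

# Hypothetical GPU with 300 GB/s of memory bandwidth.
for bits in (16, 4.5):
    size = weight_gb(params, bits)
    print(f"{bits} bits/weight: at most ~{decode_ceiling_tok_s(size, 300):.0f} tok/s")
```

Halving the bytes per weight roughly doubles the bandwidth-bound ceiling, which is why quantized models generate faster on the same card.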

Practical example

A 13B model in FP16 (~26 GB of weights) exceeds the VRAM of a 16 GB RTX 4060 Ti. With Q4_K_M quantization (~8 GB) it fits comfortably and runs at roughly 40 tok/s, depending on context length and backend. Dropping to Q2 (~5 GB) would run even faster but may degrade output coherence.
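
The "highest precision that fits" rule from this example can be sketched as a simple lookup. The bits-per-weight table holds approximate values assumed for illustration (real GGUF files vary by model and llama.cpp version), and `pick_quant` and the 2 GB headroom for KV cache and runtime overhead are hypothetical choices, not part of any tool.

```python
# Approximate bits per weight, ordered from highest to lowest precision.
# Assumed values for illustration only; check actual GGUF file sizes before buying.
APPROX_BITS_PER_WEIGHT = [
    ("F16", 16.0),
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.85),
    ("Q3_K_M", 3.9),
    ("Q2_K", 3.35),
]

def pick_quant(params, vram_gb, headroom_gb=2.0):
    """Return the highest-precision type whose weights plus headroom fit in VRAM."""
    for name, bits in APPROX_BITS_PER_WEIGHT:
        weights_gb = params * bits / 8 / 1e9
        if weights_gb + headroom_gb <= vram_gb:
            return name
    return None  # nothing fits; consider a smaller model or CPU offload

for vram in (16, 12, 8):
    print(f"13B model, {vram} GB VRAM -> {pick_quant(13e9, vram)}")
```

With only 2 GB of headroom this sketch will admit larger quantizations than Q4_K_M on a 16 GB card; raising `headroom_gb` to leave more room for long contexts pushes the choice toward smaller files.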

Workflow example

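In llama.cpp, precision is not a runtime flag (-b sets the batch size, not the bit width). The operator either downloads a pre-quantized GGUF file like llama-2-13b.Q4_K_M.gguf or converts a higher-precision GGUF with the llama-quantize tool. In LM Studio, the model card lists available quantizations; the operator picks one that fits their GPU VRAM. In MLX, precision is typically fixed when the model is converted (for example with mlx_lm.convert and its quantization options) or by casting the loaded parameters to a dtype such as mx.float16, rather than through an argument to load_weights.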
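
For the llama.cpp path, a small helper can assemble the conversion command. This sketch assumes the tool is on the PATH under the name llama-quantize (older builds call it quantize) and that it takes the input file, output file, and quantization type as positional arguments; the helper name and the output naming convention are hypothetical. It only builds the command and does not run it.

```python
import shlex
from pathlib import Path

def quantize_command(src_gguf, quant_type, binary="llama-quantize"):
    """Build the argument list for converting a GGUF file to `quant_type`."""
    src = Path(src_gguf)
    # Hypothetical naming convention: swap the last dotted label of the stem
    # (e.g. "F16") for the target quantization type.
    base = src.stem.rsplit(".", 1)[0]
    dst = src.with_name(f"{base}.{quant_type}.gguf")
    return [binary, str(src), str(dst), quant_type]

cmd = quantize_command("llama-2-13b.F16.gguf", "Q4_K_M")
print(shlex.join(cmd))  # paste into a shell, or pass `cmd` to subprocess.run
```

Returning a list rather than a string keeps the command safe to hand to subprocess.run without shell quoting issues.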

Reviewed by Fredoline Eruo. See our editorial policy.
