RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware Planning for Local AI

06. CPU-Only Inference

Chapter 6 of 20 · 15 min
KEY INSIGHT

CPU inference is practical only for development/testing or non-interactive batch workloads; GPU acceleration is essential for interactive LLM use.

```bash
# Run llama.cpp on the CPU only (no GPU offload).
# Newer llama.cpp builds name this binary llama-cli instead of main.
#   -t 8    number of CPU threads
#   -ngl 0  offload zero layers to the GPU (CPU only)
./main -m models/llama-3-8b-q4_k_m.gguf \
  --seed 42 \
  -p "Explain quantum computing" \
  -n 256 \
  -t 8 \
  -ngl 0

# Typical output (illustrative):
# llama_new_context_with_model: n_ctx = 2048, n_keep = 0
# CUDA: Not using
# AVX2 = 1, AVX_VNNI = 1, FMA = 1
# inference takes 3-5 minutes for 256 tokens (roughly 1 token/sec, a slow CPU)
```

Running AI inference on CPU-only systems is viable for specific use cases. Understanding CPU inference helps when GPU resources are unavailable or unnecessary.

When CPU Inference Makes Sense

  • Batch processing where speed is irrelevant (overnight runs)
  • Development and testing (quick iteration without GPU overhead)
  • Extreme-budget systems with no discrete GPU
  • Power-constrained environments (laptops on battery)

Performance Expectations

CPU inference speed depends heavily on core count, memory bandwidth, and architecture:

| CPU | Cores | Memory bandwidth | Tokens/sec (7B INT4) |
|---|---|---|---|
| M1 MacBook Air | 8 (total) | 68 GB/s | 8-12 |
| AMD Ryzen 9 7950X | 16C/32T | 76 GB/s | 15-20 |
| Intel i9-14900K | 24C/32T | 89 GB/s | 18-25 |
| Apple M3 Max | 12C (perf) | 400 GB/s | 30+ |

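Apple Silicon's unified memory architecture provides a substantial advantage over traditional systems where VRAM and system RAM are separate: the CPU and GPU share one high-bandwidth memory pool, so CPU inference is not limited to typical desktop DDR bandwidth.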
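The link between memory bandwidth and speed can be sketched with a rough ceiling: generating each token streams roughly all model weights from RAM once, so tokens/sec cannot exceed bandwidth divided by model size. The helper name max_tokens_per_sec is hypothetical, and the estimate ignores compute limits, cache effects, and prompt processing, so measured speeds land below it.

```bash
# Rough upper bound on generation speed (an assumption-laden estimate,
# not a llama.cpp feature): bandwidth (GB/s) divided by model size (GB).
# Usage: max_tokens_per_sec BANDWIDTH_GBS MODEL_GB
max_tokens_per_sec() {
  LC_ALL=C awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

max_tokens_per_sec 68 3.5   # M1 MacBook Air, 7B at INT4 -> 19.4
max_tokens_per_sec 89 3.5   # Intel i9-14900K, 7B at INT4 -> 25.4
```

Both ceilings sit above the measured ranges in the table (8-12 and 18-25 tokens/sec), which is what a bandwidth-bound workload would look like.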

llama.cpp CPU Performance

The llama.cpp library heavily optimizes CPU inference through:

  • AVX2/AVX512 instruction sets on x86
  • NEON and AMX instructions on ARM
  • Quantization kernels (Q4_0, Q5_K, Q8_0)
  • Memory-mapped file support
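On Linux x86 systems, the kernel lists the instruction sets a CPU supports in /proc/cpuinfo. A minimal sketch for checking the ones named above follows. list_simd_flags is a hypothetical helper name, and the approach does not carry over to macOS or ARM, where CPU features are reported differently.

```bash
# Print which SIMD-related flags the CPU advertises (Linux x86 only).
# Usage: list_simd_flags [CPUINFO_FILE]   (defaults to /proc/cpuinfo)
list_simd_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  # Nothing to report when the file is missing (e.g. on macOS).
  [ -r "$cpuinfo" ] || return 0
  # -o prints only the match, -w requires a whole-word match,
  # sort -u collapses the per-core repeats into one line per flag.
  grep -o -w -E 'avx2|avx512f|avx_vnni|fma' "$cpuinfo" | sort -u
}

list_simd_flags
```

An empty result on an x86 machine suggests an older CPU without AVX2, where CPU inference will be noticeably slower.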

Memory Requirements

CPU inference requires model weights in system RAM (not VRAM). At 7B parameters:

  • FP16: 14GB RAM minimum
  • INT4: 3.5GB RAM minimum
  • Plus 4-8GB for operating system and inference overhead

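16GB of system RAM comfortably runs 7B models at INT4 and leaves room for higher-precision quantizations such as INT8 (7GB of weights); FP16 does not fit once overhead is counted. 32GB allows 13B models at INT4.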
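The arithmetic behind these figures is parameters times bits per weight, divided by 8, plus overhead. A small sketch, where estimate_ram_gb is a hypothetical helper and the 4 GB default overhead is the low end of the range above; real quantized model files run somewhat larger than the ideal bits-per-weight figure, and long contexts add more.

```bash
# Estimate system RAM needed for CPU inference, in GB.
# Usage: estimate_ram_gb PARAMS_BILLIONS BITS_PER_WEIGHT [OVERHEAD_GB]
estimate_ram_gb() {
  LC_ALL=C awk -v p="$1" -v bits="$2" -v extra="${3:-4}" \
    'BEGIN { printf "%.1f\n", p * bits / 8 + extra }'
}

estimate_ram_gb 7 16     # 7B FP16: 14 GB weights + 4 GB overhead -> 18.0
estimate_ram_gb 7 4      # 7B INT4: 3.5 GB weights + 4 GB overhead -> 7.5
estimate_ram_gb 13 4 8   # 13B INT4 with 8 GB overhead -> 14.5
```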

Failure Mode: Slow Inference

The primary failure mode is unusable speed. A 4096-token response at 2 tokens/sec takes 34 minutes. Interactive use requires at least 5 tokens/sec.
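The same calculation works for any response length and speed. In this sketch, response_minutes is a hypothetical helper that divides the token count by the generation rate.

```bash
# Minutes needed to generate a response at a given speed.
# Usage: response_minutes TOKENS TOKENS_PER_SEC
response_minutes() {
  LC_ALL=C awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n / rate / 60 }'
}

response_minutes 4096 2   # -> 34.1
response_minutes 4096 5   # -> 13.7
```

Even at the 5 tokens/sec floor, a 4096-token response takes over 13 minutes, so interactive use on a CPU also depends on keeping responses short.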

EXERCISE

Run a CPU-only inference test with llama.cpp on your current system. Measure tokens per second for a 7B model. Compare with the expected performance for your CPU class.
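One way to approach the exercise is a coarse wall-clock measurement, sketched below. It assumes the ./main binary and model path from the Key Insight example, and tokens_per_sec is a hypothetical helper. The result understates true generation speed because the elapsed time includes model loading and prompt processing, and the model may stop before 256 tokens. The timing summary llama.cpp prints at the end of a run is more precise.

```bash
# Tokens per second from a token count and elapsed seconds.
# Usage: tokens_per_sec TOKENS ELAPSED_SECONDS
tokens_per_sec() {
  LC_ALL=C awk -v n="$1" -v s="$2" \
    'BEGIN { if (s > 0) printf "%.1f\n", n / s; else print "n/a" }'
}

# Model path is an assumption; override with MODEL=/path/to/model.gguf
MODEL="${MODEL:-models/llama-3-8b-q4_k_m.gguf}"

# Time a CPU-only run; skipped when the binary or model is missing.
if [ -x ./main ] && [ -f "$MODEL" ]; then
  start=$(date +%s)
  ./main -m "$MODEL" -p "Explain quantum computing" -n 256 -t 8 -ngl 0 \
    > /dev/null 2>&1
  end=$(date +%s)
  tokens_per_sec 256 "$((end - start))"
fi
```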
