CPU-Only Inference — Hardware Planning for Local AI (Chapter 6)

Running AI inference on CPU-only systems is viable for specific use cases. Understanding CPU inference helps when GPU resources are unavailable or unnecessary.

When CPU Inference Makes Sense

Batch processing where speed is irrelevant (overnight runs)
Development and testing (quick iteration without GPU overhead)
Extremely budget-constrained systems
Power-constrained environments (laptops on battery)

Performance Expectations

CPU inference speed depends heavily on core count, memory bandwidth, and architecture:

CPU	Cores	Memory BW	Tokens/sec (7B INT4)
M1 MacBook Air	8 (total)	68 GB/s	8-12
AMD Ryzen 9 7950X	16C/32T	76 GB/s	15-20
Intel i9-14900K	24C/32T	89 GB/s	18-25
Apple M3 Max	12C (perf)	400 GB/s	30+

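Apple Silicon's unified memory architecture provides a substantial advantage: the CPU reads model weights from the same high-bandwidth memory pool the GPU uses. Traditional systems split memory into fast VRAM and slower system RAM, and CPU inference is limited to the latter.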
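
Memory bandwidth matters because generating each token requires reading roughly all of the model weights from memory. A common rule of thumb therefore caps decode speed at memory bandwidth divided by model size. The sketch below applies that rule to three rows of the table. The function name is hypothetical, and the assumption that every weight is streamed exactly once per token is a simplification, so real throughput lands below this ceiling.

```python
def bandwidth_bound_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed, assuming each token streams all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# A 7B model at INT4 is about 3.5 GB of weights (7e9 parameters * 0.5 bytes).
MODEL_GB = 3.5

# Bandwidth figures taken from the table above.
systems = [
    ("M1 MacBook Air", 68),
    ("AMD Ryzen 9 7950X", 76),
    ("Intel i9-14900K", 89),
]

for name, bandwidth in systems:
    ceiling = bandwidth_bound_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.0f} tokens/sec")
```

The measured ranges in the table all sit below these ceilings, which is what the rule predicts: compute limits, cache behavior, and context-length overhead all reduce the achieved rate.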

llama.cpp CPU Performance

The llama.cpp library heavily optimizes CPU inference through:

AVX2/AVX512 instruction sets on x86
NEON instructions on ARM, plus Apple's Accelerate framework (which can use the AMX coprocessor) on Apple Silicon
Quantization kernels (Q4_0, Q5_K, Q8_0)
Memory-mapped file support
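
To illustrate the last item, the sketch below memory-maps a file with Python's standard mmap module. It is not llama.cpp's implementation (which is written in C/C++); it only shows the operating-system mechanism, in which pages are loaded on first access instead of the whole file being read up front. The dummy file is a hypothetical stand-in for a model file.

```python
import mmap
import os
import tempfile

# Create a stand-in "model file" (hypothetical; real model files are far larger).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 4096)  # 1 MiB of dummy data
    path = f.name

try:
    with open(path, "rb") as f:
        # Map the file read-only; the OS loads pages lazily on first access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            mapped_size = len(mm)
            # Slicing touches only the pages backing this byte range.
            chunk = mm[4096:4104]
    print(mapped_size, list(chunk))
finally:
    os.remove(path)
```

Because mapped pages are backed by the file itself, the operating system can evict and reload them under memory pressure, and repeated launches can reuse pages that are already in the page cache.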

Memory Requirements

CPU inference requires model weights in system RAM (not VRAM). At 7B parameters:

FP16: 14GB RAM minimum
INT4: 3.5GB RAM minimum
Plus 4-8GB for operating system and inference overhead

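16GB of system RAM comfortably runs 7B models at INT4 and leaves room for higher-precision quantization such as INT8 (about 7GB of weights); FP16 at 14GB does not fit once overhead is included. 32GB comfortably accommodates 13B models at INT4 (about 6.5GB of weights).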
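
The arithmetic behind these figures is parameters times bits per weight, divided by 8 to get bytes. The sketch below reproduces it and checks whether a model fits in a given amount of RAM. The function names are hypothetical, and the default overhead of 8GB is simply the upper end of the 4-8GB range given above. Quantized file formats typically store some extra per-block scale data, so actual files tend to be somewhat larger than this estimate.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_ram(params_billions: float, bits_per_weight: float,
                ram_gb: float, overhead_gb: float = 8.0) -> bool:
    """True if weights plus a fixed OS/inference overhead fit in system RAM."""
    return weights_gb(params_billions, bits_per_weight) + overhead_gb <= ram_gb


print(weights_gb(7, 16))               # 14.0 (FP16)
print(weights_gb(7, 4))                # 3.5  (INT4)
print(fits_in_ram(7, 4, ram_gb=16))    # True
print(fits_in_ram(7, 16, ram_gb=16))   # False
```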

Failure Mode: Slow Inference

The primary failure mode is unusable speed. A 4096-token response at 2 tokens/sec takes about 34 minutes (4096 / 2 = 2048 seconds). Interactive use requires at least 5 tokens/sec.
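
The wait time for a full response is simply the token count divided by the generation rate. The short sketch below tabulates it for a few rates; the function name and the chosen rates are illustrative, not measurements.

```python
def response_minutes(num_tokens: int, tokens_per_sec: float) -> float:
    """Minutes needed to generate num_tokens at a steady tokens_per_sec."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec / 60


for rate in (2, 5, 10, 20):
    print(f"{rate:>2} tokens/sec: {response_minutes(4096, rate):.1f} min")
```

At 2 tokens/sec this gives roughly 34 minutes for 4096 tokens, matching the figure above; at the 5 tokens/sec threshold the same response still takes well over ten minutes, so long outputs remain slow even when short exchanges feel interactive.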