06. CPU-Only Inference
Running AI inference on CPU-only systems is viable for specific use cases. Understanding CPU inference helps when GPU resources are unavailable or unnecessary.
When CPU Inference Makes Sense
- Batch processing where speed is irrelevant (overnight runs)
- Development and testing (quick iteration without GPU overhead)
- Systems built on an extremely tight budget
- Power-constrained environments (laptops on battery)
Performance Expectations
CPU inference speed depends heavily on core count, memory bandwidth, and architecture:
| CPU | Cores | Memory bandwidth | Tokens/sec (7B INT4) |
|---|---|---|---|
| M1 MacBook Air | 8 (total) | 68 GB/s | 8-12 |
| AMD Ryzen 9 7950X | 16C/32T | 76 GB/s | 15-20 |
| Intel i9-14900K | 24C/32T | 89 GB/s | 18-25 |
| Apple M3 Max | 12C (perf) | 400 GB/s | 30+ |
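Apple Silicon's unified memory architecture provides a substantial advantage over traditional systems, where VRAM and system RAM are separate: the CPU reads weights from the same high-bandwidth memory pool as the GPU, and token generation is typically limited by memory bandwidth rather than raw compute.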
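A rough way to sanity-check the table is to treat token generation as memory-bandwidth-bound: each generated token requires reading roughly all of the weights once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes that simplification (it ignores compute limits, caches, and KV-cache reads). The function name is illustrative, and the bandwidth figures are taken from the table above.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling on decode speed, assuming every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


MODEL_GB = 3.5  # 7B parameters at 4 bits per weight

# Peak memory bandwidth in GB/s, copied from the table above.
systems = [
    ("M1 MacBook Air", 68),
    ("AMD Ryzen 9 7950X", 76),
    ("Intel i9-14900K", 89),
]

for name, bandwidth in systems:
    ceiling = max_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{name}: at most {ceiling:.1f} tokens/sec")
```

Measured speeds in the table sit below these ceilings, which is expected, since real runs rarely sustain peak bandwidth.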
llama.cpp CPU Performance
The llama.cpp library heavily optimizes CPU inference through:
- AVX2, AVX-512, and (on recent Intel server CPUs) AMX instruction sets on x86
- NEON instructions on ARM, with Apple's Accelerate framework available on Apple Silicon
- Quantization kernels (Q4_0, Q5_K, Q8_0)
- Memory-mapped file support
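Which of these SIMD paths applies depends on what your CPU actually reports. As a minimal sketch, assuming a Linux system that exposes `/proc/cpuinfo`, the helper below looks for the relevant feature flags. The function name `simd_features` is hypothetical, and the flag names (`avx2`, `avx512f`, `asimd`, `neon`) are the ones the Linux kernel uses, not llama.cpp identifiers.

```python
def simd_features(cpuinfo_text: str) -> set[str]:
    """Return the SIMD-related flags found in /proc/cpuinfo-style text."""
    # "asimd" is how 64-bit ARM kernels report NEON; 32-bit ARM uses "neon".
    wanted = {"avx2", "avx512f", "asimd", "neon"}
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels label it "Features".
        if key.strip().lower() in ("flags", "features"):
            found |= wanted & set(value.split())
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(sorted(simd_features(f.read())))
    except FileNotFoundError:
        print("No /proc/cpuinfo here (not Linux?); check your CPU's spec sheet instead.")
```

An empty result on an x86 machine suggests the CPU lacks AVX2, in which case CPU inference is likely to be noticeably slower.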
Memory Requirements
CPU inference requires model weights in system RAM (not VRAM). At 7B parameters:
- FP16: 14GB RAM minimum
- INT4: 3.5GB RAM minimum
- Plus 4-8GB for operating system and inference overhead
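16GB of system RAM comfortably runs 7B models at INT4 and leaves room for higher-precision quantizations such as INT8 (about 7GB of weights). 32GB allows 13B models at INT4 (about 6.5GB of weights).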
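The figures above follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below reproduces them under that assumption. The function names and the 8GB default overhead (the upper end of the range above) are illustrative choices, and the estimate ignores the KV cache and the small per-block metadata that real quantized formats add.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion: float, bits_per_weight: float,
                ram_gb: float, overhead_gb: float = 8.0) -> bool:
    """True if weights plus OS/inference overhead fit in system RAM."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= ram_gb


print(weight_memory_gb(7, 16))   # FP16 weights for a 7B model
print(weight_memory_gb(7, 4))    # INT4 weights for a 7B model
print(fits_in_ram(7, 4, ram_gb=16))    # 7B INT4 on a 16GB machine
print(fits_in_ram(13, 4, ram_gb=32))   # 13B INT4 on a 32GB machine
print(fits_in_ram(7, 16, ram_gb=16))   # 7B FP16 on a 16GB machine
```

Treat a result close to the limit as a warning sign: once the system starts swapping, throughput drops sharply.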
Failure Mode: Slow Inference
The primary failure mode is unusable speed. A 4096-token response at 2 tokens/sec takes about 34 minutes (4096 / 2 = 2048 seconds). As a rule of thumb, interactive use requires at least 5 tokens/sec.
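To see how quickly wait times grow, the same arithmetic can be applied to other response lengths and speeds. This is a minimal sketch with an illustrative function name. It counts only generation time and assumes a constant rate, so prompt processing would add to these figures.

```python
def response_time_minutes(num_tokens: int, tokens_per_sec: float) -> float:
    """Minutes needed to generate num_tokens at a constant rate."""
    return num_tokens / tokens_per_sec / 60


for rate in (2, 5, 20):
    minutes = response_time_minutes(4096, rate)
    print(f"4096 tokens at {rate} tokens/sec: {minutes:.1f} minutes")
```

Even at the 5 tokens/sec threshold, a full 4096-token response still takes over 13 minutes, so that threshold suits short, conversational replies rather than long generations.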
Run a CPU-only inference test with llama.cpp on your current system. Measure tokens per second for a 7B model. Compare with the expected performance for your CPU class.
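One possible way to take this measurement from Python is sketched below. It assumes the `llama-cpp-python` bindings are installed and that you have a 7B GGUF model on disk; the model path, thread count, and helper names are placeholders rather than values from this guide. The timer wraps the whole call, so prompt processing is included and the result slightly understates pure generation speed.

```python
import time
from typing import Callable


def measure_tokens_per_sec(generate: Callable[[str, int], int],
                           prompt: str, max_tokens: int = 128) -> float:
    """Time one generation call and return generated tokens per second.

    `generate` takes (prompt, max_tokens) and returns the number of
    tokens it actually produced.
    """
    start = time.perf_counter()
    n_generated = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return n_generated / elapsed


def make_llama_cpp_generator(model_path: str, n_threads: int) -> Callable[[str, int], int]:
    """Build a CPU-only generator backed by llama-cpp-python (assumed installed)."""
    from llama_cpp import Llama  # imported lazily so the timer works without it

    # n_gpu_layers=0 keeps every layer on the CPU.
    llm = Llama(model_path=model_path, n_gpu_layers=0,
                n_threads=n_threads, verbose=False)

    def generate(prompt: str, max_tokens: int) -> int:
        out = llm(prompt, max_tokens=max_tokens)
        return out["usage"]["completion_tokens"]

    return generate


# Example usage (placeholder path; adjust n_threads to your physical core count):
# gen = make_llama_cpp_generator("models/your-7b-model.gguf", n_threads=8)
# print(f"{measure_tokens_per_sec(gen, 'Explain memory bandwidth briefly.'):.1f} tokens/sec")
```

Run the measurement a few times and discard the first result, since the initial run also pays the cost of loading the model into memory.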