Apple Silicon Deep Dive — Hardware Planning for Local AI (Chapter 8)

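Apple Silicon provides compelling performance for local AI through tight hardware-software integration. Understanding the unified memory architecture is essential, because memory capacity determines which models fit and memory bandwidth largely determines how fast they generate tokens.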

Current Apple Silicon Lineup

Chip	CPU Cores	GPU Cores	Unified Memory	Neural Engine
M2	8 (4P+4E)	8-10	Up to 24GB	16-core
M2 Pro	10-12 (6-8P+4E)	16-19	Up to 32GB	16-core
M3	8 (4P+4E)	8-10	Up to 24GB	16-core
M3 Pro	11-12 (5-6P+6E)	14-18	Up to 36GB	16-core
M3 Max	14-16 (10-12P+4E)	30-40	Up to 128GB	16-core
M4	8-10 (4P+4-6E)	8-10	Up to 32GB	16-core (38 TOPS)
M4 Pro	12-14 (8-10P+4E)	16-20	Up to 64GB	16-core (38 TOPS)
M4 Max	14-16 (10-12P+4E)	32-40	Up to 128GB	16-core (38 TOPS)

Unified Memory Architecture

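Unlike systems with a discrete NVIDIA GPU, where VRAM is separate from system RAM, Apple Silicon shares a single memory pool between the CPU, GPU, and Neural Engine. Bandwidth is set by the chip tier (the width of its memory bus), not by the amount of memory installed, and the same pool serves both CPU and GPU:

Chip	Memory Bandwidth
M2	100 GB/s
M2 Pro	200 GB/s
M3	100 GB/s
M3 Pro	150 GB/s
M3 Max	300-400 GB/s
M4	120 GB/s
M4 Pro	273 GB/s
M4 Max	410-546 GB/s

The M3 Max 128GB configuration pairs 400 GB/s of bandwidth with a memory pool far larger than any single consumer discrete GPU offers, although data center GPUs with HBM still deliver several times more bandwidth.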
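Bandwidth matters because generating each token requires reading essentially all of the model weights from memory, so dividing bandwidth by model size gives a rough ceiling on generation speed. The sketch below illustrates that rule of thumb. The function name, the 4.7 GB model size (roughly an 8B model at 4-bit), and the two bandwidth inputs are illustrative assumptions, and real throughput lands below this ceiling.

```python
def max_decode_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on generation speed for a memory-bound model.

    Assumes every generated token reads all weights once and ignores
    KV-cache traffic, compute limits, and scheduling overhead.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    assumed_model_gb = 4.7  # assumption: ~8B parameters at 4-bit quantization
    for bandwidth in (100.0, 400.0):  # illustrative bandwidth inputs in GB/s
        ceiling = max_decode_tokens_per_sec(bandwidth, assumed_model_gb)
        print(f"{bandwidth:.0f} GB/s -> at most ~{ceiling:.0f} tokens/sec")
```

Treat the result as a sanity check on measured numbers rather than a prediction: a benchmark far above the ceiling usually means the model size or quantization was different from what was assumed.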

Performance Benchmarks

Running Llama 3 8B via llama.cpp with the Metal backend:

Device	Backend	Tokens/sec
M2 MacBook Air 24GB	Metal	18-22
M3 Pro MacBook Pro 36GB	Metal	35-40
M3 Max MacBook Pro 128GB	Metal	75-85
M4 Max MacBook Pro 128GB	Metal	95-110
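Tokens/sec figures like those above depend on how they are measured. As a minimal sketch of one way to time generation, the helper below counts tokens from any streaming generator and averages over several runs after a warm-up. The names `measure_tokens_per_second` and `fake_generate` are hypothetical stand-ins, not part of llama.cpp or any other library, and the timing includes prompt processing if the generator performs it.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(
    generate: Callable[[], Iterable[object]],
    warmup_runs: int = 1,
    timed_runs: int = 3,
) -> float:
    """Average tokens/sec over several runs of a token-yielding callable."""
    # Warm-up runs let caches and lazily loaded weights settle before timing.
    for _ in range(warmup_runs):
        for _ in generate():
            pass

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(timed_runs):
        start = time.perf_counter()
        total_tokens += sum(1 for _ in generate())
        total_seconds += time.perf_counter() - start

    if total_tokens == 0 or total_seconds <= 0.0:
        raise ValueError("generator produced no tokens to time")
    return total_tokens / total_seconds


if __name__ == "__main__":
    def fake_generate():
        # Hypothetical stand-in for a real model's streaming output.
        for _ in range(50):
            time.sleep(0.002)
            yield "token"

    print(f"{measure_tokens_per_second(fake_generate):.1f} tokens/sec")
```

Reporting a range, as the table does, is reasonable because thermal state, context length, and background load all shift the result between runs.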

GPU Cores and AI Workloads

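GPU core count matters less than memory for token generation, which is usually limited by memory capacity and bandwidth rather than raw compute; additional GPU cores mainly speed up prompt processing. Approximate memory requirements:

7B models: 14-16 GB at FP16, about 4-5 GB at INT4
13B models: about 26 GB at FP16, about 7-9 GB at INT4
24GB or more of unified memory leaves room for a 13B INT4 or 7B FP16 model plus the KV cache and macOS itself

M2 (10-core GPU, up to 24GB) is constrained for larger models. M3 Pro and above provide better headroom.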
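The memory figures follow from simple arithmetic: parameter count times bits per weight, divided by eight, gives the size of the weights in bytes. The sketch below works that through and adds a crude fit check. The function names, the 2 GB overhead allowance for KV cache and runtime buffers, and the 0.75 usable fraction of unified memory are all assumptions chosen for illustration, not measured values.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(
    params_billion: float,
    bits_per_weight: float,
    unified_memory_gb: float,
    overhead_gb: float = 2.0,       # assumption: KV cache and runtime buffers
    usable_fraction: float = 0.75,  # assumption: share of memory usable by the model
) -> bool:
    """Crude check of whether weights plus overhead fit in the usable share."""
    required = estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return required <= unified_memory_gb * usable_fraction


if __name__ == "__main__":
    for params, bits in ((7, 16), (7, 4), (13, 16), (13, 4)):
        size = estimate_weights_gb(params, bits)
        ok = fits_in_memory(params, bits, unified_memory_gb=24)
        print(f"{params}B at {bits}-bit: {size:.1f} GB weights, fits in 24GB: {ok}")
```

Longer context windows increase the KV cache well beyond the default allowance here, so the overhead parameter should be raised when planning for long prompts.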

Power Efficiency

Running Mistral 7B on battery:

M2 MacBook Air: 8W average, 4-6 hours
M3 MacBook Pro 14": 12W average, 8-10 hours

An equivalent NVIDIA laptop would require 50W or more for similar performance.