Apple Silicon Architecture — Local AI on macOS (Chapter 2)

Apple Silicon chips in the M-series share a system-on-chip design that directly affects AI performance. The CPU cores, GPU cores, and Neural Engine sit on the same die, and the unified memory sits in the same package, addressable by all of them. Data does not have to be copied across a PCIe bus between VRAM and system RAM; the CPU and GPU read the same memory, at bandwidths ranging from roughly 70 GB/s on the base M1 to about 800 GB/s on Ultra-class chips. This is the architectural advantage.

Unified memory comes in fixed configurations that depend on the chip, for example 8 GB, 16 GB, 24 GB, 32 GB, 36 GB, 48 GB, 64 GB, 96 GB, and 128 GB or more on the largest chips. You cannot add more after purchase. A 7B-parameter model in 4-bit quantization requires approximately 4–5 GB of RAM to load. That leaves 3–4 GB on an 8 GB machine for the OS, the runtime, and the context window. On paper it works. In practice, macOS will start swapping to disk and performance will crater. The working set is not just the model weights: it is weights plus KV cache plus runtime overhead.
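
The arithmetic behind that working set can be sketched in a few lines of shell. This is a rough estimate under stated assumptions, not a measurement: the layer count, KV head count, and head dimension are assumed values matching a Llama-2-7B-style model, the KV cache is assumed to be stored in fp16, and the 1 GB runtime overhead is a placeholder. The function name estimate_working_set is hypothetical.

```shell
#!/bin/sh
# Rough working-set estimate in GB (decimal): weights + KV cache + overhead.
# Arguments (all values are assumptions supplied by the caller):
#   $1 parameters in billions   $2 bits per weight
#   $3 transformer layers       $4 KV heads      $5 head dimension
#   $6 context length (tokens)  $7 runtime overhead in GB
estimate_working_set() {
  awk -v p="$1" -v b="$2" -v l="$3" -v h="$4" -v d="$5" -v c="$6" -v o="$7" 'BEGIN {
    weights = p * b / 8                # billions of params * bits / 8 = GB
    kv = 2 * l * h * d * c * 2 / 1e9   # K and V tensors, 2 bytes each (fp16)
    printf "weights=%.2f kv=%.2f total=%.2f\n", weights, kv, weights + kv + o
  }'
}

# Assumed 7B-class model: 32 layers, 32 KV heads, head dim 128, 4096-token context
estimate_working_set 7 4 32 32 128 4096 1
```

With these assumptions the script prints weights=3.50 kv=2.15 total=6.65 (in GB), and that is before macOS takes its own share. Real 4-bit formats store scaling factors alongside the weights, so actual model files run somewhat larger than the 3.5 GB floor.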

The Neural Engine is designed for fixed-shape inference workloads like image classification and does not accelerate transformer autoregressive decoding efficiently. GPU cores do that work. An M2 Ultra has up to 76 GPU cores; an M4 Max has up to 40. The GPU is where your tokens-per-second count lives.

# Check your chip and memory
system_profiler SPHardwareDataType | grep -E "Chip|Memory"
# Example output:
# Chip: Apple M3 Max
# Memory: 64 GB

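# Check chip name, CPU core count, memory size, and GPU core count
sysctl -n machdep.cpu.brand_string    # chip name, e.g. Apple M3 Max
sysctl -n hw.ncpu                     # total CPU cores
sysctl -n hw.memsize                  # unified memory in bytes
system_profiler SPDisplaysDataType | grep "Total Number of Cores"    # GPU cores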

A real failure mode: a base M1 (8 GB) with a 7B model loaded and a 4096-token context can use 7–9 GB of effective memory. macOS will not kill the process; it will start compressing memory and swapping to disk. You will get around 2 tokens/s while the GPU reports 100% utilization and wonder why. The answer is memory pressure.
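
One way to confirm that swapping is the culprit is to look at how much swap is in use while the model is generating. The sketch below parses the output of `sysctl vm.swapusage`; the helper name swap_used_mb is hypothetical, and the input format shown in the comment is an assumption about what the command prints, so check it against the output on your own machine.

```shell
#!/bin/sh
# Extract the "used" swap figure (in MB) from a `sysctl vm.swapusage` line.
# Assumed input format:
#   vm.swapusage: total = 2048.00M  used = 1024.50M  free = 1023.50M  (encrypted)
swap_used_mb() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i == "used") { v = $(i + 2); sub(/M$/, "", v); print v }
  }'
}

# Only query the live value on macOS; elsewhere the key does not exist.
if [ "$(uname)" = "Darwin" ]; then
  sysctl vm.swapusage | swap_used_mb
fi
```

If the number climbs while tokens are being generated, the working set no longer fits in memory, and a smaller model, a more aggressive quantization, or a shorter context is the likely fix.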