07. Unified Memory Explained

Chapter 7 of 15 · 15 min

Unified memory is the defining architectural difference between Apple Silicon and traditional x86_64 systems. In a standard PC, the CPU has its own RAM and the GPU has its own VRAM. Data moves between them across the PCIe bus, which is fast by historical standards but slow compared to accessing memory that sits in the same package as the processor.

Apple Silicon puts everything in one package. The CPU, GPU, and Neural Engine sit on the system-on-chip, the RAM is mounted alongside it, and all of them share the same physical memory. When the GPU needs model weights, it reads them directly from that memory at on-package bandwidth: roughly 100 GB/s on base chips, rising to several hundred GB/s on Max and Ultra parts. There is no PCIe bottleneck and no duplication of data between system RAM and video RAM.

The consequence for AI: you can load a 13B-parameter model in 4-bit quantization (roughly 7 to 8 GB of weights) and the GPU reads those weights in place, without the PCIe transfer penalty that a discrete CUDA GPU pays when part of the model has to live in system RAM. This is why a MacBook Pro with an M3 Pro and 36 GB of unified memory can run a 13B model faster than a gaming PC with a discrete RTX 4060 and 8 GB of VRAM. The RTX has faster raw GPU compute, but it cannot hold the full model in VRAM and must fall back on more aggressive quantization or on offloading.
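To make the bandwidth argument concrete, here is a rough sketch of how long a single pass over the weights takes at different link speeds. The `stream_ms` helper is hypothetical, and the bandwidth figures are nominal peak values assumed for illustration (about 16 GB/s for the PCIe 4.0 x8 link on an RTX 4060, about 150 GB/s for M3 Pro unified memory), not measurements.

```shell
# stream_ms SIZE_GB BANDWIDTH_GB_PER_S
# Prints the milliseconds needed to read SIZE_GB once at the given bandwidth.
stream_ms() {
  awk -v size="$1" -v bw="$2" 'BEGIN { printf "%.1f\n", size / bw * 1000 }'
}

# Assumed figures: 7 GB of weights, nominal peak bandwidths.
stream_ms 7 16     # weights pulled across PCIe from system RAM
stream_ms 7 150    # weights read in place from unified memory
```

With these assumptions the PCIe pass takes 437.5 ms and the unified-memory pass 46.7 ms. Real throughput depends on much more than peak bandwidth, so treat the ratio, not the absolute numbers, as the point.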

The failure mode is simple: you run out of RAM. Unified memory is a single fixed pool of physical memory, and there is no separate VRAM to fall back on. When your model weights plus context plus OS overhead exceed total RAM, macOS starts compressing memory and swapping to SSD. SSD swap on Apple Silicon is fast (NVMe speeds, a few GB/s), but that is still more than an order of magnitude below unified memory bandwidth. Performance collapses.
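One way to see whether a run has tipped into swap is to read `vm.swapusage` through `sysctl` on macOS. The sketch below is a minimal example: `swap_used_mb` is a hypothetical helper, and the output format it parses (`total = ...M  used = ...M  free = ...M`) is an assumption about what `sysctl -n vm.swapusage` prints on your system.

```shell
# swap_used_mb "TEXT"
# Extracts the "used" figure (in MB) from text shaped like the output of
# `sysctl -n vm.swapusage`. The format is assumed, not guaranteed.
swap_used_mb() {
  printf '%s\n' "$1" | sed -n 's/.*used = \([0-9.]*\)M.*/\1/p'
}

# Only query the live value on macOS; on other systems this does nothing.
if [ "$(uname)" = "Darwin" ]; then
  swap_used_mb "$(sysctl -n vm.swapusage)"
fi
```

Checking the value before and after loading a model shows whether the load itself pushed the machine into swap.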

Calculate your headroom:

# Get total RAM in bytes
sysctl -n hw.memsize
# e.g., 34359738368 = 32 GB

# Estimate model memory usage (conservative planning figures; actual file sizes vary by model):
# Q4_K_M: ~0.74 bytes per parameter
# Q8_0: ~1.25 bytes per parameter
# FP16: ~2 bytes per parameter

# For a 7B model at Q4_K_M: 7e9 * 0.74 / 1e9 = ~5.2 GB
# Add ~500 MB for context (KV cache at 2048 tokens; the exact size depends on the model's layer count and attention layout)
# Add 200 MB for runtime overhead

# Total: ~5.9 GB for a 7B Q4_K_M model with small context

On a 16 GB machine, a 7B Q4_K_M model uses ~6 GB, leaving 10 GB for the OS and other work. This is fine. On an 8 GB machine, 6 GB for the model leaves 2 GB for everything else. This is marginal: any growth in context will push you into swap.
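The arithmetic above can be wrapped in a few shell functions so you can try other model sizes. This is a sketch under the chapter's own assumptions: the function names are hypothetical, and the 0.5 GB context and 0.2 GB runtime allowances are the flat estimates used above, not measured values.

```shell
# ram_gb BYTES
# Converts the byte count from `sysctl -n hw.memsize` to whole GB (2^30 bytes).
ram_gb() {
  awk -v b="$1" 'BEGIN { printf "%.0f\n", b / 1073741824 }'
}

# model_total_gb PARAMS_BILLIONS BYTES_PER_PARAM
# Weights plus the flat allowances: 0.5 GB context, 0.2 GB runtime.
model_total_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b + 0.5 + 0.2 }'
}

# headroom_gb TOTAL_RAM_GB PARAMS_BILLIONS BYTES_PER_PARAM
# RAM left over for the OS and other applications.
headroom_gb() {
  awk -v ram="$1" -v p="$2" -v b="$3" \
    'BEGIN { printf "%.1f\n", ram - (p * b + 0.5 + 0.2) }'
}

ram_gb 34359738368       # the example value from above
model_total_gb 7 0.74    # 7B at Q4_K_M
headroom_gb 16 7 0.74    # 16 GB machine
headroom_gb 8 7 0.74     # 8 GB machine
```

This prints 32, 5.9, 10.1, and 2.1, matching the figures in the text. On a Mac you can feed the live value in with `ram_gb "$(sysctl -n hw.memsize)"`.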

EXERCISE

Calculate the largest model, at Q4_K_M, that your machine can reasonably run with a 2048-token context. Formula: max_params = (total_RAM_GB - 4) / 0.74, where the result is in billions of parameters and the 4 GB is reserved for OS overhead. Load a model at or below that size and confirm that it runs without memory pressure warnings in the system log (Activity Monitor's Memory Pressure graph is another place to look).
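If you would rather not do the division by hand, the exercise formula can be sketched as a one-line shell function. The name `max_params_b` is hypothetical; the 4 GB reserve and the 0.74 bytes-per-parameter figure come straight from the formula above.

```shell
# max_params_b TOTAL_RAM_GB
# Applies (total_RAM_GB - 4) / 0.74 and prints the result
# in billions of parameters.
max_params_b() {
  awk -v ram="$1" 'BEGIN { printf "%.1f\n", (ram - 4) / 0.74 }'
}

max_params_b 16    # 16 GB machine
max_params_b 32    # 32 GB machine
```

This prints 16.2 and 37.8. Treat the result as a ceiling to test against rather than a guarantee, since the formula leaves no explicit allowance for context growth.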