04. Metal GPU Acceleration

Chapter 4 of 15 · 20 min

Metal is Apple's GPU framework and the path that llama.cpp and Ollama use for GPU-accelerated inference on Apple Silicon. Without Metal, your GPU sits idle and the CPU handles everything. With Metal, the GPU cores take on the bulk of the inference work. The Neural Engine is not part of this path: it is reached through Core ML, not through Metal.

Check that your GPU supports Metal:

# List GPUs along with their Metal support level
system_profiler SPDisplaysDataType

# Shorter check: show only the Metal-related lines
# (the label is "Metal Support" or "Metal Family", depending on the macOS version)
system_profiler SPDisplaysDataType 2>/dev/null | grep -i -A3 "Metal"
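
For use in a setup script, the same check can be wrapped in a small function. This is a minimal sketch, assuming the display report contains a line mentioning "Metal" (the exact label varies across macOS versions). The function name has_metal_line is hypothetical, and it reads the report from standard input so it can also be tried on saved output.

```shell
# has_metal_line: succeed if the text on stdin contains a line mentioning Metal.
# Assumption: system_profiler labels the line "Metal Support" or "Metal Family",
# depending on the macOS version, so the match is case-insensitive on "metal".
has_metal_line() {
    grep -qi 'metal'
}

# Only query the system when the tool exists (it is macOS-only).
if command -v system_profiler >/dev/null 2>&1; then
    if system_profiler SPDisplaysDataType 2>/dev/null | has_metal_line; then
        echo "Metal-capable GPU reported"
    else
        echo "No Metal line found in the display report"
    fi
fi
```

A match only shows that the hardware and OS support Metal. It does not show that an inference runtime is actually using it.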

Ollama enables Metal automatically when the model file supports it and the device has GPU capability. You can verify the GPU is engaged during inference:

# During inference, open a new terminal tab and run:
sudo powermetrics --samplers gpu_power -i 1000 -n 1

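powermetrics requires sudo and prints a GPU section that includes the active residency, the share of the sample window during which the GPU was busy. A residency of roughly 0 to 5% while tokens are being generated suggests that Metal is not routing work to the GPU.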
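
To pull just the utilization figure out of that output, a small filter helps. This is a sketch, assuming powermetrics prints a line of the form "GPU HW active residency: 42.10% (...)" or "GPU active residency: ...". The label differs between macOS releases, and the function name gpu_residency is hypothetical.

```shell
# gpu_residency: print the GPU active-residency percentage found on stdin.
# Assumption: powermetrics prints a line such as
#   "GPU HW active residency:  42.10% (...)"   or   "GPU active residency: ..."
# Both labels are matched; only the first matching line is reported.
gpu_residency() {
    sed -n -E 's/^[[:space:]]*GPU (HW )?active residency:[[:space:]]*([0-9.]+)%.*/\2/p' | head -n 1
}
```

On a Mac, usage would look like `sudo powermetrics --samplers gpu_power -i 1000 -n 1 | gpu_residency`, run while a prompt is being processed. If nothing is printed, the label on your macOS version is different and the pattern needs adjusting.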

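For llama.cpp, Metal is controlled by the GGML_METAL CMake option, which is on by default when building on macOS:

# If building llama.cpp from source on macOS
# (-DGGML_METAL=ON is the default there; passing it just makes the choice explicit)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release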

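The compiled binary detects and uses the Metal device automatically. How much of the model runs on the GPU is set at run time with -ngl (--n-gpu-layers); -ngl 0 keeps everything on the CPU. The GGML_METAL_DEVICE_TYPES variable for choosing a device is worth verifying against your llama.cpp version before you rely on it, though on a MacBook there is only one GPU anyway.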
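
One way to confirm that the binary really placed the model on the GPU is to look at its load log. This is a sketch, assuming the log contains a line ending in "offloaded N/M layers to GPU" (the prefix in front of it has changed between llama.cpp versions). The function name offloaded_layers is hypothetical.

```shell
# offloaded_layers: print "N/M" from a llama.cpp load log read on stdin.
# Assumption: the log contains a line with the text
#   "offloaded N/M layers to GPU"
# Only the first such line is reported.
offloaded_layers() {
    sed -n -E 's#.*offloaded ([0-9]+/[0-9]+) layers to GPU.*#\1#p' | head -n 1
}
```

With a build in ./build, usage would look like `./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello" 2>&1 | offloaded_layers`, where model.gguf is a placeholder path. A result such as 0/33 means the model stayed on the CPU.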

Real failure mode: "Metal device not found" errors. This happens when you run inside a Docker container on macOS, because Docker Desktop runs a Linux VM and that VM has no access to Metal. GPU passthrough is not supported on Docker Desktop for Mac. This is covered in Chapter 11.
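
A launch script can catch this case early by checking which kernel it is running on. The sketch below assumes that a non-Darwin kernel (for example, Linux inside a container) means Metal is out of reach. The function name metal_possible is hypothetical, and its optional argument exists only so the logic can be exercised on any machine.

```shell
# metal_possible: succeed only when the kernel name is Darwin (macOS).
# With no argument the function asks uname; an explicit kernel name
# can be passed to test the logic on any machine.
metal_possible() {
    kernel="${1:-$(uname -s)}"
    [ "$kernel" = "Darwin" ]
}

if metal_possible; then
    echo "Running on macOS: Metal may be available"
else
    echo "Not macOS (for example, a Linux container): Metal is unavailable"
fi
```

Inside a Docker Desktop container, uname -s reports Linux, so the second branch runs even though the host is a Mac.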

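Another failure: the model file itself. Model files are not compiled for Metal; what matters is whether your runtime can read the file format and whether its Metal backend supports the file's quantization type. Older downloads, including files in the pre-GGUF format, may fail to load or fall back to the CPU. Re-download a current GGUF build of the model, or use a widely supported quantization scheme like Q4_K_M.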
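
Before re-downloading, it can help to check whether the file is in the current GGUF container format at all. GGUF files begin with the four ASCII bytes "GGUF". The sketch below tests only that header, and the function name is_gguf is hypothetical.

```shell
# is_gguf: succeed if the file's first four bytes are the ASCII magic "GGUF".
# A missing or unreadable file counts as "not GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}
```

Usage would look like `is_gguf model.gguf && echo "GGUF file"`, where model.gguf is a placeholder path. Passing this check says nothing about whether the quantization type inside the file is supported on the GPU. It only rules out an outdated container format.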

EXERCISE

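Run a model via Ollama, open Activity Monitor, and sort by the % GPU column (or open Window > GPU History). Confirm that GPU usage rises above 0% during active inference. If it stays at 0%, investigate whether your runtime build has Metal support and whether the model was actually loaded onto the GPU.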
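
As a command-line cross-check for this exercise, `ollama ps` lists the loaded models together with where they are running. The sketch below classifies that output. It assumes the PROCESSOR column holds values such as "100% GPU", "100% CPU", or a split like "48%/52% CPU/GPU", and the function name ollama_processor is hypothetical.

```shell
# ollama_processor: classify each model line of `ollama ps` output read on stdin.
# Assumption: the first line is a header, and each following line contains
# "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU".
ollama_processor() {
    awk 'NR > 1 && NF {
        if ($0 ~ /CPU\/GPU/)    print $1 ": split between CPU and GPU"
        else if ($0 ~ /% GPU/)  print $1 ": GPU"
        else if ($0 ~ /% CPU/)  print $1 ": CPU only"
        else                    print $1 ": unknown"
    }'
}
```

While a model is loaded, usage would look like `ollama ps | ollama_processor`. A "CPU only" result alongside 0% in Activity Monitor points to the runtime, not the monitoring tool.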