RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated

Metal (Apple)

Metal is Apple's low-level GPU programming framework and API, comparable to Vulkan or Direct3D 12 on other platforms. For local AI operators, Metal enables GPU-accelerated inference on Apple Silicon (M-series) Macs and on Intel Macs with AMD GPUs. It is the backend that llama.cpp, MLX, and other runtimes use to offload model computation to the GPU, which directly affects tokens per second and the largest model that fits in GPU-accessible memory (unified memory on Apple Silicon, dedicated VRAM on AMD cards).

Deeper dive

Metal provides direct access to the GPU's compute units, so the core neural network operations (matrix multiplications, attention) run as GPU compute kernels. On Apple Silicon, Metal works on top of the unified memory architecture: the CPU and GPU share one pool of RAM (e.g., 16 GB or 64 GB). This removes the PCIe transfer bottleneck seen with discrete GPUs, but it also means there is no separate VRAM. The ceiling is total system RAM minus whatever macOS and other applications need, and by default macOS lets the GPU use only part of the pool. Metal Performance Shaders (MPS) offer optimized kernels for common ML operations. In practice, llama.cpp's Metal backend runs model layers through its own Metal compute kernels, reaching roughly 30-50 tok/s for quantized 7B-13B models on M1/M2/M3 Max and Ultra chips, depending on quantization and context length. Operators building from source must make sure Metal support is included: older llama.cpp builds required LLAMA_METAL=1, while current builds enable Metal by default on macOS.
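The unified-memory constraint described above can be turned into a quick fit check. The following is a minimal sketch, not a measurement tool: the function names, the 0.7 GPU-usable fraction, the 1.5 GB overhead for KV cache and buffers, and the 4.5 bits-per-weight figure for a Q4-style quantization are all illustrative assumptions that should be replaced with values observed on the actual machine.

```python
def estimate_model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    bits_per_weight is an assumption supplied by the caller; quantized
    formats store slightly more than their nominal bit width.
    """
    return params_billion * bits_per_weight / 8


def fits_in_unified_memory(
    model_gb: float,
    total_ram_gb: float,
    gpu_fraction: float = 0.7,   # ASSUMPTION: share of RAM the GPU may use
    overhead_gb: float = 1.5,    # ASSUMPTION: KV cache and scratch buffers
) -> bool:
    """Return True if the model plus overhead fits in the GPU budget."""
    gpu_budget_gb = total_ram_gb * gpu_fraction
    return model_gb + overhead_gb <= gpu_budget_gb


if __name__ == "__main__":
    # Rough size of a 13B model at an assumed 4.5 bits per weight.
    print(f"13B @ 4.5 bpw: {estimate_model_gb(13, 4.5):.2f} GB")
    # Model sizes below are the ones quoted in this entry.
    print("8 GB model on 24 GB Mac:", fits_in_unified_memory(8, 24))
    print("40 GB model on 128 GB Mac:", fits_in_unified_memory(40, 128))
    print("40 GB model on 48 GB Mac:", fits_in_unified_memory(40, 48))
```

Keeping the GPU fraction and overhead as parameters makes the guesswork explicit: long contexts raise the overhead, and the usable fraction differs between machines.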

Practical example

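On an M2 MacBook Air with 24 GB unified memory, running llama.cpp with Metal enabled (./main -m model.gguf -ngl 99; the binary is named llama-cli in current builds) loads all layers onto the GPU. A 13B Q4 model (8 GB) fits entirely. Token generation is bound by memory bandwidth: at the M2's 100 GB/s, reading 8 GB of weights for every token means throughput cannot exceed about 12 tok/s. Without Metal (CPU-only), the same model runs at ~5 tok/s. On an M1 Ultra with 128 GB, a 70B Q4 model (40 GB) fits, achieving ~10 tok/s.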
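A rough ceiling on generation speed follows from memory bandwidth, because every generated token requires reading all model weights once. The sketch below illustrates that arithmetic as a sanity check for quoted figures. The function name is hypothetical, and the bandwidth values (100 GB/s for M2, 800 GB/s for M1 Ultra) are Apple's published specifications rather than figures from this entry; real throughput sits below the ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Idealized upper bound on generation speed.

    Assumes each token requires one full read of the weights and ignores
    KV-cache traffic, compute time, and scheduling overhead.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # ASSUMPTION: published peak memory bandwidth figures, in GB/s.
    m2_bandwidth = 100.0
    m1_ultra_bandwidth = 800.0

    # Model sizes are the ones quoted in this entry.
    print("M2, 8 GB model:", max_tokens_per_second(m2_bandwidth, 8.0))
    print("M1 Ultra, 40 GB model:", max_tokens_per_second(m1_ultra_bandwidth, 40.0))
```

The bound explains why Max and Ultra chips generate tokens faster than base chips with the same model: they have several times the memory bandwidth.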

Workflow example

When building llama.cpp from source, older releases required LLAMA_METAL=1 to enable the Metal backend; current releases enable it by default on macOS (the CMake option is GGML_METAL). At runtime, the -ngl N flag offloads N layers to the GPU. In LM Studio, selecting "Metal" as the backend in settings activates GPU acceleration. In MLX (Apple's ML framework), Metal is used automatically. Operators can verify Metal usage by checking for "ggml_metal_init" in the startup logs or by monitoring GPU utilization in Activity Monitor.
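The runtime steps above (pass -ngl, then look for "ggml_metal_init" in the logs) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the binary path ./main matches the command quoted in this entry and may differ in a given build, and the sample log text is a placeholder, not real llama.cpp output.

```python
from typing import List


def build_command(model_path: str, gpu_layers: int, binary: str = "./main") -> List[str]:
    """Assemble the llama.cpp invocation described in this entry.

    ASSUMPTION: the binary name depends on the build; adjust as needed.
    """
    return [binary, "-m", model_path, "-ngl", str(gpu_layers)]


def metal_enabled(log_text: str, marker: str = "ggml_metal_init") -> bool:
    """Return True if any log line mentions the Metal init marker."""
    return any(marker in line for line in log_text.splitlines())


if __name__ == "__main__":
    print(build_command("model.gguf", 99))

    # Placeholder log text for illustration only (not real output).
    sample_log = "loading model\nggml_metal_init: <details elided>\n"
    print("Metal detected:", metal_enabled(sample_log))
    print("Metal detected:", metal_enabled("loading model\n"))
```

In practice the log text would come from the captured output of the process started with the assembled command, for example via subprocess.run with output capture enabled.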

Reviewed by Fredoline Eruo. See our editorial policy.
