COURSE · FND · B008
Local AI on macOS
Learn local AI on macOS through RunLocalAI's practical lens: macOS, Apple Silicon, Metal and MLX, hardware fit, runtime settings, verification habits, and local-vs-cloud tradeoffs.
PREREQUISITES
- B001
- B003
Course B008: Local AI on macOS
Why this course exists
Apple Silicon puts the CPU and GPU on one chip and gives both direct access to a single pool of unified memory in the same package. This gives macOS a real performance advantage for local AI that most people leave on the table. The catch: the tooling ecosystem is fragmented and the default settings are rarely optimal. This course gets you from "it runs" to "it's actually fast."
What you will be able to do after
- Enable Metal GPU acceleration and verify it is actually active
- Deploy and tune Ollama, LM Studio, and MLX-native models on Apple Silicon
- Calculate correct model sizes for your hardware given unified memory constraints
- Diagnose common macOS AI failures using Activity Monitor and logs
- Chain local AI tools together into end-to-end workflows
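The sizing outcome above comes down to simple arithmetic. The following is a minimal sketch, not the course's own method: it assumes weight size is roughly parameters × bits per weight / 8, and the context and OS overhead figures (2 GB and 6 GB) are illustrative placeholders you should replace with your own measurements. The function names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits_in_unified_memory(
    total_ram_gb: float,
    params_billion: float,
    bits_per_weight: float,
    context_gb: float = 2.0,      # assumed placeholder for context memory
    os_overhead_gb: float = 6.0,  # assumed placeholder for macOS and other apps
) -> tuple[bool, float]:
    """Return (fits, needed_gb) for weights + context + OS overhead."""
    needed_gb = (
        estimate_weights_gb(params_billion, bits_per_weight)
        + context_gb
        + os_overhead_gb
    )
    return needed_gb <= total_ram_gb, needed_gb


# An 8B model at 4 bits per weight on a 16 GB Mac.
print(fits_in_unified_memory(16, 8, 4))
# A 70B model at 4 bits per weight on a 32 GB Mac.
print(fits_in_unified_memory(32, 70, 4))
```

Under these assumptions the 8B model needs about 12 GB and fits in 16 GB, while the 70B model needs about 43 GB and does not fit in 32 GB.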
CHAPTERS
- 01 · macOS AI Landscape (10 min): macOS has three AI stacks (Metal, llama.cpp, and MLX), each with different performance characteristics and compatibility.
- 02 · Apple Silicon Architecture (15 min): Unified memory is fixed and shared, so model weights plus context plus OS overhead must fit in your total RAM or performance collapses.
- 03 · Ollama on macOS (20 min): Ollama's CLI and API server are separate processes. If the CLI works but the API fails, start `ollama serve` explicitly.
- 04 · Metal GPU Acceleration (20 min): Metal GPU acceleration is automatic in most runtimes, but you must verify it is actually engaged because silent fallback to CPU is common.
- 05 · MLX Framework (15 min): MLX can achieve 2–4× better throughput than llama.cpp on Apple Silicon because it was built for the architecture rather than ported to it.
- 06 · MLX vs Ollama vs llama.cpp (15 min): Use Ollama to experiment, MLX for production throughput on Apple Silicon, and llama.cpp when you need quantization formats the other two do not support.
- 07 · Unified Memory Explained (15 min): Unified memory eliminates the PCIe bottleneck but not the capacity constraint. Your model, context, and OS all share one fixed RAM pool.
- 08 · Mac Model Selection Guide (15 min): Model selection on Apple Silicon is a RAM calculation first and a capability match second. Pick the largest model that fits in your available memory with headroom for the context window.
- 09 · Performance Tuning (15 min): Reducing the context window is the single highest-leverage tuning step on Apple Silicon, because smaller contexts use less memory and run faster.
- 10 · Activity Monitor for AI (15 min): Activity Monitor's GPU History window tells you in 5 seconds whether Metal is active. If GPU usage stays under 10% during inference, Metal is not being used.
- 11 · Running Docker on Mac (15 min): Docker Desktop on macOS cannot pass Metal through to containers, so GPU inference must run on the host OS, not in a container.
- 12 · LM Studio on macOS (15 min): LM Studio provides the fastest GUI path to a running local AI server, with real-time performance metrics visible during inference.
- 13 · Open WebUI on macOS (20 min): Open WebUI connects to Ollama's API, so Ollama must be running and accessible on port 11434 before Open WebUI can serve models.
- 14 · Troubleshooting macOS AI (15 min): Most macOS AI failures trace to three causes: Metal not active, memory exhausted, or the server process not running. Check those three before anything else.
- 15 · macOS AI Workflows (20 min): The tools work best as a pipeline, with Ollama as the runtime, LM Studio or Open WebUI as the interface, and batch scripts or API wrappers for automation.
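Several chapters depend on the Ollama server actually running on port 11434. As a rough sketch of the "is the server process running" check, the snippet below sends a request to Ollama's `/api/tags` endpoint (which lists locally available models) and reports whether anything answered. The function name `ollama_reachable` and the default timeout are assumptions for illustration, not part of Ollama or of this course.

```python
import http.client
import urllib.request


def ollama_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP 200 comes back from <base_url>/api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply:
        # treat all of these as "server not reachable".
        return False
```

Calling `ollama_reachable()` before launching Open WebUI or a batch script gives a quick yes/no answer. If it returns False, starting `ollama serve` is the first thing to try.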