COURSE · FND · B008

Local AI on macOS

Learn local AI on macOS through RunLocalAI's practical lens: macOS, Apple Silicon, Metal and MLX, hardware fit, runtime settings, verification habits, and local-vs-cloud tradeoffs.

15 chapters · 5h · Foundations track · By Fredoline Eruo
PREREQUISITES
  • B001
  • B003

Course B008: Local AI on macOS

Why this course exists

Apple Silicon puts the CPU and GPU on one chip, with RAM in the same package, and both processors share a single unified memory pool. This gives macOS a real performance advantage for local AI that most people leave on the table. The catch: the tooling ecosystem is fragmented and the default settings are rarely optimal. This course gets you from "it runs" to "it's actually fast."

What you will know after

  • Enable Metal GPU acceleration and verify it is actually active
  • Deploy and tune Ollama, LM Studio, and MLX-native models on Apple Silicon
  • Calculate correct model sizes for your hardware given unified memory constraints
  • Diagnose common macOS AI failures using Activity Monitor and logs
  • Chain together local AI tools into working workflows
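As a rough illustration of the model-size calculation, the sketch below estimates whether a quantized model fits in unified memory. The function names and the default figures (1 GB of context, 4 GB reserved for the OS) are assumptions chosen for illustration, not values from the course; real usage depends on the runtime, context length, and whatever else is running.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB.

    1e9 parameters at `bits_per_weight` bits each is
    params_billions * bits_per_weight / 8 gigabytes.
    """
    return params_billions * bits_per_weight / 8


def fits_in_memory(
    total_ram_gb: float,
    params_billions: float,
    bits_per_weight: float,
    context_gb: float = 1.0,     # assumed context (KV cache) budget
    os_reserve_gb: float = 4.0,  # assumed headroom for macOS and other apps
) -> bool:
    """Return True if weights plus context fit in RAM left after the OS reserve."""
    needed = weights_gb(params_billions, bits_per_weight) + context_gb
    return needed <= total_ram_gb - os_reserve_gb


if __name__ == "__main__":
    # A 7B model at 4 bits per weight needs about 3.5 GB for weights alone.
    print(weights_gb(7, 4))
    print(fits_in_memory(8, 7, 4))   # tight on an 8 GB machine
    print(fits_in_memory(16, 7, 4))  # comfortable on a 16 GB machine
```

Treat the result as a first filter only: if a model barely fits on paper, expect to lower the context window or pick a smaller quantization.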
CHAPTERS
  1. macOS AI Landscape (10 min). macOS has three AI stacks (Metal, llama.cpp, and MLX), each with different performance characteristics and compatibility.
  2. Apple Silicon Architecture (15 min). Unified memory is fixed and shared: model weights plus context plus OS overhead must fit in your total RAM or performance collapses.
  3. Ollama on macOS (20 min). Ollama's CLI and API server are separate processes; if the CLI works but the API fails, start `ollama serve` explicitly.
  4. Metal GPU Acceleration (20 min). Metal GPU acceleration is automatic in most runtimes, but you must verify it's actually engaged because silent fallback to CPU is common.
  5. MLX Framework (15 min). MLX achieves 2–4× better throughput than llama.cpp on Apple Silicon because it was built for the architecture rather than ported to it.
  6. MLX vs Ollama vs llama.cpp (15 min). Use Ollama to experiment, MLX for production throughput on Apple Silicon, and llama.cpp when you need quantization formats the other two do not support.
  7. Unified Memory Explained (15 min). Unified memory eliminates the PCIe bottleneck but not the capacity constraint: your model, context, and OS all share one fixed RAM pool.
  8. Mac Model Selection Guide (15 min). Model selection on Apple Silicon is a RAM calculation first and a capability match second: pick the largest model that fits in your available memory with headroom for the context window.
  9. Performance Tuning (15 min). Reducing the context window is the single highest-leverage tuning step on Apple Silicon: smaller contexts use less memory and run faster.
  10. Activity Monitor for AI (15 min). Activity Monitor's GPU tab tells you in 5 seconds whether Metal is active; if GPU usage stays under 10% during inference, Metal is not being used.
  11. Running Docker on Mac (15 min). Docker Desktop on macOS cannot pass Metal through to containers, so GPU inference must run on the host OS, not in a container.
  12. LM Studio on macOS (15 min). LM Studio provides the fastest GUI path to a running local AI server, with real-time performance metrics visible during inference.
  13. Open WebUI on macOS (20 min). Open WebUI connects to Ollama's API, so Ollama must be running and accessible on port 11434 before Open WebUI can serve models.
  14. Troubleshooting macOS AI (15 min). Most macOS AI failures trace to three causes: Metal not active, memory exhausted, or the server process not running. Check those three before anything else.
  15. macOS AI Workflows (20 min). The tools work best as a pipeline: Ollama as the runtime, LM Studio or Open WebUI as the interface, and batch scripts or API wrappers for automation.
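
Several chapters above come down to one check: is the server process actually running? A minimal sketch of that check, assuming Ollama is listening on its default port 11434 and answering its model-listing endpoint `/api/tags`. The function name `ollama_is_up` is hypothetical and used only for illustration.

```python
import json
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if a server at base_url answers GET /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    if ollama_is_up():
        print("Ollama API is reachable")
    else:
        print("Ollama API is not reachable; try starting `ollama serve`")
```

Running this before launching Open WebUI or a batch script separates "the server is down" from the other two common causes, Metal not active and memory exhausted.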