MLX-LM
The LLM runner for MLX, Apple's Metal-native ML framework. Now competitive with llama.cpp Metal on M-series silicon, with better long-context performance.
Overview
What MLX-LM actually is
MLX-LM is the canonical Python inference library for Apple Silicon, built on Apple's MLX array framework. It is not a wrapper around llama.cpp — it is a fundamentally different code path with native Metal kernels written by Apple's MLX team, lazy-evaluated computation graphs, and unified-memory awareness baked into the design.
For Apple Silicon, MLX-LM is the highest-throughput first-party inference path in 2026. It is reported to outperform llama.cpp's Metal backend by 15-35 % on M2 / M3 / M4 generations on most decoder-only models, with the gap widest on long context and closer to parity on short prompts. It is also the only path that ships native MLX-4bit / MLX-8bit quantization formats designed specifically for unified-memory bandwidth profiles.
Where it fits in the stack
MLX-LM lives at the engine layer for Apple Silicon, full stop. The stack on macOS:
- Frontend: Open WebUI, LM Studio (LM Studio's MLX backend uses MLX-LM internals), or any OAI-compatible client
- Server: mlx_lm.server exposes an OpenAI-compatible endpoint
- Engine: MLX-LM
- Hardware: Apple Silicon — M1 through M4, including M-series Pro / Max / Ultra and iPad Pro M-series
It is not a Linux engine, not a Windows engine, not a CUDA engine. If your fleet is mixed-OS, llama.cpp is the cross-platform fallback — but on a Mac dev box or a Mac Studio inference workstation, MLX-LM is the right answer.
Best use cases
- Local LLM dev on a MacBook Pro M3 / M4 Max. 32-128 GB unified memory means you can prototype 70B-class models that would not fit on a single 24 GB consumer NVIDIA card.
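- Mac Studio inference workstation. An Ultra-class Mac Studio with 192 GB or more of unified memory (the M3 Ultra is configurable to 256 GB or 512 GB) plus MLX-LM is the cheapest path to running Llama 3.1 70B at FP16 (about 140 GB of weights) anywhere outside a datacenter.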
- Apple Silicon-resident agentic stacks. Pair with the same memory + MCP toolset as /stacks/local-coding-agent but route inference through MLX-LM instead of vLLM.
- Battery-aware inference research. MLX-LM's lazy evaluation and unified-memory model mean it idles cheaply between requests.
OS support
| OS | Quality |
|---|---|
| macOS 14+ (Apple Silicon) | excellent — only supported target |
| macOS 13 (Ventura) | partial — works for older MLX versions; new releases require Sonoma+ |
| Anything else | unsupported |
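The support table above can be turned into a pre-flight check before installing. The following is a minimal sketch using only the Python standard library; the function name `is_supported_host` is hypothetical and not part of MLX-LM. Note that a Python interpreter running under Rosetta reports `x86_64`, so this check would reject it.

```python
import platform


def is_supported_host(system: str, machine: str, mac_version: str) -> bool:
    """Return True for macOS 14+ on Apple Silicon, per the OS support table."""
    if system != "Darwin" or machine != "arm64":
        return False
    try:
        # mac_version looks like "14.5" or "15.0.1"; only the major part matters here.
        major = int(mac_version.split(".")[0])
    except ValueError:
        return False
    return major >= 14


if __name__ == "__main__":
    ok = is_supported_host(
        platform.system(), platform.machine(), platform.mac_ver()[0]
    )
    print("supported MLX-LM target" if ok else "not a supported MLX-LM target")
```

The check takes plain strings rather than calling `platform` internally so it can be exercised on any machine.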
Hardware / backend support
Apple Silicon only — M1, M1 Pro, M1 Max, M1 Ultra, M2 / Pro / Max / Ultra, M3 / Pro / Max / Ultra, M4 / Pro / Max. The performance ladder roughly tracks memory bandwidth, not raw GPU FLOPs:
- M1 / M2 / M3 / M4 (base) — ~100 GB/s; usable for 7B-class
- M-series Pro — ~150-200 GB/s; comfortable for 13B-class
- M-series Max — ~300-400 GB/s; comfortable for 32B-class at 4-bit
- M-series Ultra — ~800 GB/s; the realistic 70B+ tier
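Why the ladder tracks bandwidth can be seen with a back-of-envelope bound: during decode, each generated token has to read roughly all of the model weights from memory once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that rule of thumb to rounded tier figures from the list above; it ignores KV-cache reads, compute time, and runtime overhead, so real throughput will be lower. The function name is hypothetical.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on decode speed: one full pass over the weights per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Rounded bandwidth figures taken from the tiers listed above (GB/s).
TIERS = {"base": 100, "Pro": 200, "Max": 400, "Ultra": 800}

# Assumption: a 4-bit 7B-class checkpoint occupies about 4 GB.
for tier, bandwidth in TIERS.items():
    ceiling = decode_tokens_per_sec_ceiling(bandwidth, weights_gb=4.0)
    print(f"{tier}: at most ~{ceiling:.0f} tokens/s for a 4 GB model")
```

The same function shows why 70B at FP16 (about 140 GB of weights) is slow even on an Ultra: 800 / 140 is under 6 tokens per second before any overhead.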
The Neural Engine on Apple Silicon is not used by MLX-LM — all compute goes through the integrated GPU via Metal. ANE is fixed-function and not addressable for arbitrary transformer kernels.
Model / quant format support
- MLX-4bit / MLX-8bit: native quants; fastest path; conversion via mlx_lm.convert
- FP16 / BF16: full-precision baseline; the 192 GB Ultra makes this realistic for 70B
- GGUF: partial support via conversion; not the recommended path
- AWQ / GPTQ / EXL2: unsupported; these are CUDA-kernel-bound formats
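To see which format fits which machine, weight memory can be estimated as parameters times bits per weight divided by eight. The sketch below is illustrative only: the effective bit widths for the quantized formats (4.5 and 8.5) are assumptions that add per-group scale and bias overhead to the nominal 4 and 8 bits, and the real figure depends on the group size used at conversion. KV cache and activations come on top.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumption: effective bits per weight including quantization metadata.
FORMATS = {
    "FP16/BF16": 16.0,
    "8-bit (assumed 8.5 effective)": 8.5,
    "4-bit (assumed 4.5 effective)": 4.5,
}

for params in (8, 32, 70):
    for name, bits in FORMATS.items():
        print(f"{params}B {name}: {weights_gb(params, bits):.1f} GB")
```

For a 70B model this gives 140 GB at FP16, which is why full precision only becomes realistic on the largest unified-memory configurations, and roughly 39 GB at 4-bit, which still exceeds a 24 GB consumer GPU.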
If you want the format-by-format breakdown across runtimes see /systems/quantization-formats.
Setup path
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit --prompt "Hello"
For an HTTP server:
mlx_lm.server --model mlx-community/Llama-3.1-8B-Instruct-4bit --port 8080
The mlx-community Hugging Face org hosts pre-converted MLX-4bit checkpoints for most popular open-weight models. Conversion of an arbitrary HF safetensors model is one command:
mlx_lm.convert --hf-path meta-llama/Llama-3.1-8B-Instruct -q
What breaks first
- Metal OOM mid-generation. Unified memory swaps to disk silently before refusing; tokens-per-sec collapses to ~1. See /errors/metal-out-of-memory.
- Models without an MLX checkpoint published. Conversion is one command but takes 10-30 min and needs the original HF safetensors local.
- Long-context decode quality on small Macs. The 8 GB / 16 GB base M-series machines run out of KV-cache headroom past ~8K tokens on 7B-class models.
- Pip dependency drift. MLX moves quickly; a working environment can break on a pip install -U. Pin versions in production.
- Unsupported architectures landing late. Brand-new model families (Mamba-2, RWKV-7, novel MoE routers) sometimes arrive in MLX 1-3 months after llama.cpp.
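The KV-cache headroom item in the list above can be made concrete with the standard sizing formula: two tensors (keys and values) per layer, each holding `kv_heads * head_dim` values per token. The sketch below assumes an unquantized 16-bit cache and uses two illustrative model shapes: a Llama-2-7B-style model with full multi-head attention and a Llama-3.1-8B-style model with grouped-query attention. The function name is hypothetical.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache: keys and values for every layer and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

# Illustrative shapes (assumed): 32 layers, head_dim 128, 16-bit cache.
mha_7b = kv_cache_bytes(tokens=8192, layers=32, kv_heads=32, head_dim=128)
gqa_8b = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)

print(f"7B-style MHA at 8K tokens: {mha_7b / GIB:.1f} GiB")
print(f"8B-style GQA at 8K tokens: {gqa_8b / GIB:.1f} GiB")
```

With full multi-head attention the cache alone reaches 4 GiB at 8K tokens, on top of about 4 GB of 4-bit weights, which is consistent with 8 GB machines running out of headroom. Grouped-query attention cuts the cache to a quarter of that.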
Alternatives by intent
| If you want… | Reach for |
|---|---|
| Friendly UX, same-ish performance on Mac | LM Studio (uses MLX under the hood) |
| Cross-platform, GGUF | llama.cpp or Ollama |
| Apple-native fine-tuning | MLX (the framework, not just MLX-LM) |
| CUDA-class throughput | move to RTX 4090 + vLLM |
Best pairings
- Open WebUI — point at the MLX-LM HTTP endpoint
- Continue.dev / Aider — coding-agent workflows on Mac dev machines
- Apple M3 Ultra — the canonical inference-workstation pairing for 70B-class models
- Apple M4 Max — the canonical battery-aware dev pairing
Who should avoid MLX-LM
- Anyone on Linux / Windows. Period.
- Multi-tenant production serving with concurrent users. MLX-LM serves one stream well, not 50.
- Workloads needing AWQ-INT4 fit. Apple Silicon has its own quant story; MLX-4bit ≠ AWQ-INT4 in either format or kernel design.
- Teams that need reproducible Linux builds. The Mac-only target is a real constraint.
Related
- Stacks: /stacks/multi-machine-apple-cluster, /stacks/local-coding-agent
- System guides: /systems/quantization-formats, /setup
- Hardware: Apple M3 Ultra, Apple M4 Max
- Errors: /errors/metal-out-of-memory
Setup guidance
Install on macOS (Apple Silicon only, M1+). Requires Python 3.10+ and macOS 14.0+. MLX-LM uses Apple's MLX framework, which provides Metal GPU acceleration without CUDA:
pip install mlx-lm
Run a model:
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit --prompt "Hello"
For server mode:
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080
The server exposes OpenAI-compatible /v1/chat/completions at port 8080. Verify:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
MLX models use Apple's MLX format (.safetensors with MLX-compatible config), distinct from GGUF or HF-native formats. The mlx-community HuggingFace org hosts pre-converted models; search for mlx-community/<model-name>-4bit.
First run downloads the model checkpoint; a 4-bit 7B model is ~4 GB and downloads in 2–5 minutes on good WiFi. Time-to-first-response: ~5 seconds after download for model load + Metal shader compilation. MLX-LM auto-detects Apple Silicon and runs all compute on the GPU via Metal; the Neural Engine is not used.
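The same verification can be done from Python without extra dependencies. This is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8080 and returns the usual OpenAI chat-completion response shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080",
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant text (assumes OpenAI response shape)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print a short reply. Building the request separately from sending it keeps the payload easy to inspect when a client and the server disagree about fields.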
Workload fit
Best for:
- Apple Silicon-native local inference (M1–M4 family, including iPad Pros with M-series chips)
- macOS/iOS app development embedding LLM inference via MLX Swift
- Developers in the Apple ecosystem who value Metal-first tooling
- Single-user chat and code generation on MacBook Pro, Mac Studio, or iMac
- Scenarios where unified memory lets you run larger models than a comparable NVIDIA consumer GPU (e.g. 64 GB unified memory vs 24 GB VRAM)
Not suited for:
- Non-Apple hardware
- Production multi-tenant serving (use vLLM on NVIDIA)
- Windows or Linux deployment (MLX is macOS-specific)
- GGUF-based model ecosystems without format conversion
- Training or fine-tuning of large models (MLX-LM is primarily inference-focused)
Alternatives
Use MLX-LM when on Apple Silicon and you want the best Metal GPU utilization: MLX benchmarks 5–15% faster throughput than llama.cpp Metal on many models, which is attributed to optimized unified-memory access patterns (the Neural Engine is not involved; all compute runs on the GPU). The Swift/Python interop is strong: MLX models can be loaded in Swift apps via the MLX Swift library, making it the right choice for native macOS/iOS LLM apps. Switch to llama.cpp when you need the broadest model format support, since the MLX format has a narrower pre-converted model catalog. Use Ollama on macOS when you want zero-config setup with automatic model downloads; Ollama wraps llama.cpp on Metal but handles model management. Use LM Studio on macOS when you want a GUI. For non-Apple hardware, MLX has no support: use vLLM (NVIDIA) or llama.cpp (everything else).
Troubleshooting + when to switch
Problem: ValueError: Unsupported model type when loading a HuggingFace model directly.
Fix: The checkpoint is not in a form MLX-LM can load, either because it has not been converted or because the architecture is not yet supported in your installed version. Convert with: mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct --mlx-path ./converted-model -q. The -q flag quantizes during conversion. Or use pre-converted models from mlx-community on HuggingFace.
Problem: Performance slower than expected on M3 Max/M4 Max.
Fix: MLX-LM's Metal shaders may be cold-compiling on first run. Run a warmup generation first. Keep macOS and MLX up to date; MLX-LM does not use the Neural Engine, so there is no ANE setting to enable. Check mlx.core.metal.is_available() returns True. Some models require --max-kv-size to be raised for long context.
Problem: Server doesn't support /v1/models endpoint.
Fix: MLX-LM server implements a subset of the OpenAI API. /v1/chat/completions and /v1/completions are supported, but /v1/models, /v1/embeddings, and function calling may not be. Check the latest MLX-LM server README for current endpoint coverage.
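To tell cold-start cost apart from steady-state speed, time the first generation separately from the ones that follow. The sketch below is deliberately generic: `time_cold_and_warm` is a hypothetical helper that accepts any callable taking a prompt, so it does not assume a particular MLX-LM API.

```python
import time
from typing import Callable, List, Tuple


def time_cold_and_warm(generate: Callable[[str], object], prompt: str,
                       warm_runs: int = 3) -> Tuple[float, List[float]]:
    """Return (first-call seconds, list of seconds for each later call)."""
    start = time.perf_counter()
    generate(prompt)  # first call pays for model load and shader compilation
    cold = time.perf_counter() - start

    warm: List[float] = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        generate(prompt)
        warm.append(time.perf_counter() - start)
    return cold, warm
```

With MLX-LM installed, `generate` could be a small wrapper around the library's Python generation call or around an HTTP request to the local server. If the first call is much slower than the rest, the slowdown is warmup rather than a steady-state problem.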
Stack & relationships
How MLX-LM relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.
Recommended stack
- Pairs with Exo
Exo is how you scale MLX-LM beyond a single Mac. The 2026 unlock — Thunderbolt 5 + macOS 26.2 RDMA — makes the cluster credible for serious models.
Alternatives
- Alternative to llama.cpp
On Apple Silicon, MLX-LM is now competitive with llama.cpp Metal — especially on long-context workloads. Pick MLX if you want Apple-native; pick llama.cpp if you want cross-platform GGUF compatibility.
Depends on
- Depends on Exo
Exo runs MLX under the hood for the per-device inference layer. Pipeline-parallel scheduling is Exo; the actual matmul kernels are MLX.
Avoid pairing with
- Incompatible with vLLM
Different ecosystems entirely — vLLM is GPU/Linux/CUDA, MLX-LM is Apple Silicon/Metal. They don't compete; they don't pair. Listed here so the page graph makes the boundary explicit.
- Incompatible with TensorRT-LLM
NVIDIA-only vs Apple-only. Same boundary as vLLM↔MLX. Surface explicitly so readers don't assume cross-platform.
- Incompatible with SGLang
NVIDIA-CUDA-mature vs Apple-Silicon-only. Surface the boundary explicitly to prevent cross-platform assumptions.
Featured in these stacks
The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Workstation tier · Role: Inference engine (Apple-native) · Build a Mac-native AI stack (May 2026)
MLX-LM over llama.cpp on M-series silicon: matched throughput on short context, ~15-25% faster on long context (32K+), and the path that pairs with Exo for cluster scaling. Use llama.cpp when you need GGUF quants MLX hasn't picked up yet.
- Stack · L3 · Production tier · Role: Inference engine (per-node) · Build a multi-machine Apple Silicon cluster (May 2026)
MLX-LM runs on each cluster node as the per-device inference layer. Exo orchestrates; MLX executes. Long-context performance on M-series silicon is now stronger than llama.cpp Metal — pick MLX over Ollama for cluster deployments specifically.
Pros
- Native Apple Silicon
- Active Apple development
- Strong long-context
Cons
- Apple-only
- MLX quant format separate from GGUF
Compatibility
| Property | Value |
|---|---|
| Operating systems | macOS |
| GPU backends | Apple Metal |
| License | Open source · free |
Runtime health
Operator-grade signals on how actively MLX-LM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
8 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Ecosystem stability
Editorial rating from RunLocalAI — qualitative, not measured.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.