runner

Open source

free

4.5/5

MLX-LM

The LLM runner for MLX, Apple's Metal-native ML framework. Now competitive with llama.cpp Metal on M-series silicon, with better long-context performance.

By Fredoline Eruo · Last verified Jun 12, 2026 · 4,200 GitHub stars

Overview

What MLX-LM actually is

MLX-LM is the canonical Python inference library for Apple Silicon, built on Apple's MLX array framework. It is not a wrapper around llama.cpp — it is a fundamentally different code path with native Metal kernels written by Apple's MLX team, lazy-evaluated computation graphs, and unified-memory awareness baked into the design.

For Apple Silicon, MLX-LM is the highest-throughput first-party inference path in 2026. On M2 / M3 / M4 generations it can outperform llama.cpp's Metal backend by 15-35% on decoder-only models, with the largest gains on long context and roughly matched throughput on short context. It is also the only path that ships native MLX-4bit / MLX-8bit quantization formats designed specifically for unified-memory bandwidth profiles.

Where it fits in the stack

MLX-LM lives at the engine layer for Apple Silicon, full stop. The stack on macOS:

Frontend: Open WebUI, LM Studio (LM Studio's MLX backend uses MLX-LM internals), or any OpenAI-compatible client
Server: mlx_lm.server exposes an OpenAI-compatible endpoint
Engine: MLX-LM
Hardware: Apple Silicon — M1 through M4, including M-series Pro / Max / Ultra and iPad Pro M-series

It is not a Linux engine, not a Windows engine, not a CUDA engine. If your fleet is mixed-OS, llama.cpp is the cross-platform fallback — but on a Mac dev box or a Mac Studio inference workstation, MLX-LM is the right answer.

Best use cases

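Local LLM dev on a MacBook Pro M3 / M4 Max. These machines ship with 32-128 GB of unified memory; on the 64 GB and larger configurations you can prototype 70B-class models at 4-bit that would not fit on a single 24 GB consumer NVIDIA card.
Mac Studio inference workstation. An Ultra-class Mac Studio with 192 GB or more of unified memory (M2 Ultra at 192 GB; M3 Ultra at 256 GB or 512 GB) + MLX-LM is the cheapest path to running Llama 3.1 70B at FP16 (about 140 GB of weights) anywhere outside a datacenter.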
Apple Silicon-resident agentic stacks. Pair with the same memory + MCP toolset as /stacks/local-coding-agent but route inference through MLX-LM instead of vLLM.
Battery-aware inference research. MLX-LM's lazy evaluation and unified-memory model mean it idles cheaply between requests.

OS support

OS	Quality
macOS 14+ (Apple Silicon)	excellent — only supported target
macOS 13 (Ventura)	partial — works for older MLX versions; new releases require Sonoma+
Anything else	unsupported
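
The support table above, together with the Python 3.10+ requirement listed under Setup guidance, can be turned into a small preflight check. This is a minimal sketch: the function name host_problems and its messages are hypothetical, and the thresholds are simply the ones stated on this page.

```python
import platform
import sys


def host_problems(system, machine, macos_version, python_version):
    """Return the reasons a host fails the requirements stated on this page.

    system and machine are the strings returned by platform.system() and
    platform.machine(); macos_version is a string such as "14.5";
    python_version is a (major, minor) tuple.
    """
    problems = []
    if system != "Darwin":
        problems.append(f"unsupported OS: {system}")
        return problems  # the remaining checks only make sense on macOS
    if machine != "arm64":
        problems.append(f"not Apple Silicon: {machine}")
    major = int(macos_version.split(".")[0]) if macos_version else 0
    if major < 14:
        problems.append(f"macOS {macos_version} is older than 14")
    if tuple(python_version) < (3, 10):
        problems.append("Python older than 3.10")
    return problems


if __name__ == "__main__":
    found = host_problems(
        platform.system(),
        platform.machine(),
        platform.mac_ver()[0],
        sys.version_info[:2],
    )
    print("ok" if not found else "\n".join(found))
```

One caveat: an x86_64 Python running under Rosetta reports machine as x86_64 even on Apple Silicon, so a "not Apple Silicon" result may mean the interpreter, not the hardware, is the problem.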

Hardware / backend support

Apple Silicon only — M1, M1 Pro, M1 Max, M1 Ultra, M2 / Pro / Max / Ultra, M3 / Pro / Max / Ultra, M4 / Pro / Max. The performance ladder roughly tracks memory bandwidth, not raw GPU FLOPs:

M1 / M2 / M3 / M4 (base): ~70-120 GB/s; usable for 7B-class
M-series Pro: ~150-275 GB/s; comfortable for 13B-class
M-series Max: ~300-550 GB/s; comfortable for 32B-class at 4-bit
M-series Ultra: ~800 GB/s; the realistic 70B+ tier
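
Why the ladder tracks bandwidth: during single-stream decode of a dense model, every generated token has to read essentially all of the weights once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below illustrates that arithmetic only. The function name and the round per-tier bandwidth figures are assumptions for illustration, not measurements, and real throughput will be lower.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s, weights_gb):
    """Rough upper bound on single-stream decode speed for a dense model.

    Each generated token reads every weight once, so the ceiling is
    memory bandwidth divided by the size of the weights.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Round, illustrative bandwidth figures per tier (assumptions, not measurements).
TIERS_GB_S = {"base": 100, "Pro": 200, "Max": 400, "Ultra": 800}

if __name__ == "__main__":
    weights_gb = 4.0  # roughly a 7B-class model at 4-bit
    for tier, bandwidth in TIERS_GB_S.items():
        ceiling = decode_ceiling_tok_per_s(bandwidth, weights_gb)
        print(f"{tier:>5}: at most ~{ceiling:.0f} tok/s")
```

The same arithmetic explains the tiering: a 40 GB model on an 800 GB/s Ultra has the same ceiling (about 20 tok/s) as a 5 GB model on a 100 GB/s base chip.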

The Neural Engine on Apple Silicon is not used by MLX-LM — all compute goes through the integrated GPU via Metal. ANE is fixed-function and not addressable for arbitrary transformer kernels.

Model / quant format support

MLX-4bit / MLX-8bit — native quants; fastest path; conversion via mlx_lm.convert
FP16 / BF16 — full-precision baseline; the 192 GB Ultra makes this realistic for 70B
GGUF — partial support via conversion; not the recommended path
AWQ / GPTQ / EXL2 — unsupported; these are CUDA-kernel-bound formats
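
To judge which format fits a given machine, a back-of-the-envelope weight footprint is usually enough. The sketch below is illustrative only: the function names are hypothetical, the 4.5 and 8.5 bits-per-weight figures assume group-wise quantization with a small per-group overhead for scales and biases, and the 0.75 usable fraction is a placeholder for how much unified memory the GPU can realistically claim. KV cache and activations are not included.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(weights_gb, total_memory_gb, usable_fraction=0.75):
    """True if the weights fit in the assumed usable share of memory.

    usable_fraction is a placeholder assumption, not a documented limit.
    """
    return weights_gb <= total_memory_gb * usable_fraction


if __name__ == "__main__":
    # Effective bits per weight for the quantized rows are assumptions.
    for label, bits in [("FP16 / BF16", 16), ("8-bit", 8.5), ("4-bit", 4.5)]:
        gb = weight_footprint_gb(70, bits)
        verdict = fits_in_memory(gb, 192)
        print(f"70B at {label}: ~{gb:.0f} GB of weights, fits in 192 GB: {verdict}")
```

Under these assumptions a 70B model at FP16 (about 140 GB) only just fits in 192 GB, while the 4-bit variant (about 39 GB) leaves ample room for context.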

If you want the format-by-format breakdown across runtimes see /systems/quantization-formats.

Setup path

pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit --prompt "Hello"

For an HTTP server:

mlx_lm.server --model mlx-community/Llama-3.1-8B-Instruct-4bit --port 8080

The mlx-community Hugging Face org hosts pre-converted MLX-4bit checkpoints for most popular open-weight models. Conversion of an arbitrary HF safetensors model is one command:

mlx_lm.convert --hf-path meta-llama/Llama-3.1-8B-Instruct -q

What breaks first

Metal OOM mid-generation. Unified memory swaps to disk silently before refusing; tokens-per-sec collapses to ~1. See /errors/metal-out-of-memory.
Models without an MLX checkpoint published. Conversion is one command but takes 10-30 min and needs the original HF safetensors local.
Long-context decode quality on small Macs. The 8 GB / 16 GB base M-series machines run out of KV-cache headroom past ~8K tokens on 7B-class models.
Pip dependency drift. MLX moves quickly; a working environment can break on a pip install -U. Pin versions in production.
Unsupported architectures landing late. Brand-new model families (Mamba-2, RWKV-7, novel MoE routers) sometimes arrive in MLX 1-3 months after llama.cpp.
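
The KV-cache headroom problem in the list above can be estimated before it bites. The cache stores one key and one value vector per layer per token, so its size grows linearly with context length. The sketch below is a rough estimate under stated assumptions: the function names are hypothetical, values are assumed to be stored in 16-bit precision, and the model shapes in the example (32 layers, head dimension 128, with 8 or 32 KV heads) are supplied for illustration, not read from any checkpoint.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    The leading factor of 2 accounts for storing both K and V in each layer.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Illustrative shapes (assumptions): grouped-query attention with 8 KV
    # heads versus full multi-head attention with 32 KV heads.
    for label, kv_heads in [("8 KV heads", 8), ("32 KV heads", 32)]:
        size = kv_cache_bytes(layers=32, kv_heads=kv_heads, head_dim=128, tokens=8192)
        print(f"{label}: {to_gib(size):.1f} GiB at 8K tokens")
```

On an 8 GB machine that already holds about 4 GB of 4-bit weights plus the OS, a cache of one to several GiB is the difference between a working session and swapping.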

Alternatives by intent

If you want…	Reach for
Friendly UX, same-ish performance on Mac	LM Studio (uses MLX under the hood)
Cross-platform, GGUF	llama.cpp or Ollama
Apple-native fine-tuning	MLX (the framework, not just MLX-LM)
CUDA-class throughput	move to RTX 4090 + vLLM

Best pairings

Open WebUI — point at the MLX-LM HTTP endpoint
Continue.dev / Aider — coding-agent workflows on Mac dev machines
Apple M3 Ultra — the canonical inference-workstation pairing for 70B-class models
Apple M4 Max — the canonical battery-aware dev pairing

Who should avoid MLX-LM

Anyone on Linux / Windows. Period.
Multi-tenant production serving with concurrent users. MLX-LM serves one stream well, not 50.
Workloads needing AWQ-INT4 fit. Apple Silicon has its own quant story; MLX-4bit ≠ AWQ-INT4 in either format or kernel design.
Teams that need reproducible Linux builds. The Mac-only target is a real constraint.

Stacks: /stacks/multi-machine-apple-cluster, /stacks/local-coding-agent
System guides: /systems/quantization-formats, /setup
Hardware: Apple M3 Ultra, Apple M4 Max
Errors: /errors/metal-out-of-memory

Setup guidance

Install on macOS (Apple Silicon only, M1+). Requires Python 3.10+ and macOS 14.0+:

pip install mlx-lm

MLX-LM uses Apple's MLX framework, which provides Metal GPU acceleration without CUDA. Run a model:

mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit --prompt "Hello"

For server mode:

mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

The server exposes an OpenAI-compatible /v1/chat/completions endpoint on port 8080. Verify:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'

MLX models use Apple's MLX format (.safetensors with MLX-compatible config), distinct from GGUF or HF-native formats. The mlx-community Hugging Face org hosts pre-converted models; search for mlx-community/<model-name>-4bit.

First run downloads the model checkpoint; a 4-bit 7B model is ~4 GB and downloads in 2–5 minutes on good WiFi. Time-to-first-response is ~5 seconds after download, covering model load and Metal shader compilation. MLX-LM auto-detects Apple Silicon and runs all compute on the integrated GPU via Metal; the Neural Engine is not used.
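
The curl check above can also be scripted with the Python standard library. This is a minimal sketch, not the project's client: the helper names are hypothetical, the max_tokens field and the choices[0].message.content response shape are assumed from the OpenAI chat-completions convention the server is said to follow, and the URL assumes the server was started on port 8080 as shown.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=128):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # assumed to be honoured by the server
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, ask("Hello") should return the model's reply as a string. Keeping request construction separate from sending makes the payload easy to inspect without a live server.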

Workload fit

Best for:

Apple Silicon-native local inference (M1–M4 family, including iPad Pros with M-series chips)
macOS/iOS app development embedding LLM inference via MLX Swift
Developers in the Apple ecosystem who value Metal-first tooling
Single-user chat and code generation on MacBook Pro, Mac Studio, or iMac
Scenarios where unified memory lets you run larger models than a comparable NVIDIA consumer GPU (e.g. 64 GB unified memory vs 24 GB VRAM)

Not suited for:

Non-Apple hardware
Production multi-tenant serving (use vLLM on NVIDIA)
Windows or Linux deployment (MLX is macOS-specific)
GGUF-based model ecosystems without format conversion
Training or fine-tuning of large models (MLX-LM is primarily inference-focused)

Alternatives

Use MLX-LM when you are on Apple Silicon and want the best Metal GPU utilization: MLX benchmarks 5–15% faster throughput than llama.cpp Metal on many models, thanks to optimized unified-memory access patterns. The Neural Engine plays no part in this; all compute runs on the GPU via Metal. The Swift/Python interop is strong: MLX models can be loaded in Swift apps via the MLX Swift library, making it the right choice for native macOS/iOS LLM apps.

Switch to llama.cpp when you need the broadest model format support, since the MLX format has a narrower pre-converted model catalog. Use Ollama on macOS when you want zero-config setup with automatic model downloads; Ollama wraps llama.cpp on Metal but handles model management. Use LM Studio on macOS when you want a GUI. For non-Apple hardware, MLX has no support: use vLLM (NVIDIA) or llama.cpp (everything else).

Troubleshooting + when to switch

Problem: ValueError: Unsupported model type when loading a Hugging Face model directly.
Fix: MLX-LM expects MLX-format models. Convert with:

mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct --mlx-path ./converted-model -q

The -q flag quantizes during conversion. Alternatively, use pre-converted models from mlx-community on Hugging Face. If the error persists after conversion, the architecture may not be implemented in MLX-LM yet.

Problem: Performance slower than expected on M3 Max / M4 Max.
Fix: MLX-LM's Metal shaders may be cold-compiling on first run, so run a warmup generation first. Check that mlx.core.metal.is_available() returns True, and keep macOS and MLX current. MLX-LM does not use the Neural Engine, so there is no ANE setting to enable. Some models require --max-kv-size to be raised for long context.

Problem: Server doesn't support the /v1/models endpoint.
Fix: The MLX-LM server implements a subset of the OpenAI API. /v1/chat/completions and /v1/completions are supported, but /v1/models, /v1/embeddings, and function calling may not be. Check the latest MLX-LM server README for current endpoint coverage.

Stack & relationships

How MLX-LM relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.

MLX-LM ↔ ecosystem

Recommended stack

Pairs with
Exo
Exo is how you scale MLX-LM beyond a single Mac. The 2026 unlock — Thunderbolt 5 + macOS 26.2 RDMA — makes the cluster credible for serious models.

Alternatives

Alternative to
llama.cpp
On Apple Silicon, MLX-LM is now competitive with llama.cpp Metal — especially on long-context workloads. Pick MLX if you want Apple-native; pick llama.cpp if you want cross-platform GGUF compatibility.

Depends on

Depended on by
Exo
Exo runs MLX under the hood for the per-device inference layer. Pipeline-parallel scheduling is Exo; the actual matmul kernels are MLX.

Avoid pairing with

Incompatible with
vLLM
Different ecosystems entirely — vLLM is GPU/Linux/CUDA, MLX-LM is Apple Silicon/Metal. They don't compete; they don't pair. Listed here so the page graph makes the boundary explicit.
Incompatible with
TensorRT-LLM
NVIDIA-only vs Apple-only. Same boundary as vLLM↔MLX. Surface explicitly so readers don't assume cross-platform.
Incompatible with
SGLang
NVIDIA-CUDA-mature vs Apple-Silicon-only. Surface the boundary explicitly to prevent cross-platform assumptions.

Featured in these stacks

The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3 · Workstation tier · Role: Inference engine (Apple-native)
Build a Mac-native AI stack (May 2026)
MLX-LM over llama.cpp on M-series silicon: matched throughput on short context, ~15-25% faster on long context (32K+), and the path that pairs with Exo for cluster scaling. Use llama.cpp when you need GGUF quants MLX hasn't picked up yet.
Stack · L3 · Production tier · Role: Inference engine (per-node)
Build a multi-machine Apple Silicon cluster (May 2026)
MLX-LM runs on each cluster node as the per-device inference layer. Exo orchestrates; MLX executes. Long-context performance on M-series silicon is now stronger than llama.cpp Metal — pick MLX over Ollama for cluster deployments specifically.

Pros

Native Apple Silicon
Active Apple development
Strong long-context

Cons

Apple-only
MLX quant format separate from GGUF

Compatibility

Operating systems	macOS
GPU backends	Apple Metal
License	Open source · free

Runtime health

Operator-grade signals on how actively MLX-LM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active

Updated Jun 12, 2026

8 days since last refresh · source: lastUpdated

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Ecosystem stability

Editorial rating from RunLocalAI — qualitative, not measured.

4.5/5 · Editorial

Get MLX-LM

GitHub

https://github.com/ml-explore/mlx-lm

Frequently asked

Is MLX-LM free?

Yes — MLX-LM is free to use and open-source.

What operating systems does MLX-LM support?

MLX-LM supports macOS.

Which GPUs work with MLX-LM?

MLX-LM supports Apple Metal. CPU-only operation is also possible but typically slower.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
