RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

Runner · Open source · Free · 4.5/5

MLX-LM

The LLM runner for MLX, Apple's Metal-native ML framework. Now competitive with llama.cpp Metal on M-series silicon, with better long-context performance.

By Fredoline Eruo · Last verified Jun 12, 2026 · 4,200 GitHub stars

Overview

What MLX-LM actually is

MLX-LM is the canonical Python inference library for Apple Silicon, built on Apple's MLX array framework. It is not a wrapper around llama.cpp — it is a fundamentally different code path with native Metal kernels written by Apple's MLX team, lazy-evaluated computation graphs, and unified-memory awareness baked into the design.

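For Apple Silicon, MLX-LM is the highest-throughput first-party inference path in 2026. Relative to llama.cpp's Metal backend on M2 / M3 / M4 generations, the figures quoted on this page range from matched throughput on short context to a 15-35% lead on most decoder-only models; no editorial benchmarks are on file yet, so treat the margin as an estimate that varies with model and context length. It is also the only path that ships native MLX-4bit / MLX-8bit quantization formats designed specifically for unified-memory bandwidth profiles.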

Where it fits in the stack

MLX-LM lives at the engine layer for Apple Silicon, full stop. The stack on macOS:

  • Frontend: Open WebUI, LM Studio (LM Studio's MLX backend uses MLX-LM internals), or any OAI-compatible client
  • Server: mlx_lm.server exposes an OpenAI-compatible endpoint
  • Engine: MLX-LM
  • Hardware: Apple Silicon, M1 through M4, including M-series Pro / Max / Ultra (iPad Pro M-series is reachable through MLX Swift, not the Python mlx-lm package)

It is not a Linux engine, not a Windows engine, not a CUDA engine. If your fleet is mixed-OS, llama.cpp is the cross-platform fallback — but on a Mac dev box or a Mac Studio inference workstation, MLX-LM is the right answer.

Best use cases

  • Local LLM dev on a MacBook Pro M3 / M4 Max. 32-128 GB unified memory means you can prototype 70B-class models that would not fit on a single 24 GB consumer NVIDIA card.
  • Mac Studio inference workstation. An Ultra-class Mac Studio with 192 GB or more of unified memory (M2 Ultra at 192 GB, M3 Ultra at 256 GB or 512 GB) plus MLX-LM is among the cheapest paths to running Llama 3.1 70B at FP16 outside a datacenter.
  • Apple Silicon-resident agentic stacks. Pair with the same memory + MCP toolset as /stacks/local-coding-agent but route inference through MLX-LM instead of vLLM.
  • Battery-aware inference research. MLX-LM's lazy evaluation and unified-memory model mean it idles cheaply between requests.
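
The memory claims in the list above come down to simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimate only: the function names are illustrative, and the 75% usable-memory share is an assumption chosen to leave headroom for the OS and KV cache, not a documented MLX limit.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, usable_fraction: float = 0.75) -> bool:
    """True if the weights fit in the assumed usable share of memory.

    usable_fraction is an assumption, not a measured system limit.
    """
    return weights_gb(params_billion, bits_per_weight) <= memory_gb * usable_fraction


if __name__ == "__main__":
    for label, bits, mem in [("70B FP16 on 192 GB", 16, 192),
                             ("70B 4-bit on 64 GB", 4, 64),
                             ("70B 4-bit on 24 GB", 4, 24)]:
        print(label, "->", round(weights_gb(70, bits)), "GB weights,",
              "fits" if fits(70, bits, mem) else "does not fit")
```

Under these assumptions a 70B model needs about 140 GB at FP16 and about 35 GB at a flat 4 bits per weight, which is why it is out of reach for a 24 GB card but plausible on a large unified-memory Mac.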

OS support

OS | Quality
macOS 14+ (Apple Silicon) | excellent; only supported target
macOS 13 (Ventura) | partial; works for older MLX versions, new releases require Sonoma+
Anything else | unsupported

Hardware / backend support

Apple Silicon only — M1, M1 Pro, M1 Max, M1 Ultra, M2 / Pro / Max / Ultra, M3 / Pro / Max / Ultra, M4 / Pro / Max. The performance ladder roughly tracks memory bandwidth, not raw GPU FLOPs:

  • M1 / M2 / M3 / M4 (base): ~70-120 GB/s; usable for 7B-class
  • M-series Pro: ~150-270 GB/s; comfortable for 13B-class
  • M-series Max: ~300-550 GB/s; comfortable for 32B-class at 4-bit
  • M-series Ultra: ~800 GB/s; the realistic 70B+ tier
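
Why bandwidth rather than FLOPs: in single-stream decoding of a dense model, each generated token has to read roughly all of the weights once, so memory bandwidth divided by weight size gives a rough upper bound on tokens per second. The sketch below applies that rule of thumb; the function name and the round bandwidth figures are illustrative assumptions, and real throughput will land below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model.

    Assumes every token reads all weights once and ignores KV-cache
    traffic, kernel overhead, and compute limits.
    """
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Illustrative round numbers per tier, not measurements.
    tiers = {"base": 100, "Pro": 200, "Max": 400, "Ultra": 800}
    for name, bandwidth in tiers.items():
        # 40 GB is a stand-in for a 70B-class model at roughly 4-bit.
        print(f"{name}: <= {decode_ceiling_tok_s(bandwidth, 40):.1f} tok/s on 40 GB of weights")
```

On these numbers a 40 GB model tops out near 20 tok/s on an Ultra but only about 2.5 tok/s on a base chip, even before asking whether it fits in memory.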

The Neural Engine on Apple Silicon is not used by MLX-LM — all compute goes through the integrated GPU via Metal. ANE is fixed-function and not addressable for arbitrary transformer kernels.

Model / quant format support

  • MLX-4bit / MLX-8bit — native quants; fastest path; conversion via mlx_lm.convert
  • FP16 / BF16 — full-precision baseline; the 192 GB Ultra makes this realistic for 70B
  • GGUF — partial support via conversion; not the recommended path
  • AWQ / GPTQ / EXL2 — unsupported; these are CUDA-kernel-bound formats
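
A "4-bit" group-quantized checkpoint is slightly larger than 4 bits per weight, because each group of weights also stores a scale and a bias. The sketch below works through that overhead. The group size of 64 and the 16-bit scale and bias are assumptions about typical MLX quantization settings; check the config of the checkpoint you actually download.

```python
def effective_bits(bits: int = 4, group_size: int = 64,
                   scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight once per-group scale and bias are amortized."""
    return bits + (scale_bits + bias_bits) / group_size


def quantized_gb(params_billion: float, bits: int = 4, group_size: int = 64) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return params_billion * effective_bits(bits, group_size) / 8


if __name__ == "__main__":
    print("4-bit effective bits/weight:", effective_bits(4))
    print("8-bit effective bits/weight:", effective_bits(8))
    print("8B model at 4-bit, GB:", quantized_gb(8))
```

Under these assumptions an 8B model lands near 4.5 GB rather than 4 GB, which matters when sizing against a small unified-memory machine.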

If you want the format-by-format breakdown across runtimes see /systems/quantization-formats.

Setup path

pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit --prompt "Hello"
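
The same generation can be driven from Python. This is a minimal sketch based on the load and generate functions exported by the mlx_lm package; build_messages and run_once are illustrative helper names, and the mlx_lm import is deferred so the file can be imported on a machine without MLX installed.

```python
def build_messages(user_text, system_text=None):
    """Build a chat message list in the role/content shape chat templates expect."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


def run_once(model_id, user_text, max_tokens=128):
    """Load a model and generate one reply (requires Apple Silicon and mlx-lm)."""
    from mlx_lm import load, generate  # deferred: only needed when actually running

    model, tokenizer = load(model_id)
    prompt = tokenizer.apply_chat_template(
        build_messages(user_text), add_generation_prompt=True
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


if __name__ == "__main__":
    print(run_once("mlx-community/Llama-3.1-8B-Instruct-4bit", "Hello"))
```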

For an HTTP server:

mlx_lm.server --model mlx-community/Llama-3.1-8B-Instruct-4bit --port 8080
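
Because the server speaks the OpenAI chat-completions shape, any HTTP client can talk to it. The sketch below uses only the Python standard library; build_chat_request and ask are illustrative names, the base URL assumes the port used in the command above, and the response parsing assumes the usual choices[0].message.content layout.

```python
import json
from urllib import request


def build_chat_request(base_url, user_text, max_tokens=64):
    """Build (but do not send) a POST to the chat-completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, user_text):
    """Send the request and return the assistant text (needs a running server)."""
    with request.urlopen(build_chat_request(base_url, user_text), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("http://localhost:8080", "Hello"))
```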

The mlx-community Hugging Face org hosts pre-converted MLX-4bit checkpoints for most popular open-weight models. Conversion of an arbitrary HF safetensors model is one command:

mlx_lm.convert --hf-path meta-llama/Llama-3.1-8B-Instruct -q

What breaks first

  1. Metal OOM mid-generation. Unified memory swaps to disk silently before refusing; tokens-per-sec collapses to ~1. See /errors/metal-out-of-memory.
  2. Models without an MLX checkpoint published. Conversion is one command but takes 10-30 min and needs the original HF safetensors local.
  3. Long-context decode on small Macs. The 8 GB / 16 GB base M-series machines run out of KV-cache headroom past ~8K tokens on 7B-class models.
  4. Pip dependency drift. MLX moves quickly; a working environment can break on a pip install -U. Pin versions in production.
  5. Unsupported architectures landing late. Brand-new model families (Mamba-2, RWKV-7, novel MoE routers) sometimes arrive in MLX 1-3 months after llama.cpp.
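
The KV-cache headroom problem in item 3 can be sized with a short calculation: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes an unquantized 16-bit cache; the two shapes used are the published layer and head counts for Llama 2 7B (full multi-head attention) and Llama 3.1 8B (grouped-query attention), and the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes held by keys plus values across all layers for n_tokens of context."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


if __name__ == "__main__":
    gib = 2 ** 30
    # Llama 2 7B: 32 layers, 32 KV heads, head_dim 128
    print("Llama 2 7B at 8K tokens:", kv_cache_bytes(32, 32, 128, 8192) / gib, "GiB")
    # Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128
    print("Llama 3.1 8B at 8K tokens:", kv_cache_bytes(32, 8, 128, 8192) / gib, "GiB")
```

On an 8 GB machine, 4 GiB of cache on top of roughly 4 GB of 4-bit weights leaves nothing for the OS, which is consistent with the ~8K ceiling described above; grouped-query models push that ceiling out by about 4x.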

Alternatives by intent

If you want… | Reach for
Friendly UX, same-ish performance on Mac | LM Studio (uses MLX under the hood)
Cross-platform, GGUF | llama.cpp or Ollama
Apple-native fine-tuning | MLX (the framework, not just MLX-LM)
CUDA-class throughput | move to RTX 4090 + vLLM

Best pairings

  • Open WebUI — point at the MLX-LM HTTP endpoint
  • Continue.dev / Aider — coding-agent workflows on Mac dev machines
  • Apple M3 Ultra — the canonical inference-workstation pairing for 70B-class models
  • Apple M4 Max — the canonical battery-aware dev pairing

Who should avoid MLX-LM

  • Anyone on Linux / Windows. Period.
  • Multi-tenant production serving with concurrent users. MLX-LM serves one stream well, not 50.
  • Workloads needing AWQ-INT4 fit. Apple Silicon has its own quant story; MLX-4bit ≠ AWQ-INT4 in either format or kernel design.
  • Teams that need reproducible Linux builds. The Mac-only target is a real constraint.

Related

  • Stacks: /stacks/multi-machine-apple-cluster, /stacks/local-coding-agent
  • System guides: /systems/quantization-formats, /setup
  • Hardware: Apple M3 Ultra, Apple M4 Max
  • Errors: /errors/metal-out-of-memory

Setup guidance

Install on macOS (Apple Silicon only, M1+). Requires Python 3.10+ and macOS 14.0+. MLX-LM uses Apple's MLX framework, which gives Metal GPU acceleration without CUDA.

pip install mlx-lm

Run a model:

mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit --prompt "Hello"

For server mode:

mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

The server exposes an OpenAI-compatible /v1/chat/completions endpoint on port 8080. Verify:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'

MLX models use Apple's MLX format (.safetensors with MLX-compatible config), distinct from GGUF or HF-native formats. The mlx-community Hugging Face org hosts pre-converted models; search for mlx-community/<model-name>-4bit. First run downloads the model checkpoint; a 4-bit 7B model is ~4 GB and downloads in 2-5 minutes on good WiFi. Time-to-first-response is ~5 seconds after download, covering model load and Metal shader compilation. MLX-LM auto-detects Apple Silicon and runs all compute on the integrated GPU through Metal; it does not use the Neural Engine.

Workload fit

Best for:

  • Apple Silicon-native local inference (M1-M4 family, including iPad Pros with M-series chips)
  • macOS/iOS app development embedding LLM inference via MLX Swift
  • Developers in the Apple ecosystem who value Metal-first tooling
  • Single-user chat and code generation on MacBook Pro, Mac Studio, or iMac
  • Scenarios where unified memory lets you run larger models than a comparable NVIDIA consumer GPU (e.g. 64 GB unified memory vs 24 GB VRAM)

Not suited for:

  • Non-Apple hardware
  • Production multi-tenant serving (use vLLM on NVIDIA)
  • Windows or Linux deployment (MLX is macOS-specific)
  • GGUF-based model ecosystems without format conversion
  • Training or fine-tuning of large models (MLX-LM is primarily inference-focused)

Alternatives

Use MLX-LM when you are on Apple Silicon and want the best Metal GPU utilization. MLX is quoted at roughly 5-15% faster throughput than llama.cpp Metal on many models, a lead attributed to optimized unified-memory access patterns (not to the Neural Engine, which MLX-LM does not use). The Swift/Python interop is strong: MLX models can be loaded in Swift apps via the MLX Swift library, making it the right choice for native macOS/iOS LLM apps.

Switch to llama.cpp when you need the broadest model format support, because the MLX format has a narrower pre-converted model catalog. Use Ollama on macOS when you want zero-config setup with automatic model downloads; Ollama wraps llama.cpp on Metal but handles model management. Use LM Studio on macOS when you want a GUI. For non-Apple hardware, MLX has no support: use vLLM (NVIDIA) or llama.cpp (everything else).

Troubleshooting + when to switch

Problem: ValueError: Unsupported model type when loading a HuggingFace model directly.
Fix: Point MLX-LM at an MLX-format checkpoint. Convert with: mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct --mlx-path ./converted-model -q. The -q flag quantizes during conversion. Alternatively, use a pre-converted model from mlx-community on Hugging Face. If the error survives conversion, the architecture may not be implemented in your installed MLX-LM version yet.

Problem: Performance slower than expected on M3 Max/M4 Max.
Fix: MLX-LM's Metal shaders may be cold-compiling on first run, so run a warmup generation before timing anything. Ensure macOS is 15.0+ on M4 machines. Check that mlx.core.metal.is_available() returns True. The Neural Engine is not a factor, because MLX-LM runs on the GPU through Metal. Some models require --max-kv-size to be raised for long context.

Problem: Server doesn't support /v1/models endpoint.
Fix: MLX-LM server implements a subset of the OpenAI API. /v1/chat/completions and /v1/completions are supported, but /v1/models, /v1/embeddings, and function calling may not be. Check the latest MLX-LM server README for current endpoint coverage.
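
To keep cold shader compilation out of a performance reading, a timing harness can discard one or more warmup runs before measuring. The helper below is a generic sketch: time_after_warmup is an illustrative name, and fn stands in for whatever callable performs one generation in your setup.

```python
import time


def time_after_warmup(fn, warmup=1, runs=3):
    """Call fn `warmup` times untimed, then return wall-clock seconds for `runs` timed calls."""
    for _ in range(warmup):
        fn()  # absorbs one-off costs such as shader compilation and cache fills
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with a callable that runs one generation.
    print(time_after_warmup(lambda: sum(range(100_000))))
```

If the first timed run is still much slower than the rest, raise warmup before concluding that the hardware is underperforming.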

Stack & relationships

How MLX-LM relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.

MLX-LM ↔ ecosystem

Recommended stack

  • Pairs with
    Exo

    Exo is how you scale MLX-LM beyond a single Mac. The 2026 unlock — Thunderbolt 5 + macOS 26.2 RDMA — makes the cluster credible for serious models.

Alternatives

  • Alternative to
    llama.cpp

    On Apple Silicon, MLX-LM is now competitive with llama.cpp Metal — especially on long-context workloads. Pick MLX if you want Apple-native; pick llama.cpp if you want cross-platform GGUF compatibility.

Depended on by

  • Dependency of
    Exo

    Exo runs MLX under the hood for the per-device inference layer. Pipeline-parallel scheduling is Exo; the actual matmul kernels are MLX.
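
    To make the pipeline-parallel split concrete, the sketch below assigns contiguous layer ranges to nodes in proportion to each node's memory. It is a hypothetical illustration of one plausible scheme, not code taken from Exo; partition_layers and the example memory sizes are assumptions.

```python
def partition_layers(n_layers, node_memory_gb):
    """Split layers 0..n_layers into contiguous (start, end) ranges, weighted by memory.

    Hypothetical scheme for illustration; end indices are exclusive.
    """
    total = sum(node_memory_gb)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, mem in enumerate(node_memory_gb):
        cumulative += mem
        if i == len(node_memory_gb) - 1:
            end = n_layers  # last node takes whatever remains
        else:
            end = round(n_layers * cumulative / total)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # An 80-layer model across a 192 GB node and a 64 GB node.
    print(partition_layers(80, [192, 64]))
```

    Each node would then run its own layer range through MLX and pass activations to the next node, which is the division of labor described in the note above.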

Avoid pairing with

  • Incompatible with
    vLLM

    Different ecosystems entirely — vLLM is GPU/Linux/CUDA, MLX-LM is Apple Silicon/Metal. They don't compete; they don't pair. Listed here so the page graph makes the boundary explicit.

  • Incompatible with
    TensorRT-LLM

    NVIDIA-only vs Apple-only. Same boundary as vLLM↔MLX. Surface explicitly so readers don't assume cross-platform.

  • Incompatible with
    SGLang

    NVIDIA-CUDA-mature vs Apple-Silicon-only. Surface the boundary explicitly to prevent cross-platform assumptions.

Featured in these stacks

The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: Inference engine (Apple-native)
    Build a Mac-native AI stack (May 2026)

    MLX-LM over llama.cpp on M-series silicon: matched throughput on short context, ~15-25% faster on long context (32K+), and the path that pairs with Exo for cluster scaling. Use llama.cpp when you need GGUF quants MLX hasn't picked up yet.

  • Stack · L3·Production tier·Role: Inference engine (per-node)
    Build a multi-machine Apple Silicon cluster (May 2026)

    MLX-LM runs on each cluster node as the per-device inference layer. Exo orchestrates; MLX executes. Long-context performance on M-series silicon is now stronger than llama.cpp Metal — pick MLX over Ollama for cluster deployments specifically.

Pros

  • Native Apple Silicon
  • Active Apple development
  • Strong long-context

Cons

  • Apple-only
  • MLX quant format separate from GGUF

Compatibility

Operating systems: macOS
GPU backends: Apple Metal
License: Open source · free

Runtime health

Operator-grade signals on how actively MLX-LM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active
Updated Jun 12, 2026

8 days since last refresh · source: lastUpdated

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Ecosystem stability

Editorial rating from RunLocalAI — qualitative, not measured.

4.5/5 ✓ Editorial

Get MLX-LM

GitHub
https://github.com/ml-explore/mlx-lm

Frequently asked

Is MLX-LM free?

Yes — MLX-LM is free to use and open-source.

What operating systems does MLX-LM support?

MLX-LM supports macOS.

Which GPUs work with MLX-LM?

MLX-LM supports Apple Metal. CPU-only operation is also possible but typically slower.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
