Best mobile AI runtimes (May 2026) — MLC LLM vs ExecuTorch vs llama.cpp vs ONNX vs Qualcomm AI Hub
Cross-platform mobile LLM runtime tier list. Which OS, which chips, which models, setup complexity, maintenance reality. The honest comparison for shipping on iPhone, Android, or both.
The five runtimes that matter
On mobile in 2026, five runtimes are real: MLC LLM, ExecuTorch, llama.cpp, ONNX Runtime Mobile, and Qualcomm AI Hub. If you're iPhone-only, MLX Swift is a sixth; we treat it separately because the cross-platform argument doesn't apply to it.
The wrong question to ask is “which is fastest?” Speed varies by device, model, and quant — and by what you measure. The right question is which one fits the OS surface, the chip family, and the model envelope you actually need to ship. That is what this page answers.
The tier list
Tier S (ship-it picks for most operators):
- MLC LLM — for cross-platform shipping. Same checkpoint compiles for iOS Metal, Android Adreno (and Mali on some flagships), WebGPU. Active community, model zoo with pre-quantized checkpoints. Setup complexity is medium.
- llama.cpp — for the “already on desktop, same code on mobile” case. Its GGUF model zoo is the largest in mobile. Mobile build paths exist for both iOS (Metal) and Android (CPU + OpenCL or Vulkan). Active maintenance.
- Qualcomm AI Hub — for Snapdragon-only apps that need maximum NPU throughput. Best raw performance on Hexagon, but you give up cross-vendor.
- MLX Swift — for iOS-only apps. Apple's first-party path; same checkpoints as desktop MLX-LM. Not ranked against the cross-platform four because the question here implies cross-platform; mentioned for completeness.
Tier A (good fits when constraints match):
- ExecuTorch — for PyTorch-native model authoring shops that want one toolchain across mobile and edge. Vendor delegate ecosystem is growing. The 2026 trajectory is good but the runtime is younger than llama.cpp.
- ONNX Runtime Mobile — for shops already on ONNX for desktop or cloud. Stable, well-documented, but model conversion is heavy and the LLM-on-mobile community is smaller than the MLC LLM / llama.cpp camps.
The cross-axis comparison
| Runtime | iOS | Android | NPU path | Model format | Setup |
|---|---|---|---|---|---|
| MLC LLM | Yes (Metal) | Yes (Adreno + Mali) | No (GPU only) | MLC-compiled | Medium |
| llama.cpp | Yes (Metal) | Yes (CPU + OpenCL/Vulkan) | No | GGUF | Easy-Medium |
| ExecuTorch | Yes | Yes | Vendor delegates | .pte (ExecuTorch) | Hard |
| ONNX Runtime Mobile | Yes | Yes | NNAPI (deprecated), Core ML | ONNX | Hard |
| Qualcomm AI Hub | No | Snapdragon only | Yes (Hexagon) | QNN-compiled | Medium |
| MLX Swift | Yes | No | No (Metal GPU) | MLX 4-bit | Medium |
How to pick (decision flow)
- Cross-platform iOS + Android, single codebase: MLC LLM. The only mature runtime where one compile pipeline takes the same checkpoint to both platforms. ExecuTorch is catching up but is still harder to ship.
- iOS-only, Apple-first design language: MLX Swift. Apple-maintained, same model checkpoints as the Mac MLX-LM ecosystem. See the iPhone on-device AI stack.
- Android-only, Snapdragon-only, throughput-critical: Qualcomm AI Hub. Lock-in is the price for the Hexagon NPU advantage. See the Android on-device AI stack for the comparison with MLC LLM.
- You already use llama.cpp on desktop and want minimal new toolchain: llama.cpp. GGUF is the most portable model format. iOS Metal and Android CPU paths both ship. Skip it if you want the absolute best mobile throughput.
- PyTorch-native shop with cross-platform mobile + edge ambitions: ExecuTorch. Newer and harder, but the trajectory is the right one if your team is already on PyTorch.
- ONNX-first existing pipeline: ONNX Runtime Mobile. Otherwise skip — there's no reason to start a new mobile LLM project on ONNX in 2026 unless you have an existing ONNX workflow you don't want to disturb.
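The same flow, condensed into code. This is a reading aid for the list above, not anyone's SDK: `Runtime`, `Constraints`, and `pick` are illustrative names of our own.

```swift
// Hypothetical helper encoding the decision flow above.
// All names are illustrative, not from any runtime's SDK.
enum Runtime { case mlcLLM, mlxSwift, qualcommAIHub, llamaCpp, executorch, onnxRuntimeMobile }

struct Constraints {
    var needsIOS: Bool
    var needsAndroid: Bool
    var snapdragonOnly: Bool        // Android fleet is all-Snapdragon
    var hasDesktopLlamaCpp: Bool    // llama.cpp already in production elsewhere
    var pyTorchNative: Bool         // team authors models in PyTorch
    var onnxPipeline: Bool          // existing ONNX conversion workflow
}

func pick(_ c: Constraints) -> Runtime {
    if c.needsIOS && c.needsAndroid { return .mlcLLM }   // single codebase, both platforms
    if c.needsIOS { return .mlxSwift }                   // Apple-first, iOS-only
    if c.snapdragonOnly { return .qualcommAIHub }        // Hexagon NPU, accept lock-in
    if c.hasDesktopLlamaCpp { return .llamaCpp }         // minimal new toolchain
    if c.pyTorchNative { return .executorch }            // one toolchain, longer runway
    if c.onnxPipeline { return .onnxRuntimeMobile }      // only with an existing ONNX workflow
    return .mlcLLM                                       // default cross-platform pick
}
```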
Where we have measured numbers (and where we don't)
We deliberately avoid one-line tok/s comparisons across runtimes because the comparison is meaningless without device, model, quant, and thermal state pinned. The current state of measured coverage lives at /benchmarks/mobile-edge: we list the devices where we have measurements, the runtimes that produced them, and the gaps. If you've measured a configuration we don't have, contribute at /submit/benchmark; if there's a measurement you want us to commission, file at /benchmarks/request.
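When you do submit a measurement, pin those variables explicitly. A minimal record shape, assuming nothing about the actual /submit/benchmark schema (field names are ours):

```swift
import Foundation

// Illustrative record for a single measurement; field names are ours,
// not the /submit/benchmark schema. The point is what must be pinned
// before a tok/s number means anything.
struct MobileBenchRecord: Codable {
    let device: String            // e.g. "iPhone 16 Pro"
    let osVersion: String
    let runtime: String           // e.g. "MLC LLM", "llama.cpp"
    let runtimeVersion: String
    let model: String             // e.g. "Llama-3.2-3B-Instruct"
    let quant: String             // e.g. "Q4_K_M", "4-bit MLC"
    let thermalState: String      // captured at measurement time, not assumed
    let prefillTokPerS: Double
    let decodeTokPerS: Double
}
```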
Two general 2026 patterns worth flagging:
- On Snapdragon flagships, Qualcomm AI Hub's Hexagon path tends to beat MLC LLM's Adreno path by a meaningful margin per Qualcomm's published numbers; community measurements broadly agree.
- On iPhone, MLX Swift and MLC LLM are within ~10-15% of each other on most 3B Q4 workloads. The difference is more about toolchain ergonomics than throughput.
Maintenance reality and ecosystem health
A runtime is only as useful as its maintenance trajectory. As of May 2026:
- llama.cpp: weekly releases, dominant mobile-community gravity for GGUF, ARM SIMD optimizations land regularly. Most stable bet for “will this still work in 2 years?”
- MLC LLM: monthly releases, active model-zoo updates. Cross-device compilation is the moat.
- ExecuTorch: PyTorch-team maintained, growing quickly, vendor delegate ecosystem expanding. Worth betting on for new projects with a 12+ month timeline.
- ONNX Runtime Mobile: Microsoft-maintained, stable, but mobile-LLM-specific features land slower than the others.
- Qualcomm AI Hub: Qualcomm-maintained for the chip generations they care about. Support for older Snapdragons drops with each new flagship; plan around a roughly 2-year support window.
Common failure modes across all of them
- The model loads but inference is silently slow. Almost always thermal throttling. Cool the device and retry; a minimal runtime check is sketched after this list.
- Quant format mismatch. An MLC-compiled artifact won't load in llama.cpp; a QNN model won't load in MLC. Each runtime has its own format; you cannot share quantized checkpoints between them.
- OS update breaks the build. iOS and Android both occasionally break GPU compute or NPU paths in major releases. Test on developer betas; pin a minimum OS version in your manifest.
- App size review pushback. The App Store and Play Store both push back on apps that bundle >500 MB models. Download-on-first-launch is the standard pattern (also sketched after this list).
- Vendor SDK version drift. Qualcomm QNN, Apple Core ML, Google AICore — all have version mismatches that surface as silent quality regressions, not loud crashes.
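On the thermal failure mode: iOS at least exposes the throttle state directly, so the “silently slow” case is detectable before you start a generation. A minimal guard using `ProcessInfo.thermalState`, which is a real API; how you act on it is up to your runtime:

```swift
import Foundation

// ProcessInfo.thermalState is a real iOS API; .serious and .critical
// are the states where sustained GPU/NPU work gets throttled hard.
func safeToRunInference() -> Bool {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal, .fair:
        return true
    case .serious, .critical:
        return false   // expect silently bad tok/s; defer or warn the user
    @unknown default:
        return false
    }
}

// Re-check when the state changes rather than polling:
let observer = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { _ in
    // e.g. pause a queued generation when safeToRunInference() flips to false
}
```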
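And on the app-size failure mode, the download-on-first-launch pattern at its simplest, sketched with plain URLSession. The URL and filename are placeholders; a production app would use a background URLSession so a multi-GB transfer survives suspension.

```swift
import Foundation

// Minimal download-on-first-launch sketch: ship the app without weights,
// fetch the model once, keep it in Application Support. URL is a placeholder.
func ensureModelAvailable() async throws -> URL {
    let fm = FileManager.default
    let dir = try fm.url(for: .applicationSupportDirectory,
                         in: .userDomainMask, appropriateFor: nil, create: true)
    let dest = dir.appendingPathComponent("model-q4.gguf")  // placeholder name
    if fm.fileExists(atPath: dest.path) { return dest }     // already downloaded

    let remote = URL(string: "https://example.com/models/model-q4.gguf")!
    let (tmp, response) = try await URLSession.shared.download(from: remote)
    guard (response as? HTTPURLResponse)?.statusCode == 200 else {
        throw URLError(.badServerResponse)
    }
    try fm.moveItem(at: tmp, to: dest)
    return dest
}
```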
Going deeper
- Run local AI on iPhone
- Run local AI on Android
- Can phones run local LLMs?
- iPhone on-device AI stack
- Android on-device AI stack
- Mobile / edge benchmark gap report
- Runtime comparison surface — desktop and server runtimes.
Pick your runtime path
Most operators shipping iOS + Android from one codebase land on the MLC LLM path.