Mobile guide · Runtime comparison

Best mobile AI runtimes (May 2026) — MLC LLM vs ExecuTorch vs llama.cpp vs ONNX vs Qualcomm AI Hub

Cross-platform mobile LLM runtime tier list. Which OS, which chips, which models, setup complexity, maintenance reality. The honest comparison for shipping on iPhone, Android, or both.

By Fredoline Eruo · Last reviewed 2026-05-08

The five runtimes that matter

On mobile in 2026, five runtimes are real: MLC LLM, ExecuTorch, llama.cpp, ONNX Runtime Mobile, and Qualcomm AI Hub. For iPhone-only apps, MLX Swift is a sixth; we treat it separately because the cross-platform argument doesn't apply to it.

The wrong question to ask is “which is fastest?” Speed varies by device, model, and quant — and by what you measure. The right question is which one fits the OS surface, the chip family, and the model envelope you actually need to ship. That is what this page answers.

The tier list

Tier S (ship-it picks for most operators):

  • MLC LLM — for cross-platform shipping. Same checkpoint compiles for iOS Metal, Android Adreno (and Mali on some flagships), WebGPU. Active community, model zoo with pre-quantized checkpoints. Setup complexity is medium.
  • llama.cpp — for the “already on desktop, same code on mobile” case. GGUF model zoo is the largest in mobile. Mobile build paths exist for both iOS (Metal) and Android (CPU + OpenCL or Vulkan). Active maintenance.
  • Qualcomm AI Hub — for Snapdragon-only apps that need maximum NPU throughput. Best raw performance on Hexagon, but you give up cross-vendor.
  • MLX Swift — for iOS-only apps. Apple's first-party path; same checkpoints as desktop MLX-LM. Not on this main list because the question implies cross-platform; mentioned here for completeness.

Tier A (good fits when constraints match):

  • ExecuTorch — for PyTorch-native model authoring shops that want one toolchain across mobile and edge. Vendor delegate ecosystem is growing. The 2026 trajectory is good but the runtime is younger than llama.cpp.
  • ONNX Runtime Mobile — for shops already on ONNX for desktop or cloud. Stable, well-documented, but model conversion is heavy and the LLM-on-mobile community is smaller than the MLC LLM / llama.cpp camps.

The cross-axis comparison

| Runtime             | iOS         | Android                   | NPU path                    | Model format | Setup       |
| ------------------- | ----------- | ------------------------- | --------------------------- | ------------ | ----------- |
| MLC LLM             | Yes (Metal) | Yes (Adreno + Mali)       | No (GPU only)               | MLC-compiled | Medium      |
| llama.cpp           | Yes (Metal) | Yes (CPU + OpenCL/Vulkan) | No                          | GGUF         | Easy-Medium |
| ExecuTorch          | Yes         | Yes                       | Vendor delegates            | .pte         | Hard        |
| ONNX Runtime Mobile | Yes         | Yes                       | NNAPI (deprecated), Core ML | ONNX         | Hard        |
| Qualcomm AI Hub     | No          | Snapdragon only           | Yes (Hexagon)               | QNN-compiled | Medium      |
| MLX Swift           | Yes         | No                        | ANE via Core ML             | MLX 4-bit    | Medium      |

How to pick (decision flow)

  1. Cross-platform iOS + Android, single codebase: MLC LLM. The only mature runtime where the same compiled artifact runs on both. ExecuTorch is catching up but is still harder to ship.
  2. iOS-only, Apple-first design language: MLX Swift. Apple-maintained, same model checkpoints as the Mac MLX-LM ecosystem. See the iPhone on-device AI stack.
  3. Android-only, Snapdragon-only, throughput-critical: Qualcomm AI Hub. Lock-in is the price for the Hexagon NPU advantage. See the Android on-device AI stack for the comparison with MLC LLM.
  4. You already use llama.cpp on desktop and want minimal new toolchain: llama.cpp. GGUF is the most-portable model format. iOS Metal and Android CPU paths both ship. Skip if you want the absolute best mobile throughput.
  5. PyTorch-native workshop with cross-platform mobile + edge ambitions: ExecuTorch. Newer, harder, but the trajectory is the right one if your team is already on PyTorch.
  6. ONNX-first existing pipeline: ONNX Runtime Mobile. Otherwise skip — there's no reason to start a new mobile LLM project on ONNX in 2026 unless you have an existing ONNX workflow you don't want to disturb.
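The six branches above reduce to a small lookup. This is an illustrative sketch only — the function name and constraint flags are our invention, not part of any runtime's API:

```python
def pick_runtime(ios: bool, android: bool, snapdragon_only: bool = False,
                 pytorch_native: bool = False, onnx_pipeline: bool = False,
                 desktop_llama_cpp: bool = False) -> str:
    """Map shipping constraints to the recommended runtime (hypothetical helper)."""
    if ios and android:
        if pytorch_native:
            return "ExecuTorch"       # harder today, right trajectory for PyTorch shops
        return "MLC LLM"              # same compiled artifact runs on both OSes
    if ios:
        return "MLX Swift"            # Apple first-party, shares Mac MLX-LM checkpoints
    if snapdragon_only:
        return "Qualcomm AI Hub"      # Hexagon NPU throughput; vendor lock-in is the price
    if desktop_llama_cpp:
        return "llama.cpp"            # reuse the desktop GGUF toolchain
    if onnx_pipeline:
        return "ONNX Runtime Mobile"  # only with an existing ONNX workflow
    return "MLC LLM"                  # default cross-platform pick
```

Note the ordering mirrors the flow: cross-platform constraints are checked before vendor-specific ones, so a PyTorch-native shop targeting both OSes lands on ExecuTorch rather than MLC LLM.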

Where we have measured numbers (and where we don't)

We deliberately avoid one-line tok/s comparisons across runtimes because the comparison is meaningless without device, model, quant, and thermal state pinned. The current state of measured coverage lives at /benchmarks/mobile-edge: we list the devices where we have measurements, the runtimes that produced them, and the gaps. If you've measured a configuration we don't have, contribute at /submit/benchmark; if there's a measurement you want us to commission, file at /benchmarks/request.
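To make "pinned" concrete, here is a hypothetical measurement record: two samples are only worth comparing when everything except the runtime is held fixed. Field names and example values are our assumptions, not the actual /benchmarks/mobile-edge schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ThroughputSample:
    """One throughput measurement with its full configuration pinned (illustrative)."""
    device: str          # e.g. "iPhone 16 Pro"
    runtime: str         # e.g. "MLC LLM"
    model: str           # e.g. "Llama-3.2-3B"
    quant: str           # e.g. "Q4_K_M"
    thermal_state: str   # e.g. "nominal" — throttled runs are not comparable
    prefill_tok_s: float
    decode_tok_s: float

def comparable(a: ThroughputSample, b: ThroughputSample) -> bool:
    """Samples are comparable only if everything except the runtime matches."""
    da, db = asdict(a), asdict(b)
    return all(da[f] == db[f] for f in ("device", "model", "quant", "thermal_state"))
```

A one-line "runtime X does N tok/s" claim drops every field but `decode_tok_s`, which is exactly why we refuse to publish such comparisons.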

Two general 2026 patterns worth flagging:

  • On Snapdragon flagships, the Qualcomm AI Hub Hexagon path tends to beat the MLC LLM Adreno path by a meaningful margin, per Qualcomm's published numbers; community measurements broadly agree.
  • On iPhone, MLX Swift and MLC LLM are within ~10-15% of each other on most 3B Q4 workloads. The difference is more about toolchain ergonomics than throughput.

Maintenance reality and ecosystem health

A runtime is only as useful as its maintenance trajectory. As of May 2026:

  • llama.cpp: weekly releases, dominant mobile-community gravity for GGUF, ARM SIMD optimizations land regularly. The most stable bet for “will this still work in 2 years?”
  • MLC LLM: monthly releases, active model-zoo updates. Cross-device compilation is the moat.
  • ExecuTorch: PyTorch-team maintained, growing quickly, vendor delegate ecosystem expanding. Worth betting on for new projects with a 12+ month timeline.
  • ONNX Runtime Mobile: Microsoft-maintained, stable, but mobile-LLM-specific features land slower than the others.
  • Qualcomm AI Hub: Qualcomm-maintained for the chip generations they care about. Older Snapdragon support drops on each new flagship; plan for the 2-year support window.

Common failure modes across all of them

  • The model loads but inference is silently slow. Almost always thermal throttle. Cool the device, retry.
  • Quant format mismatch. MLC-compiled artifact won't load in llama.cpp; QNN model won't load in MLC. Each runtime has its own format; you cannot share quantized checkpoints between them.
  • OS update breaks the build. iOS and Android both occasionally break GPU compute or NPU paths in major releases. Test on developer betas; pin minimum-OS in your manifest.
  • App size review pushback. App Store and Play Store both push back on apps that bundle >500 MB models. Download-on-first-launch is the standard pattern.
  • Vendor SDK version drift. Qualcomm QNN, Apple Core ML, Google AICore — all have version mismatches that surface as silent quality regressions, not loud crashes.
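The first failure mode can be caught cheaply in telemetry. A minimal sketch, under our own assumptions (the 30% drop threshold and window size are arbitrary illustrative defaults, not a value any runtime publishes): flag a run as likely throttled when trailing decode speed falls well below the opening speed.

```python
from statistics import mean

def throttle_suspected(decode_tok_s: list[float], window: int = 5,
                       drop_threshold: float = 0.3) -> bool:
    """Heuristic: True if the mean decode speed over the last `window` samples
    has dropped more than `drop_threshold` below the first `window` samples.
    A sustained drop mid-run on mobile is almost always thermal throttling."""
    if len(decode_tok_s) < 2 * window:
        return False  # not enough samples to compare head vs tail
    head = mean(decode_tok_s[:window])
    tail = mean(decode_tok_s[-window:])
    return tail < head * (1.0 - drop_threshold)
```

In practice you would log per-generation decode speeds and run this check before trusting any benchmark number from the field.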

Going deeper

Pick your runtime path

Most operators shipping iOS + Android from one codebase land here.
