Runner · free + open-source

MLC LLM

TVM-based LLM compilation framework. Compiles models for any GPU with a Vulkan / Metal / WebGPU / CUDA backend. The most-deployed cross-platform on-device LLM runtime — runs Llama, Phi, Gemma, Qwen on phones, browsers, and laptops without per-platform rewrites.

By Fredoline Eruo · Last verified May 9, 2026 · 19,000 GitHub stars

Setup guidance

Install via pip (requires Python 3.10+):

  pip install mlc-llm

You also need a supported runtime: CUDA 12.1+ (NVIDIA), Metal (Apple Silicon), Vulkan (all GPUs, including Intel iGPU), or ROCm (AMD). MLC-LLM works differently from most engines: models must be compiled to a platform-specific library via TVM Unity before inference.

To chat with a pre-compiled model (downloads the model and starts an interactive CLI):

  mlc_llm chat HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC

For server mode, which exposes an OpenAI-compatible API at /v1/chat/completions:

  mlc_llm serve HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC --port 8080

Verify the server is up:

  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'

For custom models, compile with mlc_llm compile <model-path> --device <target>. The first run downloads the pre-compiled model package (~4–6 GB for a 7B) and starts in 3–8 minutes; time-to-first-response from zero is ~3 minutes with a pre-compiled model. MLC-LLM also supports WebGPU (browser) and iOS/Android via native runtimes.
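The curl check above can also be scripted. A minimal client sketch for the OpenAI-compatible endpoint that mlc_llm serve exposes, assuming the server from this section is running on localhost:8080; the payload shape follows the standard chat-completions convention, and the helper names here are illustrative, not part of MLC-LLM:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches `mlc_llm serve ... --port 8080`


def build_chat_payload(prompt: str, model: str = "default") -> dict:
    """Build a chat-completions request body for one user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str) -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard chat-completions response shape: first choice's message content.
    return body["choices"][0]["message"]["content"]


# Usage (with the server running):
# reply = chat("Hello")
```

Because the API is OpenAI-compatible, any existing OpenAI client can be pointed at the same base URL instead.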

Workload fit

Best for:
  • Cross-platform local inference where the same model must run on phone, laptop, and server
  • WebGPU browser-based inference (in-browser LLM demos without a backend server)
  • Mobile on-device inference with optimized native runtimes
  • Heterogeneous GPU deployments (Intel Arc, AMD Radeon, Qualcomm Adreno, Apple GPU) where CUDA-only engines can't deploy
  • Research and experimentation with model compilation pipelines

Not suited for:
  • Rapid model iteration where compilation cost kills velocity (use Ollama or llama.cpp)
  • Maximum-throughput NVIDIA datacenter serving (use vLLM)
  • GGUF-based model ecosystems without MLC-format re-compilation
  • Users who need point-and-click setup (MLC-LLM requires compilation awareness)

Alternatives

Use MLC-LLM when you need inference across the widest range of device targets — Windows, Linux, macOS, iOS, Android, and WebGPU (browser) from a single compilation pipeline. MLC-LLM's TVM-based compilation approach produces the best GPU utilization on non-NVIDIA hardware (Intel iGPU, Mali, Adreno mobile GPUs) of any engine. Switch to llama.cpp when you need instant model loading without a compilation step — MLC-LLM requires pre-compiled model packages. Use vLLM for NVIDIA datacenter production serving where throughput matters more than deployment breadth. Use MLX-LM on Apple Silicon for simpler setup — MLC-LLM works on Apple Silicon but requires the compilation step that MLX-LM skips. MLC-LLM's unique value is "write once, deploy everywhere" — the same compiled model runs on a phone, laptop, and server.

Troubleshooting + when to switch

Problem: TVMError: Cannot find tuned kernel for target <gpu_arch>.
Fix: The pre-compiled model was built for a different GPU architecture. Download a model compiled for your specific target: mlc_llm chat HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC --device vulkan:1.3 for Vulkan, --device metal for Apple, --device cuda for NVIDIA. MLC-LLM model packages are device-specific.

Problem: Compilation from source takes hours.
Fix: MLC-LLM model compilation is TVM-level auto-tuning: it searches a kernel space for optimal tensor operations. Use --opt O2 instead of O3 for faster compilation at a 5–10% throughput loss. For development, always use pre-compiled models from the MLC-AI org on Hugging Face.

Problem: WebGPU browser deployment fails on Firefox.
Fix: WebGPU model serving requires Chrome/Edge (Chromium) with WebGPU enabled. Firefox WebGPU support is behind a flag and not production-ready. Test on Chrome Canary or Edge Dev with the --enable-unsafe-webgpu flag.
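The first fix above amounts to matching the --device flag to the host platform. A minimal sketch of that selection logic; the mapping is an assumption based on the text (Metal on macOS, CUDA when an NVIDIA toolchain is present, Vulkan as the cross-vendor fallback), and pick_device / chat_command are hypothetical helper names, not part of the mlc_llm CLI:

```python
import platform
import shutil


def pick_device() -> str:
    """Guess a sensible --device target for this machine."""
    if platform.system() == "Darwin":
        return "metal"          # Apple Silicon path
    if shutil.which("nvidia-smi"):
        return "cuda"           # NVIDIA driver/toolchain detected
    return "vulkan"             # cross-vendor fallback (AMD/Intel/Qualcomm)


def chat_command(model_ref: str) -> list[str]:
    """Assemble an mlc_llm chat invocation with the detected device."""
    return ["mlc_llm", "chat", model_ref, "--device", pick_device()]


# Example:
# chat_command("HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC")
```

This only selects the flag; you still need a model package compiled for that device, since MLC-LLM packages are device-specific.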

Featured in this stack

The L3 execution stacks that pick this tool as a recommended component, with a one-line note explaining the role it plays in each.

  • Stack · L3 · Homelab tier · Role: Cross-device runtime (Adreno GPU path)
    Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub

    MLC LLM is the cross-platform choice. Same model checkpoint compiles for Adreno GPU + iOS Metal + WebGPU. The right pick when you need Android + iOS shipping from one toolchain. Adreno path doesn't use the Hexagon NPU.

Pros

  • Cross-platform via TVM — same model compiles for iOS/Android/Web/desktop
  • Strongest mobile LLM benchmark numbers as of 2026
  • WebGPU path enables in-browser LLM inference

Cons

  • Compile-time overhead is real — not a drop-in runtime
  • Quant ecosystem narrower than llama.cpp (relies on TVM-specific quants)
  • Documentation density trails llama.cpp / vLLM

Compatibility

Operating systems
iOS
Android
Windows
macOS
Linux
GPU backends
NVIDIA
AMD
Apple
Qualcomm Adreno
Mali
License: free + open-source

Runtime health

Operator-grade signals on how actively MLC LLM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal for this tool.

Active
Updated May 9, 2026

5 days since last refresh

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Get MLC LLM

Official site
https://llm.mlc.ai
GitHub
https://github.com/mlc-ai/mlc-llm

Frequently asked

Is MLC LLM free?

Yes. MLC LLM is free and open source; there is no paid tier.

What operating systems does MLC LLM support?

MLC LLM supports iOS, Android, Windows, macOS, Linux.

Which GPUs work with MLC LLM?

MLC LLM supports NVIDIA, AMD, Apple, Qualcomm Adreno, Mali. CPU-only inference is also possible but slow.
See something off? Report outdated · Suggest a correction. We read every submission; editorial review takes 1–7 days.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.

Related — keep moving

Compare hardware
  • RTX 3090 vs RTX 4090 →
  • Apple M4 Max vs RTX 4090 →
Buyer guides
  • Best GPU for local AI →
  • Best budget GPU →
When it doesn't work
  • llama.cpp too slow →
  • llama.cpp build failed →
  • llama.cpp Metal crash (Mac) →
  • GGUF tokenizer mismatch →
Recommended hardware
  • RTX 3090 (used) →
  • Apple M4 Max →
Alternatives
MLX-LM · ExLlamaV2 · llama.cpp · Llamafile · Ollama · IPEX-LLM · CTranslate2 · Intel OpenVINO
Before you buy

Verify MLC LLM runs on your specific hardware before committing money.

  • Will it run on my hardware? →
  • Custom hardware comparison →
  • GPU recommender (4 questions) →