MLC LLM
Overview
TVM-based LLM compilation framework. Compiles models for any GPU that exposes a Vulkan, Metal, WebGPU, or CUDA backend. One of the most widely deployed cross-platform on-device LLM runtimes: it runs Llama, Phi, Gemma, and Qwen on phones, in browsers, and on laptops without per-platform rewrites.
Setup guidance
Install via pip: pip install mlc-llm. Requires Python 3.10+ and a supported runtime: CUDA 12.1+ (NVIDIA), Metal (Apple Silicon), Vulkan (any GPU, including Intel iGPUs), or ROCm (AMD). MLC LLM works differently from most engines: models must be compiled to a platform-specific library via TVM Unity before inference, so the usual starting point is a pre-compiled package.
Chat: mlc_llm chat HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC downloads a pre-compiled model and starts an interactive CLI.
Serve: mlc_llm serve HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC --port 8080 exposes an OpenAI-compatible API at /v1/chat/completions.
Verify: curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
Custom models: compile with mlc_llm compile <model-path> --device <target>.
The first run downloads the pre-compiled model package (~4–6 GB for a 7B model); expect roughly 3–8 minutes from zero to first response, depending on download speed. MLC LLM also supports WebGPU (browser) and iOS/Android via native runtimes.
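Once the server is up, any OpenAI-style HTTP client can talk to it. A minimal Python sketch, assuming the serve command above is running on localhost:8080 and using the same "default" model name as the curl check:

```python
# Minimal sketch: call the OpenAI-compatible endpoint exposed by `mlc_llm serve`.
# Assumes the server above is listening on localhost:8080; "default" mirrors
# the model name used in the curl verification example.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI chat-completions shape, the official openai Python client should also work if you point its base_url at http://localhost:8080/v1.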
Workload fit
Best for:
- cross-platform local inference where the same model must run on phone, laptop, and server
- WebGPU browser-based inference (in-browser LLM demos without a backend server)
- mobile on-device inference with optimized native runtimes
- heterogeneous GPU deployments (Intel Arc, AMD Radeon, Qualcomm Adreno, Apple GPU) where CUDA-only engines can't deploy
- research and experimentation with model compilation pipelines
Not suited for:
- rapid model iteration where compilation cost kills velocity (use Ollama or llama.cpp)
- maximum-throughput NVIDIA datacenter serving (use vLLM)
- GGUF-based model ecosystems without re-compilation to MLC format
- users who need point-and-click setup (MLC LLM requires compilation awareness)
Alternatives
Use MLC LLM when you need inference across the widest range of device targets: Windows, Linux, macOS, iOS, Android, and WebGPU (browser) from a single compilation pipeline. Its TVM-based compilation approach delivers some of the best GPU utilization available on non-NVIDIA hardware (Intel iGPU, Mali, Adreno mobile GPUs). Switch to llama.cpp when you need instant model loading without a compilation step, since MLC LLM requires pre-compiled model packages. Use vLLM for NVIDIA datacenter production serving where throughput matters more than deployment breadth. Use MLX-LM on Apple Silicon for simpler setup; MLC LLM works on Apple Silicon but requires the compilation step that MLX-LM skips. MLC LLM's unique value is "write once, deploy everywhere": the same compiled model runs on a phone, laptop, and server.
Troubleshooting + when to switch
Problem: TVMError: Cannot find tuned kernel for target <gpu_arch>. Fix: the pre-compiled model was built for a different GPU architecture. Download a package compiled for your target and pass the matching device flag: mlc_llm chat HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC --device vulkan for Vulkan, --device metal for Apple, --device cuda for NVIDIA. MLC LLM model packages are device-specific.
Problem: compilation from source takes hours. Fix: MLC LLM model compilation is TVM-level auto-tuning; it searches a kernel space for optimal tensor operations. Use --opt O2 instead of O3 for faster compilation at a 5–10% throughput loss, and for development always use pre-compiled models from the mlc-ai org on Hugging Face.
Problem: WebGPU browser deployment fails on Firefox. Fix: WebGPU model serving requires a Chromium browser (Chrome/Edge) with WebGPU enabled; Firefox WebGPU support is behind a flag and not production-ready. Test on Chrome Canary or Edge Dev, enabling the --enable-unsafe-webgpu flag if WebGPU is not already on by default.
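Because model packages and kernels are device-specific, deployment scripts often need to choose the --device flag per machine. A rough Python sketch under stated assumptions: the detection heuristics (platform.system(), nvidia-smi) are illustrative only, and it assumes mlc_llm serve accepts the same --device values shown for mlc_llm chat above.

```python
# Illustrative only: choose a --device value per platform before launching the
# server. Detection heuristics (platform.system(), nvidia-smi) are assumptions;
# adapt them to your actual hardware inventory.
import platform
import shutil
import subprocess

MODEL = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"

def pick_device() -> str:
    if platform.system() == "Darwin":
        return "metal"      # Apple Silicon GPUs
    if shutil.which("nvidia-smi"):
        return "cuda"       # NVIDIA GPUs (CUDA 12.1+)
    return "vulkan"         # fallback covering AMD and Intel iGPUs

subprocess.run(
    ["mlc_llm", "serve", MODEL, "--device", pick_device(), "--port", "8080"],
    check=True,
)
```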
Featured in this stack
The L3 execution stacks that recommend this tool as a component, with a one-line note on the role it plays in each.
- Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub · L3 · Homelab tier · Role: cross-device runtime (Adreno GPU path)
MLC LLM is the cross-platform choice. Same model checkpoint compiles for Adreno GPU + iOS Metal + WebGPU. The right pick when you need Android + iOS shipping from one toolchain. Adreno path doesn't use the Hexagon NPU.
Pros
- Cross-platform via TVM — same model compiles for iOS/Android/Web/desktop
- Among the strongest published mobile LLM benchmark numbers as of 2026
- WebGPU path enables in-browser LLM inference
Cons
- Compile-time overhead is real — not a drop-in runtime
- Quantization ecosystem narrower than llama.cpp's (relies on TVM-specific quantization formats)
- Documentation density trails llama.cpp / vLLM
Compatibility
| Operating systems | iOS · Android · Windows · macOS · Linux |
| GPU backends | NVIDIA · AMD · Apple · Qualcomm Adreno · Mali |
| License | Open source · free |
Runtime health
Operator-grade signals on how actively MLC LLM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
5 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Frequently asked
Is MLC LLM free?
Yes. MLC LLM is free and open source.
What operating systems does MLC LLM support?
iOS, Android, Windows, macOS, and Linux, plus in-browser deployment via WebGPU.
Which GPUs work with MLC LLM?
NVIDIA (CUDA), AMD (ROCm/Vulkan), Apple (Metal), Qualcomm Adreno, and Mali GPUs, plus WebGPU in the browser.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
Related — keep moving
Verify that MLC LLM runs on your specific hardware before committing any money.