MLC LLM
Overview
TVM-based LLM compilation framework. Compiles models for any GPU that exposes a Vulkan, Metal, WebGPU, or CUDA backend. One of the most widely deployed cross-platform on-device LLM runtimes: it runs Llama, Phi, Gemma, and Qwen on phones, in browsers, and on laptops without per-platform rewrites.
Setup guidance
Install via pip: pip install mlc-llm. Requires Python 3.10+ and a supported runtime: CUDA 12.1+ (NVIDIA), Metal (Apple Silicon), Vulkan (any GPU, including Intel iGPUs), or ROCm (AMD). MLC LLM works differently from most engines: models must be compiled to a platform-specific library via TVM Unity before inference, so the usual starting point is a pre-compiled package.
Chat: mlc_llm chat HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC downloads a pre-compiled model and starts an interactive CLI.
Serve: mlc_llm serve HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC --port 8080 exposes an OpenAI-compatible API at /v1/chat/completions.
Verify: curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
Custom models: compile with mlc_llm compile <model-path> --device <target>.
The first run downloads the pre-compiled model package (~4–6 GB for a 7B model); expect roughly 3–8 minutes from zero to first response, depending on download speed. MLC LLM also supports WebGPU (browser) and iOS/Android via native runtimes.
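Once the server is up, any OpenAI-style HTTP client can talk to it. A minimal Python sketch, assuming the serve command above is running on localhost:8080 and using the same "default" model name as the curl check:

```python
# Minimal sketch: call the OpenAI-compatible endpoint exposed by `mlc_llm serve`.
# Assumes the server above is listening on localhost:8080; "default" mirrors
# the model name used in the curl verification example.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI chat-completions shape, the official openai Python client should also work if you point its base_url at http://localhost:8080/v1.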
Workload fit
Best for:
- cross-platform local inference where the same model must run on phone, laptop, and server
- WebGPU browser-based inference (in-browser LLM demos without a backend server)
- mobile on-device inference with optimized native runtimes
- heterogeneous GPU deployments (Intel Arc, AMD Radeon, Qualcomm Adreno, Apple GPU) where CUDA-only engines can't deploy
- research and experimentation with model compilation pipelines
Not suited for:
- rapid model iteration where compilation cost kills velocity (use Ollama or llama.cpp)
- maximum-throughput NVIDIA datacenter serving (use vLLM)
- GGUF-based model ecosystems without re-compilation to MLC format
- users who need point-and-click setup (MLC LLM requires compilation awareness)
Alternatives
Use MLC LLM when you need inference across the widest range of device targets: Windows, Linux, macOS, iOS, Android, and WebGPU (browser) from a single compilation pipeline. Its TVM-based compilation approach delivers some of the best GPU utilization available on non-NVIDIA hardware (Intel iGPU, Mali, Adreno mobile GPUs). Switch to llama.cpp when you need instant model loading without a compilation step, since MLC LLM requires pre-compiled model packages. Use vLLM for NVIDIA datacenter production serving where throughput matters more than deployment breadth. Use MLX-LM on Apple Silicon for simpler setup; MLC LLM works on Apple Silicon but requires the compilation step that MLX-LM skips. MLC LLM's unique value is "write once, deploy everywhere": the same compiled model runs on a phone, laptop, and server.
Troubleshooting + when to switch
Problem: TVMError: Cannot find tuned kernel for target <gpu_arch>. Fix: the pre-compiled model was built for a different GPU architecture. Download a package compiled for your target and pass the matching device flag: mlc_llm chat HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC --device vulkan for Vulkan, --device metal for Apple, --device cuda for NVIDIA. MLC LLM model packages are device-specific.
Problem: compilation from source takes hours. Fix: MLC LLM model compilation is TVM-level auto-tuning; it searches a kernel space for optimal tensor operations. Use --opt O2 instead of O3 for faster compilation at a 5–10% throughput loss, and for development always use pre-compiled models from the mlc-ai org on Hugging Face.
Problem: WebGPU browser deployment fails on Firefox. Fix: WebGPU model serving requires a Chromium browser (Chrome/Edge) with WebGPU enabled; Firefox WebGPU support is behind a flag and not production-ready. Test on Chrome Canary or Edge Dev, enabling the --enable-unsafe-webgpu flag if WebGPU is not already on by default.
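Because model packages and kernels are device-specific, deployment scripts often need to choose the --device flag per machine. A rough Python sketch under stated assumptions: the detection heuristics (platform.system(), nvidia-smi) are illustrative only, and it assumes mlc_llm serve accepts the same --device values shown for mlc_llm chat above.

```python
# Illustrative only: choose a --device value per platform before launching the
# server. Detection heuristics (platform.system(), nvidia-smi) are assumptions;
# adapt them to your actual hardware inventory.
import platform
import shutil
import subprocess

MODEL = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"

def pick_device() -> str:
    if platform.system() == "Darwin":
        return "metal"      # Apple Silicon GPUs
    if shutil.which("nvidia-smi"):
        return "cuda"       # NVIDIA GPUs (CUDA 12.1+)
    return "vulkan"         # fallback covering AMD and Intel iGPUs

subprocess.run(
    ["mlc_llm", "serve", MODEL, "--device", pick_device(), "--port", "8080"],
    check=True,
)
```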
Featured in this stack
The L3 execution stacks that recommend this tool as a component, with a one-line note on the role it plays in each.
- Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub · L3 · Homelab tier · Role: cross-device runtime (Adreno GPU path)
MLC LLM is the cross-platform choice. Same model checkpoint compiles for Adreno GPU + iOS Metal + WebGPU. The right pick when you need Android + iOS shipping from one toolchain. Adreno path doesn't use the Hexagon NPU.
Pros
- Cross-platform via TVM — same model compiles for iOS/Android/Web/desktop
- Among the strongest published mobile LLM benchmark numbers as of 2026
- WebGPU path enables in-browser LLM inference
Cons
- Compile-time overhead is real — not a drop-in runtime
- Quantization ecosystem narrower than llama.cpp's (relies on TVM-specific quantization formats)
- Documentation density trails llama.cpp / vLLM
Compatibility
| Operating systems | iOS · Android · Windows · macOS · Linux |
| GPU backends | NVIDIA · AMD · Apple · Qualcomm Adreno · Mali |
| License | Open source · free |
Runtime health
Operator-grade signals on how actively MLC LLM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
5 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Frequently asked
Is MLC LLM free?
Yes. MLC LLM is free and open source.
What operating systems does MLC LLM support?
iOS, Android, Windows, macOS, and Linux, plus in-browser deployment via WebGPU.
Which GPUs work with MLC LLM?
NVIDIA (CUDA), AMD (ROCm/Vulkan), Apple (Metal), Qualcomm Adreno, and Mali GPUs, plus WebGPU in the browser.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
Related — keep moving
Verify that MLC LLM runs on your specific hardware before committing any money.