Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub
App-bundled local LLM inference on Android using MLC LLM (cross-device GPU) or Qualcomm AI Hub (Snapdragon NPU). The two paths, honestly compared. No fake benchmark numbers; battery + thermal reality documented.
What this stack accomplishes
Android on-device AI has two valid paths, not one. The honest answer depends on whether your app needs to run on devices outside the Snapdragon flagship envelope:
- MLC LLM: cross-device, Adreno GPU path. Same toolchain compiles for Android + iOS + Web. Slower per-token than NPU but works on most modern Android phones.
- Qualcomm AI Hub: Snapdragon NPU path. Faster and lower-power on Hexagon NPU; locked to Qualcomm chips. Excludes Pixel (Tensor G4) and MediaTek devices.
Hardware required
Snapdragon 8 Gen 3 / 8 Elite (Hexagon NPU, 45-80 TOPS INT8) for the NPU path · or any modern Adreno-equipped device for the MLC LLM GPU path · Android 13+ for ExecuTorch / MLC LLM compatibility · Android Studio + NDK for build/sign · ~5GB workstation storage for the model + Android Studio caches
Components — what to install and why
- 01 · Hardware · Target SoC (flagship 2024-2025) · snapdragon-8-elite
Snapdragon 8 Elite Hexagon NPU at ~80 TOPS INT8 + Adreno GPU. The 16GB RAM tier enables comfortable 3-4B model headroom. Pair with Qualcomm AI Hub for production NPU-first deployment.
- 02 · Hardware · Mid-tier 2023 flagship (still production-viable) · snapdragon-8-gen-3
Snapdragon 8 Gen 3 Hexagon NPU at 45 TOPS INT8 + 12GB+ RAM. The first widely-shipped Android NPU that runs 7B-class models on-device. Most Snapdragon-based Galaxy S24-generation deployments use this tier (Pixel 8 ships Tensor G3, not Snapdragon).
- 03 · Hardware · Pixel-only path · google-tensor-g4
Tensor G4 ships in Pixel 9. Google's Gemini Nano runs natively. NPU TOPS aren't publicly disclosed — community benchmarks suggest mid-Snapdragon parity. Tensor's path is Pixel-locked.
- 04 · Tool · Cross-device runtime (Adreno GPU path) · mlc-llm
MLC LLM is the cross-platform choice. Same model checkpoint compiles for Adreno GPU + iOS Metal + WebGPU. The right pick when you need Android + iOS shipping from one toolchain. Adreno path doesn't use the Hexagon NPU.
- 05 · Tool · Snapdragon NPU runtime (Hexagon path) · qualcomm-ai-hub
Qualcomm-published quants tuned for Hexagon NPU. The throughput leader on Snapdragon flagship phones — beats MLC LLM Adreno path by ~30-50% per Qualcomm's published numbers. Snapdragon-only; no Tensor G4 / MediaTek support.
- 06 · Tool · PyTorch-native alternative (NNAPI / Vulkan delegate) · executorch
ExecuTorch is PyTorch-first-party. Backend-pluggable: NNAPI (Android), Vulkan (cross-vendor GPU), custom NPU delegates. Pick when your model authoring is PyTorch-native and you don't want a separate compile pipeline.
- 07 · Model · Primary 3.8B model · phi-3.5-mini-instruct
Phi-3.5 Mini at INT4 (~2.3GB) fits comfortably on 12GB+ Android phones. MIT licensed. Microsoft's Phi Silica work runs a closely related Phi configuration on the Snapdragon X Elite NPU (Windows, not Android), a useful sanity check that the family is NPU-friendly.
- 08 · Model · Alternative 3B chat model · llama-3.2-3b-instruct
Llama 3.2 3B at INT4 (~1.9GB) fits with more headroom. Llama Community License permits app-bundling. Quants available on the MLC LLM model zoo. A RAM-based pick between the two models is sketched after this list.
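If you bundle both models, the pick can happen at runtime from total device RAM. A minimal sketch using ActivityManager.MemoryInfo; the threshold and the Llama model ID string are editorial assumptions to match whatever you actually compile:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Editorial heuristic: prefer Phi-3.5 Mini (~2.3GB INT4) on 12GB-class devices,
// fall back to the smaller Llama 3.2 3B (~1.9GB INT4) below that.
fun pickModelId(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo()
    am.getMemoryInfo(memInfo)
    val totalGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    // 12GB devices typically report ~11.x GB usable; the cutoff is a judgment call.
    return if (totalGb >= 11.0) {
        "phi-3.5-mini-instruct-q4f16_1"
    } else {
        "llama-3.2-3b-instruct-q4f16_1"   // hypothetical ID; match your compiled artifact name
    }
}
```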
Step-by-step setup (MLC LLM Android path)
MLC LLM is cross-platform. Same toolchain ships for Android + iOS + Web + desktop. The Android variant:
1. Clone + build the Android template
git clone https://github.com/mlc-ai/mlc-llm
cd mlc-llm/android
# MLC LLM ships an Android Studio template project
# Open in Android Studio:
# File → Open → mlc-llm/android/MLCChat
# Sync Gradle (auto-runs)
2. Compile a model for Android
# On your dev workstation
pip install mlc-llm
mlc_llm convert_weight \
HF://microsoft/Phi-3.5-mini-instruct \
--quantization q4f16_1 \
-o ./dist/phi-3.5-mini-instruct-q4f16_1-MLC
mlc_llm gen_config \
HF://microsoft/Phi-3.5-mini-instruct \
--quantization q4f16_1 \
--conv-template phi-3 \
-o ./dist/phi-3.5-mini-instruct-q4f16_1-MLC
mlc_llm compile \
./dist/phi-3.5-mini-instruct-q4f16_1-MLC/mlc-chat-config.json \
--device android \
-o ./dist/phi-3.5-mini-instruct-q4f16_1-android.tar
3. Bundle + ship
# In Android Studio, add the compiled .tar + weights to:
# app/src/main/assets/
# In MainActivity.kt:
val engine = MLCEngine()
engine.reload("phi-3.5-mini-instruct-q4f16_1")
val response = engine.chatCompletion(prompt = "Summarize this in one sentence: ...")
println("Tokens/sec: ${response.tokensPerSecond}")
println(response.text)
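Model load is the slow step (3-5 sec on a recent flagship), so pre-warm the engine off the main thread at app launch instead of on the first user tap. A minimal coroutine sketch reusing the simplified MLCEngine surface from the snippet above; treat the exact method names as illustrative and follow the MLCChat template for the real API:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch

// MLCEngine here is the same simplified engine object as in the snippet above.
class ModelWarmup(private val engine: MLCEngine) {

    // Call from Application.onCreate() or your first Activity; the 3-5 sec load
    // runs on an IO worker thread while the UI stays responsive.
    fun prewarm(scope: CoroutineScope) {
        scope.launch(Dispatchers.IO) {
            engine.reload("phi-3.5-mini-instruct-q4f16_1")
        }
    }
}
```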
Alternative: Qualcomm AI Hub (Snapdragon NPU path)
# Qualcomm AI Hub workflow:
# 1. Sign up at https://aihub.qualcomm.com
# 2. Browse the model zoo for pre-compiled NPU variants
# (Llama 3.2 3B, Phi-3.5 Mini, Gemma 3 1B, Qwen 2.5 3B)
# 3. Download the .qnn binary + sample app
# 4. Integrate with the QNN HTP runtime in your Android app
# 5. The NPU path runs ~30-50% faster than Adreno GPU on Snapdragon flagships
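The "integrate with the QNN HTP runtime" step is where most of the work lives: the runtime is native code, so the Kotlin side is usually a thin JNI bridge around the bindings the sample app ships. Every name below is hypothetical; it only illustrates the shape of that bridge, not Qualcomm's actual API:

```kotlin
// Hypothetical JNI bridge. The real library and symbol names come from the
// Qualcomm AI Hub sample app / your own native wrapper, not from this sketch.
object QnnLlmBridge {
    init {
        System.loadLibrary("qnn_llm_demo")   // hypothetical .so bundled under jniLibs/
    }

    // Hypothetical native entry points implemented in a C++ wrapper around
    // the QNN HTP runtime.
    external fun loadModel(modelPath: String): Boolean
    external fun generate(prompt: String, maxTokens: Int): String
}
```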
Thermal + battery reality check
Mobile NPU + GPU inference is thermally bounded:
- First 2-3 minutes: peak tok/s. Bursty UX (summarize an article in 30s) wins here.
- 5-10 minutes: throttle 25-40% under sustained load.
- Battery cost: ~5-10% per 10-min active session on Snapdragon 8 Elite (editorial estimate). Measure on your workload.
- Background interruption: Android aggressively kills backgrounded high-CPU apps. Use Foreground Service for long-running summarization tasks.
- Charging: mitigates throttle but adds heat in the other direction. Test plugged + unplugged.
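You can observe the throttle instead of guessing. Android's PowerManager exposes a thermal status callback (API 29+) and a throttling-headroom forecast (API 30+); a minimal sketch that drives a "device warming up" indicator:

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Flags when the device is hot enough that decode speed will visibly drop;
// wire the callback to a "device warming up" indicator in your UI.
class ThermalWatcher(context: Context, private val onThrottle: (Boolean) -> Unit) {
    private val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager

    fun start() {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            pm.addThermalStatusListener { status ->
                // Treating MODERATE and above as "throttling" is an editorial
                // mapping onto the 25-40% slowdown described above.
                onThrottle(status >= PowerManager.THERMAL_STATUS_MODERATE)
            }
        }
    }

    // Optional: forecast headroom 10s ahead (API 30+); values near 1.0 mean
    // throttling is imminent.
    fun headroom(): Float? =
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.R) pm.getThermalHeadroom(10) else null
}
```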
Expected outcome
Ship an Android app that loads a 3-4B model checkpoint at app start (~3-5 sec on Snapdragon 8 Gen 3+) and serves single-stream LLM inference at an editorial-estimated 8-25 tok/s decode (cold). Sustained-load throttling cuts that by 25-40% after 5-10 min. Battery cost: ~5-10% per 10-min active session (editorial estimate). Verify on your specific device before shipping.
MLC LLM vs Qualcomm AI Hub — honest comparison
| Dimension | MLC LLM (Adreno GPU) | Qualcomm AI Hub (Hexagon NPU) |
|---|---|---|
| Cross-device support | Snapdragon + Tensor + MediaTek + iOS + Web | Snapdragon only |
| Throughput on Snapdragon flagship | 10-22 tok/s (editorial estimate) | 12-25 tok/s (editorial estimate) |
| Battery efficiency | Higher draw (Adreno is GPU-class) | Lower draw (NPU is purpose-built) |
| Open source | Yes (Apache 2.0) | Compiled binaries; toolchain closed |
| Quant variety | q4f16, q4f32, q8f16, etc. (TVM-quants) | Vendor-published only |
| Setup complexity | Compile-step required (~10-30 min per model) | Pre-compiled drop-in |
| Model selection | Any Hugging Face model with TVM converter | Qualcomm-curated zoo |
Pick MLC LLM if your app must run on Pixel (Tensor), Exynos Samsungs, MediaTek devices, or anything else outside the Snapdragon line, or if you also ship iOS / Web from the same toolchain.
Pick Qualcomm AI Hub if your app is Snapdragon-only (gaming, niche pro apps) and the battery / thermal envelope matters more than cross-device reach.
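If you ship both paths in one app, the routing can happen at runtime. Build.SOC_MANUFACTURER is available from API 31; the string match below is an assumption to verify on the devices you actually target:

```kotlin
import android.os.Build

// Crude runtime routing: prefer the Qualcomm AI Hub / Hexagon path on Snapdragon
// silicon, fall back to the MLC LLM Adreno/OpenCL path everywhere else.
fun useHexagonPath(): Boolean {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.S) return false
    val manufacturer = Build.SOC_MANUFACTURER.lowercase()
    // Snapdragon devices commonly report "QTI" or "Qualcomm"; verify on your
    // target fleet, this match is an assumption.
    return "qualcomm" in manufacturer || "qti" in manufacturer
}
```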
Failure modes you'll hit
- Cold-start latency feels broken. First model load is 3-5 seconds on Snapdragon 8 Gen 3+. Pre-warm at app launch on a background thread.
- Memory pressure crashes on 8GB phones. 3-4B INT4 model + activations + Android UI on 8GB total = pressure. Test with low-memory devices; consider 1.5B / 3B fallback.
- Backgrounding kills inference. Android suspends apps aggressively. Use a Foreground Service + persistent notification for long-running summarization (a minimal service sketch follows this list).
- Thermal throttling looks like model degradation. Sustained load >5-10 min: 25-40% slowdown. Surface a “device warming up” indicator if your UX needs it.
- NDK / build-tools version conflicts. MLC LLM requires NDK 25+. A mismatched NDK in your project causes obscure linker errors. Pin the NDK version in build.gradle.
- APK size limits. Google Play caps the standard app download at 200MB even with App Bundles; a 3-4B INT4 model is ~2-3GB and requires Play Asset Delivery for distribution.
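For the backgrounding failure mode above, the standard fix is a foreground service. A minimal sketch (assumes minSdk 26+, which this stack's Android 13+ floor already implies); declare it in AndroidManifest.xml, add a foregroundServiceType on Android 14+, and start it via startForegroundService():

```kotlin
import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder

// Keeps a long-running summarization job alive after the user backgrounds the
// app; Android is far less likely to kill a service that is in the foreground
// with a visible notification.
class InferenceService : Service() {

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        startForeground(NOTIFICATION_ID, buildNotification())
        // Hand the prompt to your inference engine on a worker thread here,
        // then call stopSelf() when generation completes.
        return START_NOT_STICKY
    }

    override fun onBind(intent: Intent?): IBinder? = null

    private fun buildNotification(): Notification {
        val channel = NotificationChannel(
            CHANNEL_ID, "On-device inference", NotificationManager.IMPORTANCE_LOW
        )
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
        return Notification.Builder(this, CHANNEL_ID)
            .setContentTitle("Summarizing on-device")
            .setSmallIcon(android.R.drawable.stat_sys_download)
            .build()
    }

    companion object {
        private const val CHANNEL_ID = "inference"
        private const val NOTIFICATION_ID = 1
    }
}
```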
Troubleshooting
Symptom: Adreno GPU not detected. MLC LLM's OpenCL backend requires the device to expose Adreno via standard OpenCL drivers. Some non-Qualcomm Androids ship without working OpenCL. Fall back to ExecuTorch + Vulkan in that case.
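There is no official "is OpenCL available" API, so a common pre-flight heuristic is probing for the vendor driver library before committing to the Adreno path. The paths below are assumptions that vary by OEM; still attempt a real engine load before declaring the device supported:

```kotlin
import java.io.File

// Heuristic only: looks for an OpenCL driver in the usual vendor locations.
// Some OEM builds ship it elsewhere (or ship a broken stub), so treat a hit
// as "worth trying", not as a guarantee.
fun probablyHasOpenCl(): Boolean {
    val candidates = listOf(
        "/vendor/lib64/libOpenCL.so",
        "/system/vendor/lib64/libOpenCL.so",
        "/system/lib64/libOpenCL.so",
    )
    return candidates.any { File(it).exists() }
}
```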
Symptom: Qualcomm AI Hub binary won't load on older Snapdragon. The Hexagon HTP runtime requires SoC version-specific binaries. Build separately for Snapdragon 8 Gen 2 / Gen 3 / 8 Elite — Qualcomm AI Hub provides per-SoC artifacts.
Variations and alternatives
ExecuTorch path: PyTorch-native via NNAPI or Vulkan delegate. Pick when your model authoring is PyTorch-first and you want one toolchain end-to-end. See ExecuTorch operational review.
iOS pairing: see iPhone on-device AI stack for the cross-platform-mobile shipping pattern.
Who should avoid this stack
- Cross-platform requirement excluding Web — Flutter / React Native devs may prefer ExecuTorch for simpler integration.
- 7B+ model needs — Android RAM bottlenecks past 4B. Cloud or hybrid is the right answer.
- Continuous-use workloads — thermal throttling will visibly degrade UX.
- Cost-sensitive distribution — model bundle inflates download size; users on metered connections will churn.
Going deeper
- iPhone on-device AI stack — iOS sibling.
- MLC LLM operational review — runtime depth.
- Qualcomm AI Hub operational review.
- Snapdragon 8 Elite hardware page.
- Benchmark opportunity queue — Android tok/s measurements pending.