Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub
App-bundled local LLM inference on Android using MLC LLM (cross-device GPU) or Qualcomm AI Hub (Snapdragon NPU). The two paths, honestly compared. No fake benchmark numbers; battery + thermal reality documented.
What this stack accomplishes
Android on-device AI has two valid paths, not one. The honest answer depends on whether your app needs to run on devices outside the Snapdragon flagship envelope:
- MLC LLM: cross-device, Adreno GPU path. Same toolchain compiles for Android + iOS + Web. Slower per-token than NPU but works on most modern Android phones.
- Qualcomm AI Hub: Snapdragon NPU path. Faster and lower-power on Hexagon NPU; locked to Qualcomm chips. Excludes Pixel (Tensor G4) and MediaTek devices.
Hardware required
Snapdragon 8 Gen 3 / 8 Elite (Hexagon NPU, 45-80 TOPS INT8) for the NPU path · or any modern Adreno-equipped device for the MLC LLM GPU path · Android 13+ for ExecuTorch / MLC LLM compatibility · Android Studio + NDK for build/sign · ~5GB workstation storage for the model + Android Studio caches
Components — what to install and why
- 01 · Hardware · Target SoC (flagship 2024-2025) · snapdragon-8-elite
Snapdragon 8 Elite Hexagon NPU at ~80 TOPS INT8 + Adreno GPU. The 16GB RAM tier enables comfortable 3-4B model headroom. Pair with Qualcomm AI Hub for production NPU-first deployment.
- 02 · Hardware · Mid-tier 2023 flagship (still production-viable) · snapdragon-8-gen-3
Snapdragon 8 Gen 3 Hexagon NPU at 45 TOPS INT8 + 12GB+ RAM. The first widely-shipped Android NPU that runs 7B-class models on-device. Most Snapdragon-based Galaxy S24-generation deployments use this tier (Pixel 8 ships Tensor G3, not Snapdragon).
- 03 · Hardware · Pixel-only path · google-tensor-g4
Tensor G4 ships in Pixel 9. Google's Gemini Nano runs natively. NPU TOPS aren't publicly disclosed — community benchmarks suggest mid-Snapdragon parity. Tensor's path is Pixel-locked.
- 04 · Tool · Cross-device runtime (Adreno GPU path) · mlc-llm
MLC LLM is the cross-platform choice. Same model checkpoint compiles for Adreno GPU + iOS Metal + WebGPU. The right pick when you need Android + iOS shipping from one toolchain. Adreno path doesn't use the Hexagon NPU.
- 05 · Tool · Snapdragon NPU runtime (Hexagon path) · qualcomm-ai-hub
Qualcomm-published quants tuned for Hexagon NPU. The throughput leader on Snapdragon flagship phones — beats MLC LLM Adreno path by ~30-50% per Qualcomm's published numbers. Snapdragon-only; no Tensor G4 / MediaTek support.
- 06 · Tool · PyTorch-native alternative (NNAPI / Vulkan delegate) · executorch
ExecuTorch is PyTorch-first-party. Backend-pluggable: NNAPI (Android), Vulkan (cross-vendor GPU), custom NPU delegates. Pick when your model authoring is PyTorch-native and you don't want a separate compile pipeline.
- 07 · Model · Primary 3.8B model · phi-3.5-mini-instruct
Phi-3.5 Mini at INT4 (~2.3GB) fits comfortably on 12GB+ Android phones. MIT licensed. Microsoft's Phi Silica work runs a closely related Phi configuration on the Snapdragon X Elite NPU (Windows, not Android), a useful sanity check that the family is NPU-friendly.
- 08 · Model · Alternative 3B chat model · llama-3.2-3b-instruct
Llama 3.2 3B at INT4 (~1.9GB) fits with more headroom. Llama Community License permits app-bundling. Quants available on the MLC LLM model zoo. A RAM-based pick between the two models is sketched after this list.
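If you bundle both models, the pick can happen at runtime from total device RAM. A minimal sketch using ActivityManager.MemoryInfo; the threshold and the Llama model ID string are editorial assumptions to match whatever you actually compile:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Editorial heuristic: prefer Phi-3.5 Mini (~2.3GB INT4) on 12GB-class devices,
// fall back to the smaller Llama 3.2 3B (~1.9GB INT4) below that.
fun pickModelId(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo()
    am.getMemoryInfo(memInfo)
    val totalGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    // 12GB devices typically report ~11.x GB usable; the cutoff is a judgment call.
    return if (totalGb >= 11.0) {
        "phi-3.5-mini-instruct-q4f16_1"
    } else {
        "llama-3.2-3b-instruct-q4f16_1"   // hypothetical ID; match your compiled artifact name
    }
}
```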
Step-by-step setup (MLC LLM Android path)
MLC LLM is cross-platform. Same toolchain ships for Android + iOS + Web + desktop. The Android variant:
1. Clone + build the Android template
git clone https://github.com/mlc-ai/mlc-llm
cd mlc-llm/android
# MLC LLM ships an Android Studio template project
# Open in Android Studio:
# File → Open → mlc-llm/android/MLCChat
# Sync Gradle (auto-runs)
2. Compile a model for Android
# On your dev workstation
pip install mlc-llm
mlc_llm convert_weight \
HF://microsoft/Phi-3.5-mini-instruct \
--quantization q4f16_1 \
-o ./dist/phi-3.5-mini-instruct-q4f16_1-MLC
mlc_llm gen_config \
HF://microsoft/Phi-3.5-mini-instruct \
--quantization q4f16_1 \
--conv-template phi-3 \
-o ./dist/phi-3.5-mini-instruct-q4f16_1-MLC
mlc_llm compile \
./dist/phi-3.5-mini-instruct-q4f16_1-MLC/mlc-chat-config.json \
--device android \
-o ./dist/phi-3.5-mini-instruct-q4f16_1-android.tar
3. Bundle + ship
# In Android Studio, add the compiled .tar + weights to:
# app/src/main/assets/
# In MainActivity.kt:
val engine = MLCEngine()
engine.reload("phi-3.5-mini-instruct-q4f16_1")
val response = engine.chatCompletion(prompt = "Summarize this in one sentence: ...")
println("Tokens/sec: ${response.tokensPerSecond}")
println(response.text)
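Model load is the slow step (3-5 sec on a recent flagship), so pre-warm the engine off the main thread at app launch instead of on the first user tap. A minimal coroutine sketch reusing the simplified MLCEngine surface from the snippet above; treat the exact method names as illustrative and follow the MLCChat template for the real API:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch

// MLCEngine here is the same simplified engine object as in the snippet above.
class ModelWarmup(private val engine: MLCEngine) {

    // Call from Application.onCreate() or your first Activity; the 3-5 sec load
    // runs on an IO worker thread while the UI stays responsive.
    fun prewarm(scope: CoroutineScope) {
        scope.launch(Dispatchers.IO) {
            engine.reload("phi-3.5-mini-instruct-q4f16_1")
        }
    }
}
```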
Alternative: Qualcomm AI Hub (Snapdragon NPU path)
# Qualcomm AI Hub workflow:
# 1. Sign up at https://aihub.qualcomm.com
# 2. Browse the model zoo for pre-compiled NPU variants
# (Llama 3.2 3B, Phi-3.5 Mini, Gemma 3 1B, Qwen 2.5 3B)
# 3. Download the .qnn binary + sample app
# 4. Integrate with the QNN HTP runtime in your Android app
# 5. The NPU path runs ~30-50% faster than Adreno GPU on Snapdragon flagships
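The "integrate with the QNN HTP runtime" step is where most of the work lives: the runtime is native code, so the Kotlin side is usually a thin JNI bridge around the bindings the sample app ships. Every name below is hypothetical; it only illustrates the shape of that bridge, not Qualcomm's actual API:

```kotlin
// Hypothetical JNI bridge. The real library and symbol names come from the
// Qualcomm AI Hub sample app / your own native wrapper, not from this sketch.
object QnnLlmBridge {
    init {
        System.loadLibrary("qnn_llm_demo")   // hypothetical .so bundled under jniLibs/
    }

    // Hypothetical native entry points implemented in a C++ wrapper around
    // the QNN HTP runtime.
    external fun loadModel(modelPath: String): Boolean
    external fun generate(prompt: String, maxTokens: Int): String
}
```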
Thermal + battery reality check
Mobile NPU + GPU inference is thermally bounded:
- First 2-3 minutes: peak tok/s. Bursty UX (summarize an article in 30s) wins here.
- 5-10 minutes: throttle 25-40% under sustained load.
- Battery cost: ~5-10% per 10-min active session on Snapdragon 8 Elite (editorial estimate). Measure on your workload.
- Background interruption: Android aggressively kills backgrounded high-CPU apps. Use Foreground Service for long-running summarization tasks.
- Charging: mitigates throttle but adds heat in the other direction. Test plugged + unplugged.
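You can observe the throttle instead of guessing. Android's PowerManager exposes a thermal status callback (API 29+) and a throttling-headroom forecast (API 30+); a minimal sketch that drives a "device warming up" indicator:

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Flags when the device is hot enough that decode speed will visibly drop;
// wire the callback to a "device warming up" indicator in your UI.
class ThermalWatcher(context: Context, private val onThrottle: (Boolean) -> Unit) {
    private val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager

    fun start() {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            pm.addThermalStatusListener { status ->
                // Treating MODERATE and above as "throttling" is an editorial
                // mapping onto the 25-40% slowdown described above.
                onThrottle(status >= PowerManager.THERMAL_STATUS_MODERATE)
            }
        }
    }

    // Optional: forecast headroom 10s ahead (API 30+); values near 1.0 mean
    // throttling is imminent.
    fun headroom(): Float? =
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.R) pm.getThermalHeadroom(10) else null
}
```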
Expected outcome
Ship an Android app that loads a 3-4B model checkpoint at app start (~3-5 sec on Snapdragon 8 Gen 3+) and serves single-stream LLM inference at an editorial-estimated 8-25 tok/s decode (cold). Sustained-load throttling cuts that by 25-40% after 5-10 min. Battery cost: ~5-10% per 10-min active session (editorial estimate). Verify on your specific device before shipping.
MLC LLM vs Qualcomm AI Hub — honest comparison
| Dimension | MLC LLM (Adreno GPU) | Qualcomm AI Hub (Hexagon NPU) |
|---|---|---|
| Cross-device support | Snapdragon + Tensor + MediaTek + iOS + Web | Snapdragon only |
| Throughput on Snapdragon flagship | 10-22 tok/s (editorial estimate) | 12-25 tok/s (editorial estimate) |
| Battery efficiency | Higher draw (Adreno is GPU-class) | Lower draw (NPU is purpose-built) |
| Open source | Yes (Apache 2.0) | Compiled binaries; toolchain closed |
| Quant variety | q4f16, q4f32, q8f16, etc. (TVM-quants) | Vendor-published only |
| Setup complexity | Compile-step required (~10-30 min per model) | Pre-compiled drop-in |
| Model selection | Any Hugging Face model with TVM converter | Qualcomm-curated zoo |
Pick MLC LLM if your app must run on Pixel (Tensor), Exynos Samsungs, MediaTek devices, or anything else outside the Snapdragon line, or if you also ship iOS / Web from the same toolchain.
Pick Qualcomm AI Hub if your app is Snapdragon-only (gaming, niche pro apps) and the battery / thermal envelope matters more than cross-device reach.
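If you ship both paths in one app, the routing can happen at runtime. Build.SOC_MANUFACTURER is available from API 31; the string match below is an assumption to verify on the devices you actually target:

```kotlin
import android.os.Build

// Crude runtime routing: prefer the Qualcomm AI Hub / Hexagon path on Snapdragon
// silicon, fall back to the MLC LLM Adreno/OpenCL path everywhere else.
fun useHexagonPath(): Boolean {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.S) return false
    val manufacturer = Build.SOC_MANUFACTURER.lowercase()
    // Snapdragon devices commonly report "QTI" or "Qualcomm"; verify on your
    // target fleet, this match is an assumption.
    return "qualcomm" in manufacturer || "qti" in manufacturer
}
```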
Failure modes you'll hit
- Cold-start latency feels broken. First model load is 3-5 seconds on Snapdragon 8 Gen 3+. Pre-warm at app launch on a background thread.
- Memory pressure crashes on 8GB phones. 3-4B INT4 model + activations + Android UI on 8GB total = pressure. Test with low-memory devices; consider 1.5B / 3B fallback.
- Backgrounding kills inference. Android suspends apps aggressively. Use a Foreground Service + persistent notification for long-running summarization (a minimal service sketch follows this list).
- Thermal throttling looks like model degradation. Sustained load >5-10 min: 25-40% slowdown. Surface a “device warming up” indicator if your UX needs it.
- NDK / build-tools version conflicts. MLC LLM requires NDK 25+. A mismatched NDK in your project causes obscure linker errors. Pin the NDK version in build.gradle.
- APK size limits. Google Play caps the standard app download at 200MB even with App Bundles; a 3-4B INT4 model is ~2-3GB and requires Play Asset Delivery for distribution.
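For the backgrounding failure mode above, the standard fix is a foreground service. A minimal sketch (assumes minSdk 26+, which this stack's Android 13+ floor already implies); declare it in AndroidManifest.xml, add a foregroundServiceType on Android 14+, and start it via startForegroundService():

```kotlin
import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder

// Keeps a long-running summarization job alive after the user backgrounds the
// app; Android is far less likely to kill a service that is in the foreground
// with a visible notification.
class InferenceService : Service() {

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        startForeground(NOTIFICATION_ID, buildNotification())
        // Hand the prompt to your inference engine on a worker thread here,
        // then call stopSelf() when generation completes.
        return START_NOT_STICKY
    }

    override fun onBind(intent: Intent?): IBinder? = null

    private fun buildNotification(): Notification {
        val channel = NotificationChannel(
            CHANNEL_ID, "On-device inference", NotificationManager.IMPORTANCE_LOW
        )
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
        return Notification.Builder(this, CHANNEL_ID)
            .setContentTitle("Summarizing on-device")
            .setSmallIcon(android.R.drawable.stat_sys_download)
            .build()
    }

    companion object {
        private const val CHANNEL_ID = "inference"
        private const val NOTIFICATION_ID = 1
    }
}
```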
Troubleshooting
Symptom: Adreno GPU not detected. MLC LLM's OpenCL backend requires the device to expose Adreno via standard OpenCL drivers. Some non-Qualcomm Androids ship without working OpenCL. Fall back to ExecuTorch + Vulkan in that case.
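There is no official "is OpenCL available" API, so a common pre-flight heuristic is probing for the vendor driver library before committing to the Adreno path. The paths below are assumptions that vary by OEM; still attempt a real engine load before declaring the device supported:

```kotlin
import java.io.File

// Heuristic only: looks for an OpenCL driver in the usual vendor locations.
// Some OEM builds ship it elsewhere (or ship a broken stub), so treat a hit
// as "worth trying", not as a guarantee.
fun probablyHasOpenCl(): Boolean {
    val candidates = listOf(
        "/vendor/lib64/libOpenCL.so",
        "/system/vendor/lib64/libOpenCL.so",
        "/system/lib64/libOpenCL.so",
    )
    return candidates.any { File(it).exists() }
}
```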
Symptom: Qualcomm AI Hub binary won't load on older Snapdragon. The Hexagon HTP runtime requires SoC version-specific binaries. Build separately for Snapdragon 8 Gen 2 / Gen 3 / 8 Elite — Qualcomm AI Hub provides per-SoC artifacts.
Variations and alternatives
ExecuTorch path: PyTorch-native via NNAPI or Vulkan delegate. Pick when your model authoring is PyTorch-first and you want one toolchain end-to-end. See ExecuTorch operational review.
iOS pairing: see iPhone on-device AI stack for the cross-platform-mobile shipping pattern.
Who should avoid this stack
- Cross-platform requirement excluding Web — Flutter / React Native devs may prefer ExecuTorch for simpler integration.
- 7B+ model needs — Android RAM bottlenecks past 4B. Cloud or hybrid is the right answer.
- Continuous-use workloads — thermal throttling will visibly degrade UX.
- Cost-sensitive distribution — model bundle inflates download size; users on metered connections will churn.
Going deeper
- iPhone on-device AI stack — iOS sibling.
- MLC LLM operational review — runtime depth.
- Qualcomm AI Hub operational review.
- Snapdragon 8 Elite hardware page.
- Benchmark opportunity queue — Android tok/s measurements pending.