Stack · L3 execution · Homelab tier · mobile

Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub

App-bundled local LLM inference on Android using MLC LLM (cross-device GPU) or Qualcomm AI Hub (Snapdragon NPU). The two paths, honestly compared. No fake benchmark numbers; battery + thermal reality documented.

By Fredoline Eruo · Last reviewed 2026-05-07

What this stack accomplishes

The Android on-device AI question has two valid paths, not one. The honest answer depends on whether your app needs to run on devices outside the Snapdragon flagship envelope:

  • MLC LLM: cross-device, Adreno GPU path. Same toolchain compiles for Android + iOS + Web. Slower per-token than NPU but works on most modern Android phones.
  • Qualcomm AI Hub: Snapdragon NPU path. Faster and lower-power on Hexagon NPU; locked to Qualcomm chips. Excludes Pixel (Tensor G4) and MediaTek devices.
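
If a single APK has to serve both kinds of devices, the usual pattern is to pick the path at runtime and fall back to the GPU route on non-Snapdragon hardware. A minimal Kotlin sketch (InferenceBackend and chooseBackend() are hypothetical names, not part of either SDK):

import android.os.Build

// Hypothetical runtime routing between the two paths. Build.SOC_MANUFACTURER needs
// API 31+, which this stack's Android 13+ requirement already guarantees.
enum class InferenceBackend { QUALCOMM_NPU, MLC_GPU }

fun chooseBackend(): InferenceBackend {
    val soc = Build.SOC_MANUFACTURER   // Qualcomm devices typically report "QTI" or "Qualcomm"
    return if (soc.equals("QTI", ignoreCase = true) || soc.contains("Qualcomm", ignoreCase = true))
        InferenceBackend.QUALCOMM_NPU  // Hexagon path: faster, lower power, Snapdragon-only
    else
        InferenceBackend.MLC_GPU       // cross-device GPU path (Adreno via OpenCL)
}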

Hardware required

Snapdragon 8 Gen 3 / 8 Elite (Hexagon NPU, 45-80 TOPS INT8) for the NPU path · or any modern Adreno-equipped device for the MLC LLM GPU path · Android 13+ for ExecuTorch / MLC LLM compatibility · Android Studio + NDK for build/sign · ~5GB workstation storage for the model + Android Studio caches

Components — what to install and why

The stack
  1. Hardware · Target SoC (flagship 2024-2025)
    snapdragon-8-elite

    Snapdragon 8 Elite Hexagon NPU at ~80 TOPS INT8 + Adreno GPU. The 16GB RAM tier enables comfortable 3-4B model headroom. Pair with Qualcomm AI Hub for production NPU-first deployment.

  2. Hardware · Mid-tier 2023 flagship (still production-viable)
    snapdragon-8-gen-3

    Snapdragon 8 Gen 3 Hexagon NPU at 45 TOPS INT8 + 12GB+ RAM. The first widely shipped Android NPU tier that runs 7B-class models on-device. Most 2024 Snapdragon flagships (e.g., Galaxy S24 Ultra) sit in this tier; Pixel 8 ships Google Tensor G3 instead.

  3. Hardware · Pixel-only path
    google-tensor-g4

    Tensor G4 ships in the Pixel 9. Google's Gemini Nano runs natively. NPU TOPS aren't publicly disclosed; community benchmarks suggest mid-Snapdragon parity. Tensor's path is Pixel-locked.

  4. Tool · Cross-device runtime (Adreno GPU path)
    mlc-llm

    MLC LLM is the cross-platform choice. The same model checkpoint compiles for Adreno GPU + iOS Metal + WebGPU. The right pick when you need Android + iOS shipping from one toolchain. The Adreno path doesn't use the Hexagon NPU.

  5. Tool · Snapdragon NPU runtime (Hexagon path)
    qualcomm-ai-hub

    Qualcomm-published quants tuned for the Hexagon NPU. The throughput leader on Snapdragon flagship phones; beats the MLC LLM Adreno path by ~30-50% per Qualcomm's published numbers. Snapdragon-only; no Tensor G4 / MediaTek support.

  6. Tool · PyTorch-native alternative (NNAPI / Vulkan delegate)
    executorch

    ExecuTorch is PyTorch first-party. Backend-pluggable: NNAPI (Android), Vulkan (cross-vendor GPU), custom NPU delegates. Pick it when your model authoring is PyTorch-native and you don't want a separate compile pipeline.

  7. Model · Primary 3.8B model
    phi-3.5-mini-instruct

    Phi-3.5 Mini at INT4 (~2.3GB) fits comfortably on 12GB+ Android phones. MIT licensed. Microsoft's Phi Silica (a Phi-3.5-mini derivative) demonstrates this model class on the Snapdragon X Elite NPU.

  8. Model · Alternative 3B chat model
    llama-3.2-3b-instruct

    Llama 3.2 3B at INT4 (~1.9GB) fits with more headroom. The Llama Community License permits app-bundling. Quants are available in the MLC LLM model zoo.

Step-by-step setup (MLC LLM Android path)

MLC LLM is cross-platform. Same toolchain ships for Android + iOS + Web + desktop. The Android variant:

1. Clone + build the Android template

git clone https://github.com/mlc-ai/mlc-llm
cd mlc-llm/android

# MLC LLM ships an Android Studio template project
# Open in Android Studio:
# File → Open → mlc-llm/android/MLCChat
# Sync Gradle (auto-runs)

2. Compile a model for Android

# On your dev workstation
pip install mlc-llm
mlc_llm convert_weight \
  HF://microsoft/Phi-3.5-mini-instruct \
  --quantization q4f16_1 \
  -o ./dist/phi-3.5-mini-instruct-q4f16_1-MLC

mlc_llm gen_config \
  HF://microsoft/Phi-3.5-mini-instruct \
  --quantization q4f16_1 \
  --conv-template phi-3 \
  -o ./dist/phi-3.5-mini-instruct-q4f16_1-MLC

mlc_llm compile \
  ./dist/phi-3.5-mini-instruct-q4f16_1-MLC/mlc-chat-config.json \
  --device android \
  -o ./dist/phi-3.5-mini-instruct-q4f16_1-android.tar

3. Bundle + ship

# In Android Studio, add the compiled .tar + weights to:
#   app/src/main/assets/

# In MainActivity.kt (schematic; the shipping MLCChat sample wraps an
# OpenAI-style streaming API, so treat the method and field names below as illustrative):
val engine = MLCEngine()
engine.reload("phi-3.5-mini-instruct-q4f16_1")
val response = engine.chatCompletion(prompt = "Summarize this in one sentence: ...")
println("Tokens/sec: ${response.tokensPerSecond}")
println(response.text)

Alternative: Qualcomm AI Hub (Snapdragon NPU path)

# Qualcomm AI Hub workflow:
# 1. Sign up at https://aihub.qualcomm.com
# 2. Browse the model zoo for pre-compiled NPU variants
#    (Llama 3.2 3B, Phi-3.5 Mini, Gemma 3 1B, Qwen 2.5 3B)
# 3. Download the .qnn binary + sample app
# 4. Integrate with the QNN HTP runtime in your Android app
# 5. The NPU path runs ~30-50% faster than Adreno GPU on Snapdragon flagships
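
Whichever artifact you download, files bundled under assets/ usually need to be materialized to a real file path before a native runtime can load them. A small Kotlin helper (the function name is hypothetical; it copies the asset to app-private storage once):

import android.content.Context
import java.io.File

// Hypothetical helper: native runtimes generally want a filesystem path, not an
// Android asset stream, so copy the bundled artifact into filesDir on first use.
fun ensureAssetOnDisk(context: Context, assetName: String): File {
    val out = File(context.filesDir, assetName)
    if (!out.exists()) {
        context.assets.open(assetName).use { input ->
            out.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return out
}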

Thermal + battery reality check

Mobile NPU + GPU inference is thermally bounded:

  • First 2-3 minutes: peak tok/s. Bursty UX (summarize an article in 30s) wins here.
  • 5-10 minutes: throttle 25-40% under sustained load.
  • Battery cost: ~5-10% per 10-min active session on Snapdragon 8 Elite (editorial estimate). Measure on your workload.
  • Background interruption: Android aggressively kills backgrounded high-CPU apps. Use Foreground Service for long-running summarization tasks.
  • Charging: mitigates throttle but adds heat in the other direction. Test plugged + unplugged.
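
The Foreground Service point above is the one that bites first in testing. A minimal sketch of that pattern (service name, channel id, and notification text are placeholders; on Android 14+ also declare android:foregroundServiceType for the service in the manifest):

import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder

// Hypothetical InferenceService: keeps a long summarization job alive when the user
// backgrounds the app, instead of letting Android suspend or kill the process.
class InferenceService : Service() {

    override fun onCreate() {
        super.onCreate()
        val channel = NotificationChannel(
            CHANNEL_ID, "On-device inference", NotificationManager.IMPORTANCE_LOW
        )
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
    }

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        val notification = Notification.Builder(this, CHANNEL_ID)
            .setContentTitle("Summarizing on-device")
            .setSmallIcon(android.R.drawable.stat_notify_sync)
            .build()
        startForeground(NOTIFICATION_ID, notification)
        // Run the long inference job off the main thread, then stopSelf() when it completes.
        return START_NOT_STICKY
    }

    override fun onBind(intent: Intent?): IBinder? = null

    companion object {
        private const val CHANNEL_ID = "inference"
        private const val NOTIFICATION_ID = 42
    }
}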

Expected outcome

Ship an Android app that loads a 3-4B model checkpoint at app start (~3-5 sec on Snapdragon 8 Gen 3+), serves single-stream LLM inference at editorial-estimated 8-25 tok/s decode (cold). Sustained-load throttle drops 25-40% after 5-10 min. Battery cost: ~5-10% per 10-min active session (editorial estimate). Verify on your specific device before shipping.
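
Cold start dominates perceived latency, so it is worth overlapping that 3-5 second load with onboarding UI instead of the user's first prompt (failure mode 1 below). A minimal pre-warm sketch; LlmWarmup is a hypothetical helper, and you pass in whatever load call your chosen runtime uses:

import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

// Hypothetical pre-warm helper: run the expensive model load once, on a background
// dispatcher, as early as Application.onCreate().
object LlmWarmup {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    private val started = AtomicBoolean(false)

    @Volatile
    var ready = false
        private set

    fun preWarm(loadModel: () -> Unit) {
        if (!started.compareAndSet(false, true)) return
        scope.launch {
            loadModel()      // e.g. the schematic MLCEngine().reload(...) from the bundling step
            ready = true
        }
    }
}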

MLC LLM vs Qualcomm AI Hub — honest comparison

Dimension | MLC LLM (Adreno GPU) | Qualcomm AI Hub (Hexagon NPU)
Cross-device support | Snapdragon + Tensor + MediaTek + iOS + Web | Snapdragon only
Throughput on Snapdragon flagship | 10-22 tok/s (editorial estimate) | 12-25 tok/s (editorial estimate)
Battery efficiency | Higher draw (Adreno is GPU-class) | Lower draw (NPU is purpose-built)
Open source | Yes (Apache 2.0) | Compiled binaries; toolchain closed
Quant variety | q4f16, q4f32, q8f16, etc. (TVM quants) | Vendor-published only
Setup complexity | Compile step required (~10-30 min per model) | Pre-compiled drop-in
Model selection | Any Hugging Face model with a TVM converter | Qualcomm-curated zoo

Pick MLC LLM if your app must run on Pixel (Google Tensor), Exynos or MediaTek devices, or anything else outside Snapdragon. Or if you also ship iOS / Web from the same toolchain.

Pick Qualcomm AI Hub if your app can be Snapdragon-only (gaming, niche pro apps) and the battery / thermal envelope matters more than cross-device reach.

Failure modes you'll hit

  1. Cold-start latency feels broken. First model load is 3-5 seconds on Snapdragon 8 Gen 3+. Pre-warm at app launch on a background thread.
  2. Memory pressure crashes on 8GB phones. 3-4B INT4 model + activations + Android UI on 8GB total = pressure. Test with low-memory devices; consider 1.5B / 3B fallback.
  3. Backgrounding kills inference. Android suspends apps aggressively. Foreground Service + persistent notification for long-running summarization.
  4. Thermal throttling looks like model degradation. Sustained load >5-10 min: 25-40% slowdown. Surface a “device warming up” indicator if your UX needs it.
  5. NDK / build-tools version conflicts. MLC LLM requires NDK 25+. Mismatched NDK in your project causes obscure linker errors. Pin in build.gradle.
  6. APK size limits. Google Play's base download limits sit far below a 3-4B INT4 model (~2-3GB), so the weights can't ride in the base APK. Ship them via an App Bundle with Play Asset Delivery, or download them on first run.
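
Failure mode 5 says to pin the NDK in build.gradle. A minimal Kotlin-DSL fragment of the android block (the NDK build number is illustrative; use whichever NDK 25+ release your MLC LLM checkout documents):

// app/build.gradle.kts (fragment)
android {
    ndkVersion = "25.2.9519653"            // pin so every machine links against the same NDK

    defaultConfig {
        ndk {
            abiFilters.add("arm64-v8a")    // skip 32-bit ABIs; the model libs ship arm64-only
        }
    }
}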

Troubleshooting

Symptom: Adreno GPU not detected. MLC LLM's OpenCL backend requires the device to expose Adreno via standard OpenCL drivers. Some non-Qualcomm Androids ship without working OpenCL. Fall back to ExecuTorch + Vulkan in that case.

Symptom: Qualcomm AI Hub binary won't load on older Snapdragon. The Hexagon HTP runtime requires SoC version-specific binaries. Build separately for Snapdragon 8 Gen 2 / Gen 3 / 8 Elite — Qualcomm AI Hub provides per-SoC artifacts.
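
If you bundle more than one per-SoC artifact, selection can key off the SoC part number at runtime. A Kotlin sketch (the asset file names are placeholders for whatever Qualcomm AI Hub exported; Build.SOC_MODEL typically reports the part number, e.g. "SM8550", but verify on your target devices):

import android.os.Build

// Hypothetical per-SoC artifact picker for the Hexagon path.
fun qnnAssetForThisDevice(): String? = when (Build.SOC_MODEL.uppercase()) {
    "SM8550" -> "llama32_3b_sm8550.bin"   // Snapdragon 8 Gen 2
    "SM8650" -> "llama32_3b_sm8650.bin"   // Snapdragon 8 Gen 3
    "SM8750" -> "llama32_3b_sm8750.bin"   // Snapdragon 8 Elite
    else -> null                          // unknown SoC: fall back to the MLC / GPU path
}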

Variations and alternatives

ExecuTorch path: PyTorch-native via NNAPI or Vulkan delegate. Pick when your model authoring is PyTorch-first and you want one toolchain end-to-end. See ExecuTorch operational review.

iOS pairing: see iPhone on-device AI stack for the cross-platform-mobile shipping pattern.

Who should avoid this stack

  • Cross-platform requirement excluding Web — Flutter / React Native devs may prefer ExecuTorch for simpler integration.
  • 7B+ model needs — Android RAM bottlenecks past 4B. Cloud or hybrid is the right answer.
  • Continuous-use workloads — thermal throttling will visibly degrade UX.
  • Cost-sensitive distribution — model bundle inflates download size; users on metered connections will churn.

Going deeper