Capability notes
On-device Android AI in 2026 operates across three fragmented lanes: **Gemini Nano** (Google AICore system service), **Samsung Galaxy AI** (One UI features), and **Qualcomm AI Engine** (Snapdragon NPU). The fragmentation has operational consequences.
**Gemini Nano** ships via AICore, a system service that loads 1.8B-3.25B-parameter models on the GPU or NPU. It powers on-device summarization, smart reply, and proofreading on Pixel 8+ and Galaxy S24+. AICore manages model download (1-4GB) and updates through Play Services — operators do not manage the model lifecycle. Availability depends on hardware (Tensor G3/G4 minimum) and OEM integration lag (Samsung ships Nano features 2-6 months behind Pixel). AICore has been backported to [snapdragon-8-gen-3](/hardware/snapdragon-8-gen-3) devices with 8GB+ RAM via GPU fallback; NPU acceleration remains Tensor-only for Nano.
**Samsung Galaxy AI** layers proprietary on-device models on top of Gemini Nano. Live Translate, Chat Assist, and Note Assist run locally on S24/S25 with 12GB+ RAM. These models are locked to One UI with no third-party app access. Updates arrive with One UI version bumps (quarterly-to-semi-annual). Knox MDM can disable specific features but cannot load custom models into Galaxy AI.
**Qualcomm AI Engine Direct SDK** provides NPU access for third-party apps. The Hexagon NPU on [snapdragon-8-elite](/hardware/snapdragon-8-elite) delivers 45+ INT4 TOPS. Qualcomm AI Hub distributes pre-optimized QNN-format models (Llama, Stable Diffusion, Whisper). Mixed-precision FP16/INT8/INT4 yields 25-40% better battery efficiency than GPU-only inference.
**Open-weight inference** via [llama.cpp](/tools/llama-cpp) Android runs through Termux or NDK bindings. Performance is GPU-bound on Adreno at ~60-70% of equivalent Apple Metal paths. Ceiling: 7B Q4 at 4K on 12GB devices; 8GB restricted to 3B class. The critical constraint: Android's memory pressure kills background processes aggressively, and OEM fragmentation means kill policies differ across Samsung, Xiaomi, and Pixel.
If you just want to try this
Lowest-friction path to a working setup.
On a Pixel 9 or Samsung Galaxy S25 with 12GB+ RAM, Gemini Nano is pre-installed. For running your own open-weight models, the simplest path uses **MLC Chat** from the Google Play Store.
Path 1: Use what's built in. Open Recorder on a Pixel 8+, record a conversation, tap "Summarize." This runs entirely on-device via Gemini Nano through AICore — no network, no setup. Smart Reply in Gboard also uses Nano on-device.
Path 2: Install MLC Chat from Google Play. The app downloads quantized models compiled for the Adreno GPU (Apache TVM + Vulkan). Download **Qwen-3-8B (Q4_K_M)** at 4.2GB or **Llama-3.2-3B (Q4_K_M)** at 1.9GB. The 3B loads in 6-8 seconds on [Snapdragon 8 Gen 3](/hardware/snapdragon-8-gen-3), runs at 25-35 tok/s with 12GB RAM, and fits with comfortable headroom. The 8B model runs at 12-18 tok/s — usable but tight on 8GB devices.
Path 3: For Snapdragon-only devices, Qualcomm AI Hub provides pre-compiled QNN models optimized for the Hexagon NPU. These run 30-50% faster than GGUF GPU inference on the same device. The tradeoff: QNN is Snapdragon-only; MediaTek and Exynos devices are unsupported.
What you get: on-device chat, summarization, and writing assistance. The experience is similar to iPhone on-device AI but fragmented — Nano behavior differs between a Pixel 9 and a Galaxy S25, and even devices sharing a SoC diverge because of OEM thermal policies. Expect 15-25 tok/s on 7B Q4 on [Snapdragon 8 Elite](/hardware/snapdragon-8-elite), 10-15 tok/s on [Tensor G4](/hardware/google-tensor-g4).
For production deployment
Operator-grade recommendation.
Deploying on-device AI to an Android fleet requires planning for OEM fragmentation, MDM policy variation, and background process reliability.
**Target device selection.** Minimum: [snapdragon-8-gen-3](/hardware/snapdragon-8-gen-3) or [google-tensor-g4](/hardware/google-tensor-g4) with 12GB RAM. 8GB devices should be treated as 3B-model-only. Samsung S24/S25 and Pixel 9 are the only families with consistent AICore + NPU maturity. MediaTek Dimensity and Exynos have spotty NPU driver quality — Qualcomm's NPU SDK has 3-4x the deployment volume. For enterprise fleets, standardize on one OEM and SoC generation.
**Gemini Nano via AI Edge SDK.** Lowest-friction path for features that do not need custom models. AICore handles model lifecycle. Your app calls the Edge SDK API; AICore dispatches to Tensor TPU or Qualcomm GPU fallback. Gated by Play Services 24.30+. Provides text generation, summarization, and embeddings. No custom fine-tuning.
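Roughly, the app-side call looks like the sketch below — a minimal summarization helper assuming the experimental AI Edge SDK surface (`GenerativeModel`, `generationConfig`, `generateContent` in `com.google.ai.edge.aicore`). Exact package, builder fields, and defaults shift between experimental releases, so verify against the SDK version you pin.

```kotlin
import android.content.Context
import com.google.ai.edge.aicore.GenerativeModel
import com.google.ai.edge.aicore.generationConfig

// Minimal on-device summarization helper. AICore resolves the model and dispatches
// to the Tensor TPU or GPU fallback; no network call is made at inference time.
suspend fun summarizeOnDevice(appContext: Context, notes: String): String? {
    val model = GenerativeModel(
        generationConfig = generationConfig {
            context = appContext        // required: binds the session to AICore
            temperature = 0.2f          // low temperature suits summarization
            topK = 16
            maxOutputTokens = 256
        }
    )
    return try {
        model.generateContent("Summarize the following notes:\n$notes").text
    } finally {
        model.close()                   // release the AICore session when done
    }
}
```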
**Custom model deployment.** Ship a GGUF model for offline availability (bloats APK by 2-5GB) or download on first launch. For MDM fleets, download-on-launch with a pinned model URL + checksum via Managed Configuration is standard. Use Android Management API or Knox to push config keys: model_url, model_checksum, model_version.
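On the device side, reading that pin goes through the standard Android Enterprise managed-configuration API. A minimal sketch, assuming the key names above (`model_url`, `model_checksum`, `model_version`) are what your MDM pushes:

```kotlin
import android.content.Context
import android.content.RestrictionsManager

data class ModelPin(val url: String, val sha256: String, val version: String)

// Reads the managed configuration bundle pushed via Android Management API or Knox.
// Returns null until the MDM has delivered a complete pin.
fun readModelPin(context: Context): ModelPin? {
    val rm = context.getSystemService(Context.RESTRICTIONS_SERVICE) as RestrictionsManager
    val restrictions = rm.applicationRestrictions
    return ModelPin(
        url = restrictions.getString("model_url") ?: return null,
        sha256 = restrictions.getString("model_checksum") ?: return null,
        version = restrictions.getString("model_version") ?: return null
    )
}
```

Listen for `Intent.ACTION_APPLICATION_RESTRICTIONS_CHANGED` to pick up a new pin without requiring an app restart.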
**Background process reliability by OEM:**
- Pixel: Most permissive. Foreground service with notification keeps the model in memory for minutes (see the service sketch after this list).
- Samsung: Kills foreground services within 60-120 seconds of screen-off. Must whitelist app in Battery → Unrestricted via Knox policy.
- Xiaomi: Most aggressive. Kills within 30 seconds, sometimes ignores foreground service. For Xiaomi fleets, design apps to reload models on each foreground entry.
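The pattern the Pixel and Samsung rows rely on is a plain foreground service that holds the loaded model behind a persistent notification. A minimal sketch — `InferenceEngine` is a hypothetical stand-in for whatever wraps your llama.cpp or QNN runtime, and the service (plus a `foregroundServiceType` on Android 14+) must be declared in the manifest:

```kotlin
import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder
import java.io.File

// Hypothetical stand-in for your llama.cpp / QNN wrapper.
class InferenceEngine private constructor() : AutoCloseable {
    companion object { fun load(modelFile: File): InferenceEngine = InferenceEngine() }
    override fun close() {}
}

class InferenceService : Service() {
    private var engine: InferenceEngine? = null

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        val channel = NotificationChannel(
            "inference", "On-device AI", NotificationManager.IMPORTANCE_LOW)
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
        val notification = Notification.Builder(this, "inference")
            .setContentTitle("Assistant running on-device")
            .setSmallIcon(android.R.drawable.stat_notify_sync)
            .build()
        startForeground(1, notification)   // deprioritizes the process for the OOM killer
        if (engine == null) {
            engine = InferenceEngine.load(File(filesDir, "model.gguf"))
        }
        return START_STICKY
    }

    override fun onDestroy() {
        engine?.close()   // Xiaomi may kill the process without calling this; persist state early
        engine = null
        super.onDestroy()
    }

    override fun onBind(intent: Intent?): IBinder? = null
}
```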
**Offline guarantees.** Inference works offline indefinitely. Model updates and license checks need periodic connectivity — design for 30-day offline grace periods with MDM-pinned model versions.
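One way to enforce that window on-device, sketched below: record the last successful check-in and keep serving local inference until the 30-day grace period lapses. The preference and key names are illustrative, not from any SDK.

```kotlin
import android.content.Context
import java.util.concurrent.TimeUnit

private const val GRACE_DAYS = 30L
private const val PREFS = "model_meta"
private const val KEY_LAST_CHECK_IN = "last_check_in_ms"

// Call after each successful model-update or license check against your backend.
fun recordCheckIn(context: Context) {
    context.getSharedPreferences(PREFS, Context.MODE_PRIVATE)
        .edit().putLong(KEY_LAST_CHECK_IN, System.currentTimeMillis()).apply()
}

// True while the MDM-pinned model may keep running without connectivity.
fun withinOfflineGrace(context: Context): Boolean {
    val last = context.getSharedPreferences(PREFS, Context.MODE_PRIVATE)
        .getLong(KEY_LAST_CHECK_IN, 0L)
    return System.currentTimeMillis() - last <= TimeUnit.DAYS.toMillis(GRACE_DAYS)
}
```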
What breaks
Failure modes operators see in the wild.
- **OEM model version lag.** Samsung ships Galaxy AI models in firmware bundles updated with One UI (quarterly-to-semi-annual). Pixel gets Gemini Nano updates through Play Services within days. Symptom: Samsung fleet runs Nano versions 2-6 months behind, causing inconsistent behavior across mixed-OEM deployments. Mitigation: pin to AI Edge SDK rather than Samsung proprietary APIs. When Samsung-only features are required, test against the specific firmware build.
- **NPU driver bugs on MediaTek/Exynos.** Qualcomm's NPU has 3-4x the deployment volume. Driver bugs on non-Qualcomm SoCs cause silent accuracy degradation (INT8 quantization differences producing different token output) or crashes on specific architectures. Symptom: same GGUF model produces coherent output on Snapdragon but garbled text on Dimensity. Mitigation: standardize on Snapdragon for custom model inference. Qualify each model architecture on the specific chip + firmware revision if non-Qualcomm is required.
- **Memory pressure kills AI processes.** Android's low-memory killer targets the largest memory consumer — usually the inference process holding 4.5GB. Symptom: user opens camera during inference; Android kills the AI process; conversation lost. Mitigation: use a foreground service with a persistent notification. Serialize the KV cache to disk for fast recovery. Accept that background execution cannot be guaranteed — design for foreground-only inference.
- **Thermal ceiling below iPhone.** Snapdragon 8 Gen 3/Elite throttles the NPU after 4-6 minutes vs 8-12 minutes on A18 Pro. Samsung throttles at 42°C external; Pixel at 44°C. Symptom: 7B inference drops from 18 tok/s to 8 tok/s after 5 minutes on a Galaxy S25. Mitigation: cap bursts under 3 minutes and idle 60 seconds between them (see the duty-cycle sketch after this list). Design server-side fallback for sustained loops.
- **Background service fragmentation by OEM.** Xiaomi kills processes within 30 seconds even with foreground notification. Samsung allows 60-120 seconds. Pixel permits minutes. Symptom: app works on Pixel test device but dies on fleet Xiaomi devices. Mitigation: test on every OEM in your fleet. For Xiaomi/OPPO/Vivo, assume model reload on every foreground entry.
- **Google Play Services dependency.** AICore requires Play Services 24.30+, distributed through Play Store. De-Googled devices (Huawei, AOSP forks) cannot use Gemini Nano. Mitigation: fall back to llama.cpp or Qualcomm AI Engine Direct SDK.
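The duty-cycle mitigation from the thermal bullet above can be as simple as the sketch below: cap each burst at roughly 3 minutes and force a cooldown before the next one. The thresholds are the ones quoted above, not values read from any vendor SDK, and cancellation only takes effect if the inference loop suspends between tokens.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.withTimeoutOrNull

// Wraps inference bursts so sustained loops cannot push the SoC into throttling.
class ThermalBudget(
    private val maxBurstMs: Long = 3 * 60_000L,   // stay under the 4-6 minute throttle point
    private val cooldownMs: Long = 60_000L        // idle time between bursts
) {
    private var lastBurstEndMs = 0L

    // Returns null when the burst hit the cap — callers should fall back to server-side.
    suspend fun <T> runBurst(block: suspend () -> T): T? {
        val sinceLast = System.currentTimeMillis() - lastBurstEndMs
        if (sinceLast < cooldownMs) delay(cooldownMs - sinceLast)
        return try {
            withTimeoutOrNull(maxBurstMs) { block() }
        } finally {
            lastBurstEndMs = System.currentTimeMillis()
        }
    }
}
```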
Hardware guidance
**Hobbyist: Pixel 9 (12GB, Tensor G4)**
Best Gemini Nano experience. AICore pre-installed; Recorder summarization and Smart Reply work out of the box. For third-party models, llama.cpp via Termux runs 3B Q4 at 20-25 tok/s, 7B Q4 at 10-14 tok/s. The Tensor TPU is locked to Google workloads, so third-party inference falls back to the GPU. Pixel's permissive foreground service policy makes it the best dev device.
**Hobbyist: Galaxy S25 (12GB, Snapdragon 8 Elite)**
Best Qualcomm NPU performance — 45+ TOPS INT4, QNN-optimized models run 30-50% faster than GPU-only. Galaxy AI features are on-device and functional. Tradeoff: One UI kills processes aggressively — manual battery whitelisting required.
**SMB: Standardized S24/S25 fleet (12GB)**
Single-OEM fleet via Knox MDM. Deploy custom app via Android Enterprise with Managed Configuration for model pinning. Whitelist app in Device Care → Battery via Knox policy. Cost: ~$799-999/device + Knox ($2-5/device/month). Standardize on one generation — mixing S24 and S25 doubles testing surface.
**Enterprise: Custom app + MDM + private model**
Build a native app wrapping [llama.cpp](/tools/llama-cpp) JNI bindings. Distribute via Managed Google Play. Use Android Management API to push model config and version pins. The hardest problem is OEM fragmentation — a deployment across Samsung + Pixel + Xiaomi requires 3x QA. If fleet homogeneity is achievable, Android on-device AI is operationally viable. If procurement cannot guarantee it, on-device Android AI is not yet enterprise-ready.
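For the download-on-first-launch path, verifying the MDM-pinned checksum before loading is the non-negotiable step. A minimal sketch using only stock `java.net`/`java.security` (the URL and checksum come from the Managed Configuration keys described earlier; a production build would add resume and retry handling):

```kotlin
import java.io.File
import java.net.URL
import java.security.MessageDigest

// Streams the pinned model to disk while hashing it; never load a file that fails the check.
fun downloadAndVerify(modelUrl: String, expectedSha256: String, dest: File): Boolean {
    val digest = MessageDigest.getInstance("SHA-256")
    URL(modelUrl).openStream().use { input ->
        dest.outputStream().use { output ->
            val buffer = ByteArray(1 shl 20)   // 1 MiB chunks for multi-GB models
            while (true) {
                val read = input.read(buffer)
                if (read < 0) break
                digest.update(buffer, 0, read)
                output.write(buffer, 0, read)
            }
        }
    }
    val actual = digest.digest().joinToString("") { "%02x".format(it.toInt() and 0xff) }
    if (!actual.equals(expectedSha256, ignoreCase = true)) {
        dest.delete()
        return false
    }
    return true
}
```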
**Frontier: Not applicable**
On-device Android cannot serve multi-user workloads, cannot train, and is capped at 7B on 12GB devices. For Android-powered edge servers, [Jetson Orin](/hardware/nvidia-dgx-spark) running Linux is preferred — the Android stack adds OS overhead and background-kill risk with no inference benefit.
Runtime guidance
**Using Gemini Nano built-in features (summarization, smart reply)? → Google AI Edge SDK + AICore**
Zero inference management. AICore handles model lifecycle. Your app calls the Edge SDK; system dispatches to Tensor TPU or Qualcomm GPU. Supports text generation, summarization, embeddings. Model quality and updates are Google-controlled. No fine-tuning. Requires Play Services 24.30+ on Pixel 8+ or Galaxy S24+.
**Deploying a custom model to Snapdragon-only fleet? → Qualcomm AI Engine Direct + QNN**
If the fleet is 100% Snapdragon, QNN provides the best performance. Hexagon NPU runs pre-optimized models at 30-50% higher throughput vs GPU-only with 25-40% lower battery drain. FP16 for attention, INT4/INT8 for feed-forward. Model catalog: Llama, Stable Diffusion, Whisper. Snapdragon-only — breaks on MediaTek, Exynos, or Tensor devices.
**Deploying a custom model across multiple SoCs/OEMs? → llama.cpp Android**
[llama.cpp](/tools/llama-cpp) NDK bindings provide the broadest hardware compatibility. GGUF models run on Adreno GPU, Mali GPU, and CPU — one APK covers Snapdragon, Tensor, MediaTek, Exynos. Tradeoff: 30-50% below QNN on Snapdragon, 10-15% below Metal on Apple Silicon. Benefit: GGUF is universal — thousands of pre-quantized models, zero toolchain setup. The practical path for any deployment that is not 100% Snapdragon.
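The binding layer is typically a thin Kotlin object over JNI. In the sketch below the library name (`llama-android`) and the three native functions are hypothetical — they show the shape of the bridge, not symbols shipped by llama.cpp or any particular binding:

```kotlin
import java.io.File

// Thin JNI bridge; each external fun maps to a C function that wraps llama.cpp calls.
object LlamaBridge {
    init {
        System.loadLibrary("llama-android")   // hypothetical .so built from llama.cpp via the NDK
    }

    external fun loadModel(path: String, nCtx: Int): Long          // returns an opaque handle
    external fun generate(handle: Long, prompt: String, maxTokens: Int): String
    external fun freeModel(handle: Long)
}

fun runOnce(modelFile: File, prompt: String): String {
    val handle = LlamaBridge.loadModel(modelFile.absolutePath, 4096)
    return try {
        LlamaBridge.generate(handle, prompt, 128)
    } finally {
        LlamaBridge.freeModel(handle)   // free native memory promptly; RAM headroom is tight
    }
}
```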
**Comparison:**

| | QNN (Hexagon NPU) | llama.cpp | Gemini Nano |
|---|---|---|---|
| Performance (7B Q4) | 25-35 tok/s (8 Elite) | 15-20 tok/s (Vulkan) | N/A |
| SoC compatibility | Snapdragon-only | All SoCs | Tensor + select Snapdragon |
| Model catalog | Limited (3 main architectures) | 40+ architectures | Google-curated |
| Model lifecycle | Manual | Manual | Automatic (Play Services) |
| Battery efficiency | Best (NPU) | Good (GPU) | Excellent |
**Shipping a Play Store chat app?** MLC-LLM (Apache TVM + Vulkan) and [llama.cpp](/tools/llama-cpp) (NDK) are the viable frameworks. Invest in reload-time UX — snappy model loading matters more than raw token rate because Android kills inference processes frequently.