Mobile & Edge
android on-device ai

Android AI

On-device AI on Android. Google Gemini Nano, Samsung Galaxy AI, OEM-specific NPU acceleration.

Capability notes

On-device Android AI in 2026 operates across three fragmented lanes: **Gemini Nano** (Google AICore system service), **Samsung Galaxy AI** (One UI features), and **Qualcomm AI Engine** (Snapdragon NPU). The fragmentation has operational consequences. **Gemini Nano** ships via AICore — a Play Services daemon that loads 1.8B-3.25B parameter models on the GPU/NPU. It powers on-device summarization, smart reply, and proofreading on Pixel 8+ and Galaxy S24+. AICore manages model download (1-4GB) and updates through Play Services — operators do not manage the model lifecycle. Availability depends on hardware (Tensor G3/G4 minimum) and OEM integration lag (Samsung ships Nano features 2-6 months behind Pixel). AICore has been backported to [snapdragon-8-gen-3](/hardware/snapdragon-8-gen-3) devices with 8GB+ RAM via GPU fallback; NPU acceleration remains Tensor-only for Nano. **Samsung Galaxy AI** layers proprietary on-device models on top of Gemini Nano. Live Translate, Chat Assist, and Note Assist run locally on S24/S25 with 12GB+ RAM. These models are locked to One UI with no third-party app access. Updates arrive with One UI version bumps (quarterly-to-semi-annual). Knox MDM can disable specific features but cannot load custom models into Galaxy AI. **Qualcomm AI Engine Direct SDK** provides NPU access for third-party apps. The Hexagon NPU on [snapdragon-8-elite](/hardware/snapdragon-8-elite) delivers 45+ INT4 TOPS. Qualcomm AI Hub distributes pre-optimized QNN-format models (Llama, Stable Diffusion, Whisper). Mixed-precision FP16/INT8/INT4 yields 25-40% better battery efficiency than GPU-only inference. **Open-weight inference** via [llama.cpp](/tools/llama-cpp) Android runs through Termux or NDK bindings. Performance is GPU-bound on Adreno at ~60-70% of equivalent Apple Metal paths. Ceiling: 7B Q4 at 4K on 12GB devices; 8GB restricted to 3B class. The critical constraint: Android's memory pressure kills background processes aggressively, and OEM fragmentation means kill policies differ across Samsung, Xiaomi, and Pixel.

If you just want to try this

Lowest-friction path to a working setup.

On a Pixel 9 or Samsung Galaxy S25 with 12GB+ RAM, Gemini Nano is pre-installed. For running your own open-weight models, the simplest path uses **MLC Chat** from the Google Play Store. Path 1: Use what's built in. Open Recorder on a Pixel 8+, record a conversation, tap "Summarize." This runs entirely on-device via Gemini Nano through AICore — no network, no setup. Smart Reply in Gboard also uses Nano on-device. Path 2: Install MLC Chat from Google Play. The app ships GGUF models optimized for Adreno GPU. Download **Qwen-3-8B (Q4_K_M)** at 4.2GB or **Llama-3.2-3B (Q4_K_M)** at 1.9GB. The 3B loads in 6-8 seconds on [Snapdragon 8 Gen 3](/hardware/snapdragon-8-gen-3), runs at 25-35 tok/s with 12GB RAM, and fits with comfortable headroom. The 8B model runs at 12-18 tok/s — usable but tight on 8GB devices. Path 3: For Snapdragon-only devices, Qualcomm AI Hub provides pre-compiled QNN models optimized for the Hexagon NPU. These run 30-50% faster than GGUF GPU inference on the same device. The tradeoff: QNN is Snapdragon-only; MediaTek and Exynos devices are unsupported. What you get: on-device chat, summarization, and writing assistance. The experience is similar to iPhone on-device AI but fragmented — a Pixel 9's Nano performance differs from a Galaxy S25's even with the same SoC due to OEM thermal policies. Expect 15-25 tok/s on 7B Q4 on [Snapdragon 8 Elite](/hardware/snapdragon-8-elite), 10-15 tok/s on [Tensor G4](/hardware/google-tensor-g4).

For production deployment

Operator-grade recommendation.

Deploying on-device AI to an Android fleet requires planning for OEM fragmentation, MDM policy variation, and background process reliability. **Target device selection.** Minimum: [snapdragon-8-gen-3](/hardware/snapdragon-8-gen-3) or [google-tensor-g4](/hardware/google-tensor-g4) with 12GB RAM. 8GB devices should be treated as 3B-model-only. Samsung S24/S25 and Pixel 9 are the only families with consistent AICore + NPU maturity. MediaTek Dimensity and Exynos have spotty NPU driver quality — Qualcomm's NPU SDK has 3-4x the deployment volume. For enterprise fleets, standardize on one OEM and SoC generation. **Gemini Nano via AI Edge SDK.** Lowest-friction path for features that do not need custom models. AICore handles model lifecycle. Your app calls the Edge SDK API; AICore dispatches to Tensor TPU or Qualcomm GPU fallback. Gated by Play Services 24.30+. Provides text generation, summarization, and embeddings. No custom fine-tuning. **Custom model deployment.** Ship a GGUF model for offline availability (bloats APK by 2-5GB) or download on first launch. For MDM fleets, download-on-launch with a pinned model URL + checksum via Managed Configuration is standard. Use Android Management API or Knox to push config keys: model_url, model_checksum, model_version. **Background process reliability by OEM:** - Pixel: Most permissive. Foreground service with notification keeps model in memory for minutes. - Samsung: Kills foreground services within 60-120 seconds of screen-off. Must whitelist app in Battery → Unrestricted via Knox policy. - Xiaomi: Most aggressive. Kills within 30 seconds, sometimes ignores foreground service. For Xiaomi fleets, design apps to reload models on each foreground entry. **Offline guarantees.** Inference works offline indefinitely. Model updates and license checks need periodic connectivity — design for 30-day offline grace periods with MDM-pinned model versions.

What breaks

Failure modes operators see in the wild.

- **OEM model version lag.** Samsung ships Galaxy AI models in firmware bundles updated with One UI (quarterly-to-semi-annual). Pixel gets Gemini Nano updates through Play Services within days. Symptom: Samsung fleet runs Nano versions 2-6 months behind, causing inconsistent behavior across mixed-OEM deployments. Mitigation: pin to AI Edge SDK rather than Samsung proprietary APIs. When Samsung-only features are required, test against the specific firmware build. - **NPU driver bugs on MediaTek/Exynos.** Qualcomm's NPU has 3-4x the deployment volume. Driver bugs on non-Qualcomm SoCs cause silent accuracy degradation (INT8 quantization differences producing different token output) or crashes on specific architectures. Symptom: same GGUF model produces coherent output on Snapdragon but garbled text on Dimensity. Mitigation: standardize on Snapdragon for custom model inference. Qualify each model architecture on the specific chip + firmware revision if non-Qualcomm is required. - **Memory pressure kills AI processes.** Android's LMK targets the largest memory consumer — the inference process holding 4.5GB. Symptom: user opens camera during inference; Android kills the AI process; conversation lost. Mitigation: use foreground service with persistent notification. Serialize KV cache to disk for fast recovery. Accept that background execution cannot be guaranteed — design for foreground-only inference. - **Thermal ceiling below iPhone.** Snapdragon 8 Gen 3/Elite throttles NPU after 4-6 minutes vs 8-12 minutes on A18 Pro. Samsung throttles at 42°C external; Pixel at 44°C. Symptom: 7B inference drops from 18 tok/s to 8 tok/s after 5 minutes on Galaxy S25. Mitigation: cap bursts under 3 minutes, idle 60 seconds between. Design server-side fallback for sustained loops. - **Background service fragmentation by OEM.** Xiaomi kills processes within 30 seconds even with foreground notification. Samsung allows 60-120 seconds. Pixel permits minutes. Symptom: app works on Pixel test device but dies on fleet Xiaomi devices. Mitigation: test on every OEM in your fleet. For Xiaomi/OPPO/Vivo, assume model reload on every foreground entry. - **Google Play Services dependency.** AICore requires Play Services 24.30+, distributed through Play Store. De-Googled devices (Huawei, AOSP forks) cannot use Gemini Nano. Mitigation: fall back to llama.cpp or Qualcomm AI Engine Direct SDK.

Hardware guidance

**Hobbyist: Pixel 9 (12GB, Tensor G4)** Best Gemini Nano experience. AICore pre-installed; Recorder summarization and Smart Reply work out of the box. For third-party models, llama.cpp via Termux runs 3B Q4 at 20-25 tok/s, 7B Q4 at 10-14 tok/s. Tensor TPU is locked to Google workloads — third-party NPU goes through GPU. Pixel's permissive foreground service policy makes it the best dev device. **Hobbyist: Galaxy S25 (12GB, Snapdragon 8 Elite)** Best Qualcomm NPU performance — 45+ TOPS INT4, QNN-optimized models run 30-50% faster than GPU-only. Galaxy AI features are on-device and functional. Tradeoff: One UI kills processes aggressively — manual battery whitelisting required. **SMB: Standardized S24/S25 fleet (12GB)** Single-OEM fleet via Knox MDM. Deploy custom app via Android Enterprise with Managed Configuration for model pinning. Whitelist app in Device Care → Battery via Knox policy. Cost: ~$799-999/device + Knox ($2-5/device/month). Standardize on one generation — mixing S24 and S25 doubles testing surface. **Enterprise: Custom app + MDM + private model** Build a native app wrapping [llama.cpp](/tools/llama-cpp) JNI bindings. Distribute via Managed Google Play. Use Android Management API to push model config and version pins. The hardest problem is OEM fragmentation — a deployment across Samsung + Pixel + Xiaomi requires 3x QA. If fleet homogeneity is achievable, Android on-device AI is operationally viable. If procurement cannot guarantee it, on-device Android AI is not yet enterprise-ready. **Frontier: Not applicable** On-device Android cannot serve multi-user workloads, cannot train, and is capped at 7B on 12GB devices. For Android-powered edge servers, [Jetson Orin](/hardware/nvidia-dgx-spark) running Linux is preferred — the Android stack adds OS overhead and background-kill risk with no inference benefit.

Runtime guidance

**Using Gemini Nano built-in features (summarization, smart reply)? → Google AI Edge SDK + AICore** Zero inference management. AICore handles model lifecycle. Your app calls the Edge SDK; system dispatches to Tensor TPU or Qualcomm GPU. Supports text generation, summarization, embeddings. Model quality and updates are Google-controlled. No fine-tuning. Requires Play Services 24.30+ on Pixel 8+ or Galaxy S24+. **Deploying a custom model to Snapdragon-only fleet? → Qualcomm AI Engine Direct + QNN** If the fleet is 100% Snapdragon, QNN provides the best performance. Hexagon NPU runs pre-optimized models at 30-50% higher throughput vs GPU-only with 25-40% lower battery drain. FP16 for attention, INT4/INT8 for feed-forward. Model catalog: Llama, Stable Diffusion, Whisper. Snapdragon-only — breaks on MediaTek, Exynos, or Tensor devices. **Deploying a custom model across multiple SoCs/OEMs? → llama.cpp Android** [llama.cpp](/tools/llama-cpp) NDK bindings provide the broadest hardware compatibility. GGUF models run on Adreno GPU, Mali GPU, and CPU — one APK covers Snapdragon, Tensor, MediaTek, Exynos. Tradeoff: 30-50% below QNN on Snapdragon, 10-15% below Metal on Apple Silicon. Benefit: GGUF is universal — thousands of pre-quantized models, zero toolchain setup. The practical path for any deployment that is not 100% Snapdragon. **Comparison:** - Performance (7B Q4): QNN NPU 25-35 tok/s (8 Elite), llama.cpp Vulkan 15-20, Gemini Nano N/A - SoC compatibility: QNN Snapdragon-only, llama.cpp all SoCs, Nano Tensor + select Snapdragon - Model catalog: QNN limited (3 main architectures), llama.cpp 40+ architectures, Nano Google-curated - Model lifecycle: QNN manual, llama.cpp manual, Nano automatic (Play Services) - Battery efficiency: QNN NPU best, llama.cpp GPU good, Nano excellent **Shipping a Play Store chat app?** MLC-LLM (Apache TVM + Vulkan) and [llama.cpp](/tools/llama-cpp) (NDK) are the viable frameworks. Invest in reload-time UX — snappy model loading matters more than raw token rate because Android kills inference processes frequently.

Setup walkthrough

  1. Install MLC Chat from Google Play Store (free, open-source).
  2. Open app → Model Store → download Llama 3.2 3B Q4_K_M (~2 GB).
  3. Type "Summarize the pros and cons of electric vehicles." First response in 3-8 seconds on Snapdragon 8 Gen 3 or newer.
  4. For Google AI features: on Pixel 8/9, Gemini Nano runs automatically for on-device summarization, smart reply, and Recorder transcription. No setup needed — it's built into Android.
  5. For more model variety: install ChatterUI (open-source Android LLM client) or use Termux + pip install llama-cpp-python for CLI inference.
  6. For Samsung devices: Samsung Gauss runs on-device for Galaxy AI features (translation, summarization) on Galaxy S24+.

All processing on-device. Works offline after model download.

The cheap setup

Google Pixel 8a (Tensor G3, 8 GB, ~$400-500 new). Runs Gemini Nano on-device for summarization, Recorder transcription, and smart reply. MLC Chat runs Llama 3.2 3B at 12-20 tok/s. Samsung Galaxy A55 5G (Exynos 1480, 8 GB, ~$350) runs 3B models at 8-15 tok/s via MLC Chat. For $300: used Pixel 7 Pro (Tensor G2, 12 GB, ~$300) runs 3B models at 10-18 tok/s with more RAM headroom. Android's advantage over iPhone for AI: more RAM at lower price points (12 GB on mid-range vs. 8 GB on flagship iPhone).

The serious setup

Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3, 12 GB, ~$1,300 new) or OnePlus 13 (Snapdragon 8 Elite, 12-16 GB). Runs Llama 3.2 3B at 20-35 tok/s, Qwen 2.5 7B Q4 at 10-18 tok/s. The 12-16 GB RAM allows running 7B-8B models comfortably without OS killing background apps. For Pixel fans: Pixel 9 Pro XL (Tensor G4, 16 GB, ~$1,100) runs Gemini Nano + custom models. The extra RAM on Android flagships (12-16 GB vs. iPhone's 8 GB) makes them better on-device AI platforms.

Common beginner mistake

The mistake: Installing a generic LLM app on a $200 Android phone with 4 GB RAM and a MediaTek chip, then wondering why inference is 10× slower than advertised. Why it fails: Most Android phones ship with budget SoCs (MediaTek Helio, Snapdragon 6-series) that lack a capable NPU and have slow LPDDR4 RAM. The NPU matters enormously — Snapdragon 8 Gen 3's Hexagon NPU provides ~45 TOPS vs. ~5 TOPS on budget SoCs. LPDDR5 bandwidth (50-70 GB/s) vs. LPDDR4 (20-30 GB/s) is the memory bottleneck for LLM inference. The fix: Check the SoC before buying. Minimum for usable on-device LLM: Snapdragon 8 Gen 2, Tensor G3, or Dimensity 9200+. These have capable NPUs + fast RAM. Don't trust "AI phone" marketing — check the actual SoC model.

Recommended setup for android ai

Recommended runtimes

Browse all tools for runtimes that fit this workload.

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running android ai locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle android ai before committing money.

Specialized buyer guides
Updated 2026 roundup