Capability notes
On-device Android AI in 2026 operates across three fragmented lanes: **Gemini Nano** (Google AICore system service), **Samsung Galaxy AI** (One UI features), and **Qualcomm AI Engine** (Snapdragon NPU). The fragmentation has operational consequences.
**Gemini Nano** ships via AICore, a system service that loads 1.8B-3.25B-parameter models on the GPU or NPU. It powers on-device summarization, smart reply, and proofreading on Pixel 8+ and Galaxy S24+. AICore manages model download (1-4GB) and updates through Play Services — operators do not manage the model lifecycle. Availability depends on hardware (Tensor G3/G4 minimum) and OEM integration lag (Samsung ships Nano features 2-6 months behind Pixel). AICore has been backported to [snapdragon-8-gen-3](/hardware/snapdragon-8-gen-3) devices with 8GB+ RAM via GPU fallback; NPU acceleration remains Tensor-only for Nano.
**Samsung Galaxy AI** layers proprietary on-device models on top of Gemini Nano. Live Translate, Chat Assist, and Note Assist run locally on S24/S25 with 12GB+ RAM. These models are locked to One UI with no third-party app access. Updates arrive with One UI version bumps (quarterly-to-semi-annual). Knox MDM can disable specific features but cannot load custom models into Galaxy AI.
**Qualcomm AI Engine Direct SDK** provides NPU access for third-party apps. The Hexagon NPU on [snapdragon-8-elite](/hardware/snapdragon-8-elite) delivers 45+ INT4 TOPS. Qualcomm AI Hub distributes pre-optimized QNN-format models (Llama, Stable Diffusion, Whisper). Mixed-precision FP16/INT8/INT4 yields 25-40% better battery efficiency than GPU-only inference.
**Open-weight inference** via [llama.cpp](/tools/llama-cpp) Android runs through Termux or NDK bindings. Performance is GPU-bound on Adreno at ~60-70% of equivalent Apple Metal paths. Ceiling: 7B Q4 at 4K on 12GB devices; 8GB restricted to 3B class. The critical constraint: Android's memory pressure kills background processes aggressively, and OEM fragmentation means kill policies differ across Samsung, Xiaomi, and Pixel.
If you just want to try this
Lowest-friction path to a working setup.
On a Pixel 9 or Samsung Galaxy S25 with 12GB+ RAM, Gemini Nano is pre-installed. For running your own open-weight models, the simplest path uses **MLC Chat** from the Google Play Store.
Path 1: Use what's built in. Open Recorder on a Pixel 8+, record a conversation, tap "Summarize." This runs entirely on-device via Gemini Nano through AICore — no network, no setup. Smart Reply in Gboard also uses Nano on-device.
Path 2: Install MLC Chat from Google Play. The app downloads quantized models compiled for the Adreno GPU (Apache TVM + Vulkan). Download **Qwen-3-8B (Q4_K_M)** at 4.2GB or **Llama-3.2-3B (Q4_K_M)** at 1.9GB. The 3B loads in 6-8 seconds on [Snapdragon 8 Gen 3](/hardware/snapdragon-8-gen-3), runs at 25-35 tok/s with 12GB RAM, and fits with comfortable headroom. The 8B model runs at 12-18 tok/s — usable but tight on 8GB devices.
Path 3: For Snapdragon-only devices, Qualcomm AI Hub provides pre-compiled QNN models optimized for the Hexagon NPU. These run 30-50% faster than GGUF GPU inference on the same device. The tradeoff: QNN is Snapdragon-only; MediaTek and Exynos devices are unsupported.
What you get: on-device chat, summarization, and writing assistance. The experience is similar to iPhone on-device AI but fragmented — Nano behavior differs between a Pixel 9 and a Galaxy S25, and even devices sharing a SoC diverge because of OEM thermal policies. Expect 15-25 tok/s on 7B Q4 on [Snapdragon 8 Elite](/hardware/snapdragon-8-elite), 10-15 tok/s on [Tensor G4](/hardware/google-tensor-g4).
For production deployment
Operator-grade recommendation.
Deploying on-device AI to an Android fleet requires planning for OEM fragmentation, MDM policy variation, and background process reliability.
**Target device selection.** Minimum: [snapdragon-8-gen-3](/hardware/snapdragon-8-gen-3) or [google-tensor-g4](/hardware/google-tensor-g4) with 12GB RAM. 8GB devices should be treated as 3B-model-only. Samsung S24/S25 and Pixel 9 are the only families with consistent AICore + NPU maturity. MediaTek Dimensity and Exynos have spotty NPU driver quality — Qualcomm's NPU SDK has 3-4x the deployment volume. For enterprise fleets, standardize on one OEM and SoC generation.
**Gemini Nano via AI Edge SDK.** Lowest-friction path for features that do not need custom models. AICore handles model lifecycle. Your app calls the Edge SDK API; AICore dispatches to Tensor TPU or Qualcomm GPU fallback. Gated by Play Services 24.30+. Provides text generation, summarization, and embeddings. No custom fine-tuning.
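Roughly, the app-side call looks like the sketch below — a minimal summarization helper assuming the experimental AI Edge SDK surface (`GenerativeModel`, `generationConfig`, `generateContent` in `com.google.ai.edge.aicore`). Exact package, builder fields, and defaults shift between experimental releases, so verify against the SDK version you pin.

```kotlin
import android.content.Context
import com.google.ai.edge.aicore.GenerativeModel
import com.google.ai.edge.aicore.generationConfig

// Minimal on-device summarization helper. AICore resolves the model and dispatches
// to the Tensor TPU or GPU fallback; no network call is made at inference time.
suspend fun summarizeOnDevice(appContext: Context, notes: String): String? {
    val model = GenerativeModel(
        generationConfig = generationConfig {
            context = appContext        // required: binds the session to AICore
            temperature = 0.2f          // low temperature suits summarization
            topK = 16
            maxOutputTokens = 256
        }
    )
    return try {
        model.generateContent("Summarize the following notes:\n$notes").text
    } finally {
        model.close()                   // release the AICore session when done
    }
}
```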
**Custom model deployment.** Ship a GGUF model for offline availability (bloats APK by 2-5GB) or download on first launch. For MDM fleets, download-on-launch with a pinned model URL + checksum via Managed Configuration is standard. Use Android Management API or Knox to push config keys: model_url, model_checksum, model_version.
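On the device side, reading that pin goes through the standard Android Enterprise managed-configuration API. A minimal sketch, assuming the key names above (`model_url`, `model_checksum`, `model_version`) are what your MDM pushes:

```kotlin
import android.content.Context
import android.content.RestrictionsManager

data class ModelPin(val url: String, val sha256: String, val version: String)

// Reads the managed configuration bundle pushed via Android Management API or Knox.
// Returns null until the MDM has delivered a complete pin.
fun readModelPin(context: Context): ModelPin? {
    val rm = context.getSystemService(Context.RESTRICTIONS_SERVICE) as RestrictionsManager
    val restrictions = rm.applicationRestrictions
    return ModelPin(
        url = restrictions.getString("model_url") ?: return null,
        sha256 = restrictions.getString("model_checksum") ?: return null,
        version = restrictions.getString("model_version") ?: return null
    )
}
```

Listen for `Intent.ACTION_APPLICATION_RESTRICTIONS_CHANGED` to pick up a new pin without requiring an app restart.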
**Background process reliability by OEM:**
- Pixel: Most permissive. Foreground service with notification keeps the model in memory for minutes (see the service sketch after this list).
- Samsung: Kills foreground services within 60-120 seconds of screen-off. Must whitelist app in Battery → Unrestricted via Knox policy.
- Xiaomi: Most aggressive. Kills within 30 seconds, sometimes ignores foreground service. For Xiaomi fleets, design apps to reload models on each foreground entry.
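The pattern the Pixel and Samsung rows rely on is a plain foreground service that holds the loaded model behind a persistent notification. A minimal sketch — `InferenceEngine` is a hypothetical stand-in for whatever wraps your llama.cpp or QNN runtime, and the service (plus a `foregroundServiceType` on Android 14+) must be declared in the manifest:

```kotlin
import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder
import java.io.File

// Hypothetical stand-in for your llama.cpp / QNN wrapper.
class InferenceEngine private constructor() : AutoCloseable {
    companion object { fun load(modelFile: File): InferenceEngine = InferenceEngine() }
    override fun close() {}
}

class InferenceService : Service() {
    private var engine: InferenceEngine? = null

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        val channel = NotificationChannel(
            "inference", "On-device AI", NotificationManager.IMPORTANCE_LOW)
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
        val notification = Notification.Builder(this, "inference")
            .setContentTitle("Assistant running on-device")
            .setSmallIcon(android.R.drawable.stat_notify_sync)
            .build()
        startForeground(1, notification)   // deprioritizes the process for the OOM killer
        if (engine == null) {
            engine = InferenceEngine.load(File(filesDir, "model.gguf"))
        }
        return START_STICKY
    }

    override fun onDestroy() {
        engine?.close()   // Xiaomi may kill the process without calling this; persist state early
        engine = null
        super.onDestroy()
    }

    override fun onBind(intent: Intent?): IBinder? = null
}
```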
**Offline guarantees.** Inference works offline indefinitely. Model updates and license checks need periodic connectivity — design for 30-day offline grace periods with MDM-pinned model versions.
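One way to enforce that window on-device, sketched below: record the last successful check-in and keep serving local inference until the 30-day grace period lapses. The preference and key names are illustrative, not from any SDK.

```kotlin
import android.content.Context
import java.util.concurrent.TimeUnit

private const val GRACE_DAYS = 30L
private const val PREFS = "model_meta"
private const val KEY_LAST_CHECK_IN = "last_check_in_ms"

// Call after each successful model-update or license check against your backend.
fun recordCheckIn(context: Context) {
    context.getSharedPreferences(PREFS, Context.MODE_PRIVATE)
        .edit().putLong(KEY_LAST_CHECK_IN, System.currentTimeMillis()).apply()
}

// True while the MDM-pinned model may keep running without connectivity.
fun withinOfflineGrace(context: Context): Boolean {
    val last = context.getSharedPreferences(PREFS, Context.MODE_PRIVATE)
        .getLong(KEY_LAST_CHECK_IN, 0L)
    return System.currentTimeMillis() - last <= TimeUnit.DAYS.toMillis(GRACE_DAYS)
}
```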
What breaks
Failure modes operators see in the wild.
- **OEM model version lag.** Samsung ships Galaxy AI models in firmware bundles updated with One UI (quarterly-to-semi-annual). Pixel gets Gemini Nano updates through Play Services within days. Symptom: Samsung fleet runs Nano versions 2-6 months behind, causing inconsistent behavior across mixed-OEM deployments. Mitigation: pin to AI Edge SDK rather than Samsung proprietary APIs. When Samsung-only features are required, test against the specific firmware build.
- **NPU driver bugs on MediaTek/Exynos.** Qualcomm's NPU has 3-4x the deployment volume. Driver bugs on non-Qualcomm SoCs cause silent accuracy degradation (INT8 quantization differences producing different token output) or crashes on specific architectures. Symptom: same GGUF model produces coherent output on Snapdragon but garbled text on Dimensity. Mitigation: standardize on Snapdragon for custom model inference. Qualify each model architecture on the specific chip + firmware revision if non-Qualcomm is required.
- **Memory pressure kills AI processes.** Android's low-memory killer targets the largest memory consumer — usually the inference process holding 4.5GB. Symptom: user opens camera during inference; Android kills the AI process; conversation lost. Mitigation: use a foreground service with a persistent notification. Serialize the KV cache to disk for fast recovery. Accept that background execution cannot be guaranteed — design for foreground-only inference.
- **Thermal ceiling below iPhone.** Snapdragon 8 Gen 3/Elite throttles the NPU after 4-6 minutes vs 8-12 minutes on A18 Pro. Samsung throttles at 42°C external; Pixel at 44°C. Symptom: 7B inference drops from 18 tok/s to 8 tok/s after 5 minutes on a Galaxy S25. Mitigation: cap bursts under 3 minutes and idle 60 seconds between them (see the duty-cycle sketch after this list). Design server-side fallback for sustained loops.
- **Background service fragmentation by OEM.** Xiaomi kills processes within 30 seconds even with foreground notification. Samsung allows 60-120 seconds. Pixel permits minutes. Symptom: app works on Pixel test device but dies on fleet Xiaomi devices. Mitigation: test on every OEM in your fleet. For Xiaomi/OPPO/Vivo, assume model reload on every foreground entry.
- **Google Play Services dependency.** AICore requires Play Services 24.30+, distributed through Play Store. De-Googled devices (Huawei, AOSP forks) cannot use Gemini Nano. Mitigation: fall back to llama.cpp or Qualcomm AI Engine Direct SDK.
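The duty-cycle mitigation from the thermal bullet above can be as simple as the sketch below: cap each burst at roughly 3 minutes and force a cooldown before the next one. The thresholds are the ones quoted above, not values read from any vendor SDK, and cancellation only takes effect if the inference loop suspends between tokens.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.withTimeoutOrNull

// Wraps inference bursts so sustained loops cannot push the SoC into throttling.
class ThermalBudget(
    private val maxBurstMs: Long = 3 * 60_000L,   // stay under the 4-6 minute throttle point
    private val cooldownMs: Long = 60_000L        // idle time between bursts
) {
    private var lastBurstEndMs = 0L

    // Returns null when the burst hit the cap — callers should fall back to server-side.
    suspend fun <T> runBurst(block: suspend () -> T): T? {
        val sinceLast = System.currentTimeMillis() - lastBurstEndMs
        if (sinceLast < cooldownMs) delay(cooldownMs - sinceLast)
        return try {
            withTimeoutOrNull(maxBurstMs) { block() }
        } finally {
            lastBurstEndMs = System.currentTimeMillis()
        }
    }
}
```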
Hardware guidance
**Hobbyist: Pixel 9 (12GB, Tensor G4)**
Best Gemini Nano experience. AICore pre-installed; Recorder summarization and Smart Reply work out of the box. For third-party models, llama.cpp via Termux runs 3B Q4 at 20-25 tok/s, 7B Q4 at 10-14 tok/s. The Tensor TPU is locked to Google workloads, so third-party inference falls back to the GPU. Pixel's permissive foreground service policy makes it the best dev device.
**Hobbyist: Galaxy S25 (12GB, Snapdragon 8 Elite)**
Best Qualcomm NPU performance — 45+ TOPS INT4, QNN-optimized models run 30-50% faster than GPU-only. Galaxy AI features are on-device and functional. Tradeoff: One UI kills processes aggressively — manual battery whitelisting required.
**SMB: Standardized S24/S25 fleet (12GB)**
Single-OEM fleet via Knox MDM. Deploy custom app via Android Enterprise with Managed Configuration for model pinning. Whitelist app in Device Care → Battery via Knox policy. Cost: ~$799-999/device + Knox ($2-5/device/month). Standardize on one generation — mixing S24 and S25 doubles testing surface.
**Enterprise: Custom app + MDM + private model**
Build a native app wrapping [llama.cpp](/tools/llama-cpp) JNI bindings. Distribute via Managed Google Play. Use Android Management API to push model config and version pins. The hardest problem is OEM fragmentation — a deployment across Samsung + Pixel + Xiaomi requires 3x QA. If fleet homogeneity is achievable, Android on-device AI is operationally viable. If procurement cannot guarantee it, on-device Android AI is not yet enterprise-ready.
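For the download-on-first-launch path, verifying the MDM-pinned checksum before loading is the non-negotiable step. A minimal sketch using only stock `java.net`/`java.security` (the URL and checksum come from the Managed Configuration keys described earlier; a production build would add resume and retry handling):

```kotlin
import java.io.File
import java.net.URL
import java.security.MessageDigest

// Streams the pinned model to disk while hashing it; never load a file that fails the check.
fun downloadAndVerify(modelUrl: String, expectedSha256: String, dest: File): Boolean {
    val digest = MessageDigest.getInstance("SHA-256")
    URL(modelUrl).openStream().use { input ->
        dest.outputStream().use { output ->
            val buffer = ByteArray(1 shl 20)   // 1 MiB chunks for multi-GB models
            while (true) {
                val read = input.read(buffer)
                if (read < 0) break
                digest.update(buffer, 0, read)
                output.write(buffer, 0, read)
            }
        }
    }
    val actual = digest.digest().joinToString("") { "%02x".format(it.toInt() and 0xff) }
    if (!actual.equals(expectedSha256, ignoreCase = true)) {
        dest.delete()
        return false
    }
    return true
}
```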
**Frontier: Not applicable**
On-device Android cannot serve multi-user workloads, cannot train, and is capped at 7B on 12GB devices. For Android-powered edge servers, [Jetson Orin](/hardware/nvidia-dgx-spark) running Linux is preferred — the Android stack adds OS overhead and background-kill risk with no inference benefit.
Runtime guidance
**Using Gemini Nano built-in features (summarization, smart reply)? → Google AI Edge SDK + AICore**
Zero inference management. AICore handles model lifecycle. Your app calls the Edge SDK; system dispatches to Tensor TPU or Qualcomm GPU. Supports text generation, summarization, embeddings. Model quality and updates are Google-controlled. No fine-tuning. Requires Play Services 24.30+ on Pixel 8+ or Galaxy S24+.
**Deploying a custom model to Snapdragon-only fleet? → Qualcomm AI Engine Direct + QNN**
If the fleet is 100% Snapdragon, QNN provides the best performance. Hexagon NPU runs pre-optimized models at 30-50% higher throughput vs GPU-only with 25-40% lower battery drain. FP16 for attention, INT4/INT8 for feed-forward. Model catalog: Llama, Stable Diffusion, Whisper. Snapdragon-only — breaks on MediaTek, Exynos, or Tensor devices.
**Deploying a custom model across multiple SoCs/OEMs? → llama.cpp Android**
[llama.cpp](/tools/llama-cpp) NDK bindings provide the broadest hardware compatibility. GGUF models run on Adreno GPU, Mali GPU, and CPU — one APK covers Snapdragon, Tensor, MediaTek, Exynos. Tradeoff: 30-50% below QNN on Snapdragon, 10-15% below Metal on Apple Silicon. Benefit: GGUF is universal — thousands of pre-quantized models, zero toolchain setup. The practical path for any deployment that is not 100% Snapdragon.
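The binding layer is typically a thin Kotlin object over JNI. In the sketch below the library name (`llama-android`) and the three native functions are hypothetical — they show the shape of the bridge, not symbols shipped by llama.cpp or any particular binding:

```kotlin
import java.io.File

// Thin JNI bridge; each external fun maps to a C function that wraps llama.cpp calls.
object LlamaBridge {
    init {
        System.loadLibrary("llama-android")   // hypothetical .so built from llama.cpp via the NDK
    }

    external fun loadModel(path: String, nCtx: Int): Long          // returns an opaque handle
    external fun generate(handle: Long, prompt: String, maxTokens: Int): String
    external fun freeModel(handle: Long)
}

fun runOnce(modelFile: File, prompt: String): String {
    val handle = LlamaBridge.loadModel(modelFile.absolutePath, 4096)
    return try {
        LlamaBridge.generate(handle, prompt, 128)
    } finally {
        LlamaBridge.freeModel(handle)   // free native memory promptly; RAM headroom is tight
    }
}
```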
**Comparison:**

| | QNN (Hexagon NPU) | llama.cpp | Gemini Nano |
|---|---|---|---|
| Performance (7B Q4) | 25-35 tok/s (8 Elite) | 15-20 tok/s (Vulkan) | N/A |
| SoC compatibility | Snapdragon-only | All SoCs | Tensor + select Snapdragon |
| Model catalog | Limited (3 main architectures) | 40+ architectures | Google-curated |
| Model lifecycle | Manual | Manual | Automatic (Play Services) |
| Battery efficiency | Best (NPU) | Good (GPU) | Excellent |
**Shipping a Play Store chat app?** MLC-LLM (Apache TVM + Vulkan) and [llama.cpp](/tools/llama-cpp) (NDK) are the viable frameworks. Invest in reload-time UX — snappy model loading matters more than raw token rate because Android kills inference processes frequently.