Can phones actually run local LLMs in 2026?

Reviewed May 15, 2026 | By Fredoline Eruo | 2 min read
Tags: mobile, phone, edge, llama-3-2, small-models

The answer

One paragraph. No hedging beyond what the data actually warrants.

Short answer: yes — for models under ~4B params at Q4. No — for anything 7B+ that you'd actually want to use.

Phone-class SoCs (Apple A17 Pro / A18, Snapdragon 8 Gen 3, Google Tensor G4, Dimensity 9300) have 8-12 GB of unified memory shared with everything else the OS is doing. After iOS / Android keep ~3-4 GB for the system, you have ~5-8 GB usable for a model — and the model has to coexist with whatever app is rendering the UI on top.
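The budget arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not a measurement tool: the function names and the 0.5 GB runtime overhead (KV cache, buffers, the app's own UI) are assumptions made for the example. The 5-8 GB range quoted above corresponds to 8 GB with ~3 GB reserved and 12 GB with ~4 GB reserved.

```python
def usable_model_budget_gb(total_ram_gb: float, os_reserve_gb: float) -> float:
    """RAM left over once the OS has taken its share (never negative)."""
    return max(total_ram_gb - os_reserve_gb, 0.0)


def fits(model_gb: float, total_ram_gb: float, os_reserve_gb: float,
         runtime_overhead_gb: float = 0.5) -> bool:
    """True if the weights plus an assumed runtime overhead fit in the budget.

    runtime_overhead_gb is a guess for illustration, not a measured value.
    """
    budget = usable_model_budget_gb(total_ram_gb, os_reserve_gb)
    return model_gb + runtime_overhead_gb <= budget


if __name__ == "__main__":
    # Low and high ends of the range described in the text.
    print(usable_model_budget_gb(8, 3))    # 8 GB phone, ~3 GB kept by the OS
    print(usable_model_budget_gb(12, 4))   # 12 GB phone, ~4 GB kept by the OS
    print(fits(model_gb=5.0, total_ram_gb=8, os_reserve_gb=3))
```

Fitting in the budget is a necessary condition, not a sufficient one: a model that just fits still leaves nothing for other apps.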

What actually runs:

Class  | Model                                   | Q4_K_M size | Realistic tok/s
Sub-1B | Llama 3.2 1B, Qwen 2.5 0.5B             | ~0.7-1 GB   | 25-50
1-3B   | Llama 3.2 3B, Phi 3.5 mini, Qwen 3 4B   | ~2-3 GB     | 8-18
3-4B   | Phi 4 mini, Gemma 3 4B                  | ~3-4 GB     | 4-10
7B+    | Llama 3.1 8B, Qwen 3 7B                 | ~5 GB       | 1-3 (unusable)
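As a sanity check on the size column, weight size can be approximated from parameter count and average bits per weight. The sketch below assumes roughly 4.8 bits per weight for Q4_K_M, which is an approximation chosen for this example; real files differ because some tensors are stored at higher precision and the file carries metadata.

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.8 is an assumed average for Q4_K_M, not an exact figure.
    params_billions * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB.
    """
    if params_billions < 0 or bits_per_weight <= 0:
        raise ValueError("parameter count must be >= 0 and bits per weight > 0")
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal parameter counts, used only to illustrate the scaling.
    for params in (1.0, 3.0, 4.0, 8.0):
        print(f"{params:.0f}B params -> ~{quantized_size_gb(params):.1f} GB")
```

The estimate covers weights only. The KV cache grows with context length and comes on top of it.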

Apps that work today:

  • iOS: Private LLM, Pocket LLM, Apollo, MLC Chat (free), Layla (free + paid).
  • Android: MLC Chat (open source), Layla, PocketPal (open source), MaidApp, Cactus.
  • Cross-platform via tunneling: Ollama on a desktop + an iOS / Android client over Tailscale. Strictly speaking the model isn't running on the phone — but the user experience is "AI on my phone, private, no cloud."
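For the tunneling setup, a client only needs to reach the desktop's Ollama HTTP API over the tailnet. The sketch below uses Ollama's /api/generate endpoint on its default port 11434. The hostname is a made-up placeholder, and the model tag qwen3:14b is an assumption about what has been pulled on the desktop. It also assumes Ollama has been configured to listen on an address reachable from the tailnet (for example through the OLLAMA_HOST environment variable), since by default it binds to localhost.

```python
import json
import urllib.request

# Hypothetical Tailscale hostname for the desktop; replace with your own.
OLLAMA_URL = "http://my-desktop.example-tailnet.ts.net:11434/api/generate"


def build_request(prompt: str, model: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen3:14b", timeout: float = 120.0) -> str:
    """Send the prompt to the desktop and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read())
    # With "stream": False, the reply is a single JSON object with a "response" field.
    return body["response"]
```

Calling generate("Summarize this note: ...") from any device on the tailnet returns the completion as a string. The existing iOS and Android clients do the equivalent of this under their own UI.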

Where the wall is — three real limits, in order of how often they bite:

  1. RAM ceiling. Most phones above the budget tier ship with 8 GB, the iPhone 16 Pro included; 12 GB shows up on some Android flagships. Once you've eaten 5+ GB for the model, the OS will start aggressively killing the app when you switch to anything else. A model bigger than ~4B is not realistically usable on a phone you also need to use as a phone.
  2. Thermal throttling. Long generations (>1 minute) heat the SoC into the throttle zone, dropping decode rates by 30-60%. Phones are not designed for sustained compute. If you're getting 10 tok/s in the first 30 seconds, expect 5-7 by the two-minute mark.
  3. No GPU (Metal, Vulkan) or NPU acceleration in most apps. The Apple Neural Engine and Qualcomm NPU could in principle accelerate decode by 3-5×, but most apps still run the entire workload on the CPU. The few apps using the NPU (Apple's on-device Genmoji model, some MLC builds) show the headroom; the typical chat app does not.
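The throttling figures in the second limit can be turned into a rough time estimate. The sketch below models throttling as a single step: full speed until a cutoff, then a fixed fractional drop. That is a simplification, since real devices ramp down gradually, and the 60-second cutoff and function names are assumptions for illustration.

```python
def throttled_rate(initial_tok_s: float, drop_fraction: float) -> float:
    """Decode rate after throttling, given a fractional drop (0.3 means 30%)."""
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    return initial_tok_s * (1.0 - drop_fraction)


def generation_seconds(n_tokens: int, initial_tok_s: float, drop_fraction: float,
                       throttle_after_s: float = 60.0) -> float:
    """Estimated seconds to decode n_tokens under a one-step throttle model.

    throttle_after_s is an assumed cutoff, loosely based on the ">1 minute" in the text.
    """
    tokens_before_throttle = initial_tok_s * throttle_after_s
    if n_tokens <= tokens_before_throttle:
        return n_tokens / initial_tok_s
    remaining = n_tokens - tokens_before_throttle
    return throttle_after_s + remaining / throttled_rate(initial_tok_s, drop_fraction)


if __name__ == "__main__":
    # A 30-60% drop from 10 tok/s, as described in the text.
    print(throttled_rate(10, 0.3), throttled_rate(10, 0.6))
    print(generation_seconds(1200, initial_tok_s=10, drop_fraction=0.5))
```

Under this model, a long generation takes noticeably longer than the first-30-seconds rate suggests, which is why short benchmark runs overstate sustained phone performance.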

Should you bother? Depends on the workload:

  • Yes for a private offline chat companion you'd use occasionally, draft-mode writing aid, or a privacy-sensitive note-summarizer.
  • No for any agent loop, coding assistance, RAG over real documents, or anything you'd use multiple times a day. A $700 used RTX 3090 running Qwen 3 14B over Tailscale is a better mobile-friendly setup than even the best phone-side build.

Where we got the numbers

Tok/s envelopes come from MLC LLM benchmark reports, Private LLM developer notes, and Layla user posts on r/LocalLLaMA (Apr-May 2026). RAM ceilings come from manufacturer specs. We don't have a measured phone-bench harness yet, so treat these numbers as community-reported until we do.
