Can phones actually run local LLMs in 2026?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Short answer: yes, for models up to ~4B params at Q4 quantization. No, for anything 7B+ that you'd actually want to use.
Phone-class SoCs (Apple A17 Pro / A18, Snapdragon 8 Gen 3, Google Tensor G4, Dimensity 9300) have 8-12 GB of unified memory shared with everything else the OS is doing. After iOS / Android keep ~3-4 GB for the system, you have ~5-8 GB usable for a model — and the model has to coexist with whatever app is rendering the UI on top.
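The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not a measurement: the function names, the 3.5 GB default system reserve (the midpoint of the ~3-4 GB range), and the 1 GB allowance for the app and KV cache are assumptions made for the example.

```python
def usable_model_memory_gb(total_ram_gb: float, system_reserved_gb: float) -> float:
    """RAM left over once the OS has taken its share (never negative)."""
    return max(total_ram_gb - system_reserved_gb, 0.0)


def fits(model_size_gb: float,
         total_ram_gb: float,
         system_reserved_gb: float = 3.5,   # assumed midpoint of the ~3-4 GB range
         app_overhead_gb: float = 1.0) -> bool:  # assumed allowance for UI + KV cache
    """True if the weights plus the assumed overhead fit in the remaining RAM."""
    budget = usable_model_memory_gb(total_ram_gb, system_reserved_gb)
    return model_size_gb + app_overhead_gb <= budget


if __name__ == "__main__":
    for ram in (8, 12):
        for size in (1.0, 2.5, 5.0):
            verdict = "fits" if fits(size, ram) else "does not fit"
            print(f"{size} GB model on a {ram} GB phone: {verdict}")
```

Fitting in RAM is only the first hurdle; a model that fits can still be too slow or get killed when you switch apps.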
What actually runs:
| Class | Model | Q4_K_M size | Realistic tok/s |
|---|---|---|---|
| Sub-1B | Llama 3.2 1B, Qwen 2.5 0.5B | ~0.7-1 GB | 25-50 tok/s |
| 1-3B | Llama 3.2 3B | ~2-3 GB | 8-18 tok/s |
| 3-4B | Phi 3.5 mini (3.8B), Phi 4 mini, Qwen 3 4B, Gemma 3 4B | ~3-4 GB | 4-10 tok/s |
| 7B+ | Llama 3.1 8B, Qwen 2.5 7B | ~5 GB | 1-3 tok/s (unusable) |
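For models not in the table, a back-of-the-envelope size estimate is easy to compute. The sketch below assumes an average of about 4.8 bits per weight for Q4_K_M; that figure is an approximation (real files vary because some tensors are kept at higher precision), and both function names are made up for this example. The second function applies the common rule of thumb that decode is memory-bandwidth bound, so the bandwidth you pass in is your own assumption about the device.

```python
def q4_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.8 is an assumed average for Q4_K_M, not an exact figure.
    """
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed if every token reads all weights once.

    bandwidth_gb_s is the caller's estimate of effective memory bandwidth.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    for name, params in (("1B", 1.0), ("3B", 3.0), ("4B", 4.0), ("8B", 8.0)):
        print(f"{name}: ~{q4_size_gb(params):.1f} GB at Q4_K_M (estimated)")
```

Treat the ceiling as an optimistic bound: the community-reported rates in the table sit below it once thermal limits and runtime overhead are counted.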
Apps that work today:
- iOS: Private LLM, Pocket LLM, Apollo, MLC Chat (free), Layla (free + paid).
- Android: MLC Chat (open source), Layla, PocketPal (open source), MaidApp, Cactus.
- Cross-platform via tunneling: Ollama on a desktop + an iOS / Android client over Tailscale. Strictly speaking the model isn't running on the phone — but the user experience is "AI on my phone, private, no cloud."
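For the tunneling option, the client side amounts to one HTTP call to Ollama's `/api/generate` endpoint. The sketch below is a minimal illustration using only the Python standard library; the hostname is a hypothetical Tailscale MagicDNS name, and the `qwen3:14b` model tag is an assumption about what you have pulled on the desktop. Ollama listens on localhost by default, so the desktop will likely need `OLLAMA_HOST` set to an address reachable from the tailnet.

```python
import json
import urllib.request

# Hypothetical tailnet address of the desktop running Ollama (default port 11434).
OLLAMA_HOST = "http://desktop.example-tailnet.ts.net:11434"


def build_generate_request(prompt: str,
                           model: str = "qwen3:14b",  # assumed model tag
                           host: str = OLLAMA_HOST) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text from the JSON reply."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this note: ...")` from any device on the tailnet returns the completion as a string; a phone client app does the equivalent under the hood.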
Where the wall is — three real limits, in order of how often they bite:
- RAM ceiling. Most phones above the budget tier ship with 8 GB, including the iPhone 16 Pro; the iPhone 17 Pro and a few Android flagships ship with 12 GB. Once you've eaten 5+ GB for the model, the OS will start aggressively killing the app when you switch to anything else. A model bigger than ~4B is not realistically usable on a phone you also need to use as a phone.
- Thermal throttling. Long generations (>1 minute) heat the SoC into the throttle zone, dropping decode rates by 30-60%. Phones are not designed for sustained compute. If you're getting 10 tok/s in the first 30 seconds, expect 4-7 by the two-minute mark.
- No Metal / NPU acceleration in most apps. The Apple Neural Engine and Qualcomm NPU could in principle accelerate decode by 3-5×, but most apps still run the entire workload on the CPU. The few apps using the NPU (Apple's on-device Genmoji model, some MLC builds) show the headroom; the typical chat app does not.
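To see what the throttling limit means for a given phone, apply the 30-60% drop to whatever rate you measure in the first half minute. The sketch below does only that arithmetic; the function names are invented for the example, and the drop range is the community-reported one quoted above, not something measured here.

```python
def throttled_rate(initial_tok_s: float, drop_fraction: float) -> float:
    """Decode rate after the SoC throttles by the given fraction (0.0 to 1.0)."""
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    return initial_tok_s * (1.0 - drop_fraction)


def throttled_range(initial_tok_s: float,
                    min_drop: float = 0.30,
                    max_drop: float = 0.60) -> tuple[float, float]:
    """(worst case, best case) sustained rate for the quoted 30-60% drop."""
    return (throttled_rate(initial_tok_s, max_drop),
            throttled_rate(initial_tok_s, min_drop))


def seconds_for(tokens: int, tok_s: float) -> float:
    """Time to decode a response of the given length at a steady rate."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return tokens / tok_s


if __name__ == "__main__":
    low, high = throttled_range(10.0)
    print(f"sustained: {low:.1f}-{high:.1f} tok/s")
    print(f"500 tokens at the low end: {seconds_for(500, low):.0f} s")
```

The second helper is the more useful one in practice: it turns a tok/s figure into how long you will actually wait for a long answer once the phone is hot.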
Should you bother? Depends on the workload:
- Yes for a private offline chat companion you'd use occasionally, a draft-mode writing aid, or a privacy-sensitive note summarizer.
- No for any agent loop, coding assistance, RAG over real documents, or anything you'd use multiple times a day. A $700 used RTX 3090 running Qwen 3 14B over Tailscale is a better mobile-friendly setup than even the best phone-side build.
Where we got the numbers
tok/s envelopes from MLC LLM benchmark reports, Private LLM developer notes, and Layla user posts on r/LocalLLaMA (Apr-May 2026). RAM ceilings from manufacturer specs. We don't have a measured phone-bench harness yet — treat these numbers as community-reported until we do.