Can phones actually run local LLMs in 2026?

The answer

One paragraph. No hedging beyond what the data actually warrants.

Short answer: yes, for models up to ~4B parameters at Q4 quantization. No, for 7B+ models: they fit in memory on some phones but decode too slowly to be worth using.

Phones built on current flagship SoCs (Apple A17 Pro / A18, Snapdragon 8 Gen 3, Google Tensor G4, Dimensity 9300) ship with 8-12 GB of unified memory, shared with everything else the OS is doing. After iOS or Android reserves ~3-4 GB for the system, roughly 5-8 GB is left for a model, and the model has to coexist with whatever app is rendering the UI on top.

What actually runs:

Class	Model	Q4_K_M size	Realistic tok/s
Sub-1B	Llama 3.2 1B, Qwen 2.5 0.5B	~0.7-1 GB	25-50 tok/s
1-3B	Llama 3.2 3B, Phi 3.5 mini, Qwen 3 4B	~2-3 GB	8-18 tok/s
3-4B	Phi 4 mini, Gemma 3 4B	~3-4 GB	4-10 tok/s
7B+	Llama 3.1 8B, Qwen 3 8B	~5 GB	1-3 tok/s (unusable)
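
The size column can be roughly cross-checked from parameter counts. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M, which is an approximation and not a figure from the sources cited here. The parameter counts are approximate, and the function names, the 4 GB OS reservation default, and the 1 GB headroom are illustrative choices.

```python
# Rough sizing sketch. BITS_PER_WEIGHT is an assumed average for Q4_K_M,
# not a measured value; real files vary by architecture.
BITS_PER_WEIGHT = 4.85


def q4_size_gb(params_billion: float) -> float:
    """Approximate Q4_K_M weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT / 8


def fits_on_phone(params_billion: float, total_ram_gb: float,
                  os_reserved_gb: float = 4.0, headroom_gb: float = 1.0) -> bool:
    """True if the weights plus some headroom fit in the RAM left after the
    OS reservation. Both defaults are illustrative assumptions."""
    usable_gb = total_ram_gb - os_reserved_gb
    return q4_size_gb(params_billion) + headroom_gb <= usable_gb


# Approximate parameter counts, in billions.
for name, params in [("Llama 3.2 1B", 1.24), ("Llama 3.2 3B", 3.21), ("Llama 3.1 8B", 8.03)]:
    print(f"{name}: ~{q4_size_gb(params):.1f} GB, fits on an 8 GB phone: {fits_on_phone(params, 8)}")
```

This only answers the memory question. An 8B model passes the same check on a 12 GB phone, but the tok/s column still puts it in unusable territory.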

Apps that work today:

iOS: Private LLM, Pocket LLM, Apollo, MLC Chat (free), Layla (free + paid).
Android: MLC Chat (open source), Layla, PocketPal (open source), MaidApp, Cactus.
Cross-platform via tunneling: Ollama on a desktop + an iOS / Android client over Tailscale. Strictly speaking the model isn't running on the phone — but the user experience is "AI on my phone, private, no cloud."
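
For the tunneling option, a minimal client sketch using only the Python standard library is shown below. It calls Ollama's `/api/generate` endpoint without streaming. The hostname is a hypothetical Tailscale name and the model tag is an assumption; substitute whatever your desktop actually serves.

```python
import json
import urllib.request

# Hypothetical Tailscale MagicDNS name for the desktop; replace with your own.
OLLAMA_URL = "http://my-desktop.example-tailnet.ts.net:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2:3b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt to the desktop and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a reachable Ollama server):
#   print(ask("Summarize this note in one sentence: ..."))
```

Ollama listens on 127.0.0.1 by default, so the desktop likely needs `OLLAMA_HOST` set to an address reachable from the tailnet before a phone client can connect.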

Where the wall is — three real limits, in order of how often they bite:

RAM ceiling. Most phones above the budget tier ship with 8 GB; the iPhone 17 Pro and a few Android flagships ship with 12 GB. Once you've eaten 5+ GB for the model, the OS will start aggressively killing the app when you switch to anything else. A model bigger than ~4B is not realistically usable on a phone you also need to use as a phone.
Thermal throttling. Long generations (>1 minute) heat the SoC into the throttle zone, dropping decode rates by 30-60%. Phones are not designed for sustained compute. If you're getting 10 tok/s in the first 30 seconds, expect 5-7 by the two-minute mark.
No Metal / NPU acceleration in most apps. (CUDA is NVIDIA-only and not available on phones at all.) The Apple Neural Engine and Qualcomm NPU could in principle accelerate decode by 3-5×, but most apps still run the entire workload on the CPU. The few apps using the NPU (Apple's on-device Genmoji model, some MLC builds) show the headroom; the typical chat app does not.
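
The thermal-throttling limit above can be put into numbers with a small sketch. It models throttling as a single step down in decode rate after a fixed time. That is a simplification, since real devices ramp down gradually, and the function name and default values are illustrative, not taken from any benchmark.

```python
def generation_time_s(n_tokens: int, initial_tps: float,
                      throttle_after_s: float = 60.0, drop: float = 0.4) -> float:
    """Seconds to decode n_tokens if the rate falls by `drop` (a fraction)
    once throttle_after_s has elapsed. Step-change model, an assumption."""
    tokens_before_throttle = initial_tps * throttle_after_s
    if n_tokens <= tokens_before_throttle:
        return n_tokens / initial_tps
    throttled_tps = initial_tps * (1.0 - drop)
    return throttle_after_s + (n_tokens - tokens_before_throttle) / throttled_tps


# Compare the low and high ends of the 30-60% drop quoted above.
for drop in (0.3, 0.6):
    seconds = generation_time_s(1500, 10.0, drop=drop)
    print(f"{drop:.0%} drop: {seconds:.0f} s for 1500 tokens at 10 tok/s initial")
```

Short replies finish before the throttle point and are unaffected; long generations are where the drop dominates total time.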

Should you bother? Depends on the workload:

Yes for a private offline chat companion you'd use occasionally, a draft-mode writing aid, or a privacy-sensitive note summarizer.
No for any agent loop, coding assistance, RAG over real documents, or anything you'd use multiple times a day. A $700 used RTX 3090 running Qwen 3 14B over Tailscale is a better mobile-friendly setup than even the best phone-side build.

Explore the numbers for your specific stack

Will It Run — Llama 3.2 3B on phone-class hardware →

Plug in your phone's RAM and see whether Llama 3.2 1B / 3B fits with headroom for the OS to still breathe.

Where we got the numbers

The tok/s ranges come from MLC LLM benchmark reports, Private LLM developer notes, and Layla user posts on r/LocalLLaMA (Apr-May 2026). RAM ceilings come from manufacturer specs. We don't have a measured phone-bench harness yet, so treat these numbers as community-reported until we do.

Also see

Llama 3.2 3B →

The current sweet-spot model for phone-class hardware.

Mobile chat apps →

Directory of iOS / Android apps that run a local model.

Ollama →

Run on a desktop, hit it from your phone via Tailscale. Better UX than any phone-native option for serious use.
