RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

On-Device AI

On-device AI means running machine learning models directly on local hardware (CPU, GPU, or NPU) rather than sending data to a remote server for inference. For operators, the model executes entirely on their own machine: no internet dependency once the model is downloaded, no per-query cloud costs, and data never leaves the device. The tradeoff is limited compute and memory. Consumer GPUs cap model size: an 8B-parameter model at Q4 needs ~5 GB of VRAM, while a 70B model needs ~40 GB, which usually means offloading part of it to system RAM. On-device AI prioritizes privacy, latency, and offline capability over the scale of cloud-hosted models.
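The ~5 GB and ~40 GB figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only; the 4.5 bits-per-weight value is an assumption (4-bit formats also store per-block scale data), and the function name is made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Assumption: ~4.5 effective bits per weight for a 4-bit quantization,
# because quantized formats keep some per-block scale metadata.
Q4_BITS = 4.5

for name, size_b in [("8B", 8), ("70B", 70)]:
    fp16 = weight_memory_gb(size_b, 16)
    q4 = weight_memory_gb(size_b, Q4_BITS)
    print(f"{name}: FP16 ~{fp16:.1f} GB, Q4 ~{q4:.1f} GB (weights only)")
```

This counts weights only. The KV cache and runtime buffers come on top, which is why real VRAM use lands somewhat above the raw estimate.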

Deeper dive

On-device AI has become practical thanks to quantization (reducing weight precision from FP16 to 4-bit or even 2-bit) and efficient small architectures (e.g., Gemma 2, Phi-3). On a laptop with an Apple M-series chip, models up to about 7B parameters run at usable speeds via MLX or llama.cpp. On a desktop with an RTX 4090 (24 GB VRAM), 13B models at Q4 fit comfortably, while 70B models require system-RAM offload, which drops throughput from ~40 tok/s for a model that fits in VRAM to ~5 tok/s. Compared with cloud AI, there are no API costs and no rate limits, but also no access to trillion-parameter models. Operators choose on-device AI for sensitive data (medical, legal), offline environments, or low-latency applications such as real-time voice assistants.
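To make the idea of reducing weight precision concrete, here is a minimal sketch of symmetric 4-bit quantization over one small block of weights. It is a deliberately simplified illustration, not the scheme llama.cpp or MLX actually use; the function names and sample values are made up.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical block of weights; real formats quantize blocks of 32 or more.
block = [0.12, -0.53, 0.98, -1.40, 0.07, 0.33, -0.81, 1.05]

codes, scale = quantize_4bit(block)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print("scale:", round(scale, 4))
print("max reconstruction error:", round(max_error, 4))
```

Each weight now needs 4 bits plus a share of the block's scale, instead of 16 bits. The price is a reconstruction error of at most half the scale per weight.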

Practical example

An operator with an RTX 3060 (12 GB VRAM) can run Llama 3.1 8B at Q4_K_M (~5 GB) with a 4K context window at ~20 tok/s. The same model behind a cloud API would cost ~$0.02 per query and add 100 to 500 ms of network latency. Running on-device eliminates that network latency and the recurring cost, but the same operator cannot run Llama 3.1 70B at Q4 (~40 GB) without offloading most of the model to system RAM, which slows inference to ~2 tok/s.
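The numbers in this example can be turned into a quick back-of-the-envelope comparison. The sketch below assumes a 300-token response and 200 queries per day; both are hypothetical workload values, not measurements. The tok/s and per-query cost figures come from the text above.

```python
def generation_seconds(tokens: int, tok_per_s: float, network_ms: float = 0.0) -> float:
    """Time to produce a response: optional network overhead plus decode time."""
    return network_ms / 1000 + tokens / tok_per_s


RESPONSE_TOKENS = 300        # hypothetical response length
QUERIES_PER_DAY = 200        # hypothetical workload
CLOUD_COST_PER_QUERY = 0.02  # USD, figure quoted in the text

local_8b_s = generation_seconds(RESPONSE_TOKENS, 20)   # 8B fully in VRAM
local_70b_s = generation_seconds(RESPONSE_TOKENS, 2)   # 70B with RAM offload
monthly_cloud_cost = CLOUD_COST_PER_QUERY * QUERIES_PER_DAY * 30

print(f"8B on GPU:      {local_8b_s:.0f} s per response")
print(f"70B offloaded:  {local_70b_s:.0f} s per response")
print(f"Cloud API cost: ${monthly_cloud_cost:.2f} per 30 days")
```

The comparison ignores hardware purchase price and electricity, so it shows how the recurring cost scales with usage and should not be read as a full break-even calculation.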

Workflow example

In LM Studio, an operator selects a model (e.g., Phi-3-mini-4k-instruct) and clicks 'Load Model.' If the model fits entirely in GPU memory, inference runs at full speed. If not, part of the model's layers stay in system RAM and run on the CPU; the split is controlled by the GPU offload slider. Because a longer context needs a larger KV cache, the operator can reduce context length to free VRAM for more layers. Ollama behaves similarly: running ollama run llama3.1:8b loads the model into VRAM, and ollama ps shows its memory usage and CPU/GPU split. If VRAM is insufficient, Ollama automatically keeps the remaining layers on the CPU, and tokens/sec drops noticeably.
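The fit-or-offload decision described above can be approximated by hand. The sketch below is an illustration under stated assumptions, not the logic LM Studio or Ollama actually implement: it treats all layers as equal in size and reserves a fixed 1.5 GB of VRAM for the KV cache and buffers (a made-up default). The layer counts (32 for Llama 3.1 8B, 80 for 70B) match the published model configurations.

```python
def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5):
    """Return (layers_on_gpu, layers_on_cpu), assuming equal-sized layers.

    reserve_gb is an assumed allowance for KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    on_gpu = min(n_layers, int(usable_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu


# Llama 3.1 8B at Q4 (~5 GB, 32 layers) on a 12 GB card: everything fits.
print(gpu_layer_split(5, 32, 12))

# Llama 3.1 70B at Q4 (~40 GB, 80 layers) on 12 GB and 24 GB cards:
# only part of the model fits, so the rest runs from system RAM.
print(gpu_layer_split(40, 80, 12))
print(gpu_layer_split(40, 80, 24))
```

A longer context raises the real reserve, which is why shortening the context window can move a few more layers onto the GPU.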

Reviewed by Fredoline Eruo. See our editorial policy.
