
On-Device AI

On-device AI means running machine learning models directly on local hardware (CPU, GPU, NPU) rather than sending data to a remote server for inference. For operators, the model executes entirely on their own machine: inference needs no internet connection, incurs no cloud costs, and data never leaves the device. The tradeoff is limited compute and memory. Consumer GPUs cap model size: an 8B-parameter model at Q4 needs ~5 GB of VRAM, while a 70B model needs ~40 GB, which usually means offloading part of the model to system RAM. On-device AI prioritizes privacy, latency, and offline capability over the massive scale of cloud-hosted models.
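The memory figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate for the weights alone, assuming an effective 4.5 bits per weight for a Q4-style quantization (an assumption; real quantized files vary by format, and the KV cache and runtime buffers add more on top). The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed effective bit widths: 16 for FP16, 4.5 for a Q4-style quantization.
for name, params in [("8B", 8.0), ("70B", 70.0)]:
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4.5)
    print(f"{name}: FP16 ~{fp16:.0f} GB, Q4 ~{q4:.1f} GB")
```

Under these assumptions an 8B model lands near 4.5 GB and a 70B model near 39 GB, in line with the ~5 GB and ~40 GB figures quoted above.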

Deeper dive

On-device AI has become practical thanks to quantization (reducing weight precision from FP16 to 4-bit or even 2-bit) and compact, efficient model families (e.g., Gemma 2, Phi-3). On a laptop with an Apple M-series chip, models up to 7B parameters run at usable speeds via MLX or llama.cpp. On a desktop with an RTX 4090 (24 GB VRAM), 13B models at Q4 fit comfortably, while 70B models require system-RAM offload, which can drop throughput from ~40 tok/s for a model that fits in VRAM to ~5 tok/s. Compared with cloud AI, on-device AI has no API costs and no rate limits, but also no access to trillion-parameter models. Operators choose on-device AI for sensitive data (medical, legal), offline environments, or low-latency applications like real-time voice assistants.
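One way to see why offloading slows generation is a bandwidth rule of thumb: producing each token reads roughly every weight once, so speed is bounded by how fast memory can be read. The sketch below is a back-of-the-envelope upper bound, not a benchmark. The bandwidth figures (about 1000 GB/s for GPU memory, about 60 GB/s for system RAM) and the 20 GB / 20 GB split are illustrative assumptions, and the function name is hypothetical. Real throughput is lower.

```python
def est_tokens_per_sec(gpu_gb: float, cpu_gb: float,
                       gpu_bw_gbps: float, cpu_bw_gbps: float) -> float:
    """Upper-bound decode speed if every weight is read once per token."""
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


GPU_BW = 1000.0  # GB/s, assumed high-end GPU memory bandwidth
RAM_BW = 60.0    # GB/s, assumed desktop system-RAM bandwidth

# A 40 GB model held entirely in fast memory (hypothetical large-VRAM GPU).
print(f"all in VRAM: {est_tokens_per_sec(40, 0, GPU_BW, RAM_BW):.1f} tok/s")

# The same model split evenly between VRAM and system RAM.
print(f"half offloaded: {est_tokens_per_sec(20, 20, GPU_BW, RAM_BW):.1f} tok/s")
```

The portion held in system RAM dominates the time per token, which is why even a partial offload costs most of the speed.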

Practical example

An operator with an RTX 3060 (12 GB VRAM) can run Llama 3.1 8B at Q4_K_M (~5 GB) with a 4K context window, achieving ~20 tok/s. The same model on a cloud API would cost ~$0.02 per query and add 100-500 ms of network latency. On-device AI removes that network latency and the recurring cost, but the operator cannot run Llama 3.1 70B at Q4 (~40 GB) without offloading to system RAM, which slows inference to ~2 tok/s.
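The context window in this example also consumes VRAM through the KV cache, on top of the weights. The sketch below estimates that cost, assuming an FP16 cache and the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128). These shape values and the helper name `kv_cache_bytes` are not from the text above; they are assumptions for illustration, and runtimes that quantize the cache will use less.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for context_len tokens."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Llama 3.1 8B shape (grouped-query attention with 8 KV heads).
llama31_8b = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768):
    gib = kv_cache_bytes(context_len=ctx, **llama31_8b) / 2**30
    print(f"context {ctx}: ~{gib:.1f} GiB KV cache")
```

Under these assumptions a 4K context adds about 0.5 GiB, so the ~5 GB model still fits easily in 12 GB, while a 32K context would add about 4 GiB.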

Workflow example

In LM Studio, an operator selects a model (e.g., Phi-3-mini-4k-instruct) and clicks 'Load Model.' The app checks VRAM: if the model fits entirely in GPU memory, inference runs at full speed. If not, it offloads layers to system RAM, which is visible in the 'Offloaded Layers' slider. The operator can also reduce the context length to shrink the KV cache, freeing VRAM so that more layers stay on the GPU. Similarly, in Ollama, ollama run llama3.1:8b loads the model into VRAM, and ollama ps shows memory usage and how the model is split between CPU and GPU. If VRAM is insufficient, Ollama automatically offloads, and tokens/sec drops noticeably.
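The fit-or-offload decision described above can be approximated by hand. The sketch below is a simplified estimate, not the logic LM Studio or Ollama actually use: it assumes all layers are the same size, reserves a hypothetical 1.5 GB of VRAM for the KV cache and runtime buffers, and uses the published layer counts for Llama 3.1 (32 for 8B, 80 for 70B). The function name `split_layers` is hypothetical.

```python
def split_layers(model_gb: float, n_layers: int, vram_gb: float,
                 reserved_gb: float = 1.5) -> tuple[int, int]:
    """Return (layers on GPU, layers offloaded to system RAM)."""
    per_layer_gb = model_gb / n_layers            # assumes equal-sized layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)   # VRAM left for weights
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


# RTX 3060 (12 GB) with the two models from the practical example.
print(split_layers(model_gb=5.0, n_layers=32, vram_gb=12.0))   # 8B at Q4
print(split_layers(model_gb=40.0, n_layers=80, vram_gb=12.0))  # 70B at Q4
```

With these assumptions the 8B model fits entirely on the GPU, while only about a quarter of the 70B model's layers do, which matches the sharp slowdown described for the offloaded case.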

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work