Edge AI
Edge AI refers to running machine learning models locally on consumer hardware (laptops, phones, desktop GPUs) rather than sending data to a cloud server. For local AI operators, this means models execute entirely on-device using runtimes like llama.cpp or Ollama, with inference latency determined by local hardware (GPU speed and memory bandwidth) rather than network round-trips. Edge AI matters because it enables offline use, lower latency, and data privacy. The trade-off is that model size is capped by available VRAM: for example, 8 GB of VRAM limits you to roughly 7B-parameter models at 4-bit quantization (Q4), since the weights, the context cache, and runtime buffers all have to fit.
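The VRAM limit above comes down to simple arithmetic, which can be sketched as follows. This is a rough estimate, not how any particular runtime measures memory: the 4.8 bits-per-weight figure for Q4-class quantization and the 1.5 GB overhead for context cache and buffers are assumptions chosen for illustration, and the function names are hypothetical.

```python
def estimate_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(n_params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM."""
    needed_gb = estimate_weight_gb(n_params_billion, bits_per_weight) + overhead_gb
    return needed_gb <= vram_gb


# Assumed average of ~4.8 bits per weight for a Q4-class quantization.
Q4_BITS = 4.8

for size_b in (7, 13, 70):
    weights = estimate_weight_gb(size_b, Q4_BITS)
    verdict = "fits" if fits_in_vram(size_b, Q4_BITS, vram_gb=8) else "does not fit"
    print(f"{size_b}B at Q4: ~{weights:.1f} GB of weights, {verdict} in 8 GB")
```

Under these assumptions a 7B model needs about 4.2 GB for weights and fits in 8 GB, while a 13B model does not.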
Deeper dive
Edge AI contrasts with cloud AI, where inference happens on remote servers. The key operator-relevant distinction is the hardware ceiling: edge devices have a fixed amount of memory (e.g., 8-24 GB of VRAM on most consumer GPUs, or unified memory shared between CPU and GPU on Apple M-series) and limited compute. This forces trade-offs: smaller models, aggressive quantization (Q4_K_M, Q3_K_S), and shorter context windows, because the key-value cache grows with context length and competes with the weights for the same memory. Tools like Ollama, LM Studio, and MLX are designed for edge deployment: they handle model loading, offloading, and prompt processing with no internet dependency once the model has been downloaded. Edge AI also includes on-device training (fine-tuning with LoRA on a single GPU), but inference is the primary use case. The term gained traction as capable models shrank (e.g., Llama 3.1 8B is small enough at 4-bit quantization to run on some high-end phones) and hardware improved (e.g., RTX 5090 with 32 GB VRAM).
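To see why context length is one of the trade-offs, here is a minimal sketch of the standard key-value cache size calculation. The layer and head counts are those of the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128); the 16-bit cache precision is an assumption, and runtimes that quantize the cache will use less.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Llama 3.1 8B attention shape; cache assumed to be stored in 16-bit floats.
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(context_len=ctx, **LLAMA_3_1_8B) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

With these numbers the cache takes 0.5 GiB at a 4K context and 16 GiB at the full 128K context, which is why long contexts can exceed VRAM even when the weights fit comfortably.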
Practical example
An operator with an RTX 3060 (12 GB VRAM) runs Llama 3.1 8B at Q4_K_M (about 5 GB) with a 4K context window. That's edge AI: the model stays entirely on the GPU and inference runs at roughly 30 tok/s. If they try Llama 3.1 70B at Q4 (about 40 GB), the weights no longer fit, so the runtime must keep most layers in system RAM and run them on the CPU, dropping throughput to around 3 tok/s. This is still edge AI, but with degraded performance. On an Apple M2 Max with 64 GB unified memory, the same 70B model fits entirely in memory and runs at up to about 10 tok/s, a better edge experience. All throughput figures here are approximate and vary with runtime, quantization, and context length.
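The offloading in this example can be approximated with a back-of-the-envelope split of layers between GPU and CPU. The sketch below assumes every layer is the same size (ignoring embeddings and the output head) and reserves 2 GB of VRAM for the context cache and buffers; both are simplifying assumptions, and the function name is hypothetical. The 80-layer count is from the published Llama 3.1 70B configuration.

```python
def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> tuple[int, int]:
    """Return (layers on GPU, layers left in system RAM)."""
    per_layer_gb = model_gb / n_layers          # assumes uniform layer size
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep room for cache and buffers
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


# RTX 3060 with 12 GB of VRAM, using the model sizes from the example above.
print(gpu_layer_split(model_gb=5, n_layers=32, vram_gb=12))   # 8B: all layers on GPU
print(gpu_layer_split(model_gb=40, n_layers=80, vram_gb=12))  # 70B: most layers on CPU
```

Under these assumptions only 20 of the 70B model's 80 layers fit on the 12 GB card, so three quarters of each forward pass runs on the CPU, which is consistent with the large drop in tokens per second.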
Workflow example
When an operator downloads a model via ollama pull llama3.1:8b and runs ollama run llama3.1:8b, they are executing edge AI. After the download, prompts and outputs never leave their machine. In LM Studio, loading a model and clicking 'Start Server' serves it from local memory; if VRAM is insufficient, some or all layers are kept in system RAM and run on the CPU instead. In MLX on Apple Silicon, mlx_lm.generate --model path/to/model runs on the GPU using unified memory, with no cloud dependency (MLX targets the GPU and CPU, not the Neural Engine).
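Once a model is running under Ollama, it can also be called from code through Ollama's local HTTP API. The following is a minimal sketch using only the Python standard library; it assumes Ollama is listening on its default address (localhost, port 11434) and that llama3.1:8b has already been pulled. The helper names build_request and generate are hypothetical.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Explain edge AI in one sentence.") returns the model's reply as a string. The request goes to localhost only, so the prompt stays on the machine.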
Reviewed by Fredoline Eruo.