RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

Edge AI

Edge AI means running machine learning models locally on hardware you control (laptops, phones, desktop GPUs) rather than sending data to a cloud server. For local AI operators, models execute entirely on-device through runtimes like llama.cpp or Ollama, so inference latency depends on local compute (VRAM, GPU speed) rather than network round-trips. Edge AI matters because it enables offline use, lower latency, and data privacy, but it also caps model size at what fits in available VRAM (e.g., 8 GB of VRAM limits you to roughly 7B-parameter models at Q4).
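The "8 GB fits roughly 7B at Q4" rule of thumb can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: the 4.8 bits per weight figure for Q4-class quantization and the 1.5 GB allowance for KV cache and runtime buffers are assumptions chosen for illustration, and the function names are hypothetical.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # params (in billions) * bits per weight / 8 bits per byte -> gigabytes
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    needed = weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


# Assumed ~4.8 bits/weight for a Q4-class quant; real files vary by format.
for size in (7, 13, 70):
    gb = weight_footprint_gb(size, 4.8)
    print(f"{size}B at ~Q4: about {gb:.1f} GB, fits in 8 GB: {fits_in_vram(size, 4.8, 8)}")
```

Actual file sizes depend on the quantization format and model architecture, so treat the result as a first filter before checking the real download size.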

Deeper dive

Edge AI contrasts with cloud AI, where inference happens on remote servers. The key operator-relevant distinction is the hardware ceiling: edge devices have fixed memory (e.g., 8-32 GB of VRAM on consumer GPUs, or unified memory on Apple M-series) and limited compute. This forces trade-offs: smaller models, aggressive quantization (Q4_K_M, Q3_K_S), and shorter context windows. Runtimes and frameworks like Ollama, LM Studio, and MLX are designed for edge deployment: they handle model loading, offloading, and prompt processing with no internet dependency once the model is downloaded. Edge AI also includes on-device training (fine-tuning with LoRA on a single GPU), but inference is the primary use case. The term gained traction as models shrank (e.g., a quantized Llama 3.1 8B can fit on a high-memory phone) and hardware improved (e.g., RTX 5090 with 32 GB VRAM).
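Context length is a memory constraint because the KV cache grows linearly with the number of tokens held in context. The sketch below estimates that growth for a grouped-query-attention model. The layer count, KV head count, and head dimension are the commonly published Llama 3.1 8B values, used here as assumptions, along with a 16-bit cache; the function names are hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Same estimate expressed in GiB (2**30 bytes)."""
    return kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                          context_len, bytes_per_elem) / 2**30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
for ctx in (4096, 32768, 131072):
    print(f"context {ctx}: about {kv_cache_gib(32, 8, 128, ctx):.1f} GiB of KV cache")
```

Under these assumptions the cache is small at 4K tokens but grows to several times the size of the quantized weights at very long contexts, which is why runtimes often default to short windows on small GPUs.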

Practical example

An operator with an RTX 3060 (12 GB VRAM) runs Llama 3.1 8B at Q4_K_M (about 5 GB) with a 4K context window. That's edge AI: the model stays entirely on the GPU, and inference runs at ~30 tok/s. If they try Llama 3.1 70B at Q4 (about 40 GB), the runtime must offload most layers to system RAM, and throughput drops to ~3 tok/s. This is still edge AI, but with degraded performance. On an Apple M2 Max with 64 GB unified memory, the same 70B model fits entirely in memory and runs at ~10 tok/s, a better edge experience.
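The offload split in this example can be approximated by dividing the model evenly across its layers and counting how many fit in VRAM. This is a rough sketch: it assumes all layers are the same size, reserves 2 GB for KV cache and buffers (an illustrative figure), and uses 80 layers for the 70B model and 32 for the 8B model as assumed values. The function name is hypothetical; the result is the kind of number an operator might pass to llama.cpp's --n-gpu-layers option.

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # leave room for cache and buffers
    return min(n_layers, int(usable_gb // per_layer_gb))


# 70B at Q4 (about 40 GB, assumed 80 layers) on a 12 GB card: partial offload.
print(gpu_layers(40, 80, 12))
# 8B at Q4_K_M (about 5 GB, assumed 32 layers) on the same card: everything fits.
print(gpu_layers(5, 32, 12))
```

When the estimate is well below the total layer count, most of the model runs from system RAM, which is consistent with the large throughput drop described above.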

Workflow example

When an operator downloads a model via ollama pull llama3.1:8b and runs ollama run llama3.1:8b, they are executing edge AI. After the download, prompts and outputs never leave their machine. In LM Studio, selecting a model and clicking 'Start Server' loads it into local VRAM; if VRAM is insufficient, the app may warn and keep some layers on the CPU, depending on the GPU offload setting. In MLX on Apple Silicon, mlx_lm.generate --model path/to/model runs on the GPU using unified memory (MLX does not target the Neural Engine), with no cloud dependency.
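The same local workflow can be driven from a script, since Ollama exposes an HTTP API on the local machine. The sketch below assumes Ollama is running on its default port (11434) and that llama3.1:8b has already been pulled. The helper names build_request and generate are hypothetical; only the /api/generate endpoint and its model, prompt, and stream fields come from Ollama itself.

```python
import json
import urllib.request

# Default local Ollama endpoint; nothing here contacts a remote server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str,
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, url: str = OLLAMA_URL,
             timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, url),
                                timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("llama3.1:8b", "Say hello in five words.") returns the completion as a string. The request goes to localhost only, which is the property that makes this edge AI rather than a cloud call.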

Reviewed by Fredoline Eruo. See our editorial policy.
