NPU (Neural Processing Unit)
A Neural Processing Unit (NPU) is a specialized hardware accelerator designed to execute neural network operations efficiently, typically integrated into modern laptop processors and mobile SoCs. Unlike GPUs, which handle parallel compute broadly, NPUs are optimized for low-power, low-latency inference of small to medium models. In local AI workflows, an NPU can take over inference from the GPU or CPU, reducing power draw and leaving GPU memory free for larger models. However, NPUs often support only specific model formats (e.g., ONNX, TFLite) and lack the flexibility of GPU backends like CUDA or ROCm, so they are rarely used to run general-purpose LLMs via llama.cpp or Ollama.
Deeper dive
NPUs are dedicated silicon blocks that accelerate the matrix multiplications and activation functions common in neural networks. They trade general-purpose compute for energy efficiency, often achieving higher TOPS per watt than GPUs. In practice, NPUs appear as Apple's Neural Engine (ANE), the NPU in Intel Core Ultra processors (programmed through the OpenVINO toolkit), Qualcomm's Hexagon NPU, and AMD's XDNA. For local AI operators, NPUs are most relevant on laptops and mobile devices where battery life matters. However, most open-source LLM runtimes (llama.cpp, Ollama) do not natively support these NPUs; they rely on GPU or CPU backends. Exceptions include Apple's Core ML (which uses the ANE for some models) and Intel's OpenVINO runtime. NPU support is growing, but operators should verify compatibility before expecting acceleration.
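One way to verify compatibility on Intel hardware is to ask the OpenVINO runtime which devices it can see. The following is a minimal sketch, assuming the optional `openvino` Python package (2023.1 or later) may or may not be installed. The helper names `list_openvino_devices` and `pick_device` are hypothetical, and the NPU-first preference order is only an illustrative choice.

```python
def list_openvino_devices():
    """Return device names reported by OpenVINO, or [] if the package is missing."""
    try:
        import openvino as ov  # optional dependency, assumed 2023.1+
    except ImportError:
        return []
    return list(ov.Core().available_devices)


def pick_device(available, preference=("NPU", "GPU", "CPU")):
    """Pick the first preferred device family present in `available`.

    Device names may carry an index suffix such as "GPU.0", so the match is
    made on the part before the dot.
    """
    for family in preference:
        for name in available:
            if name.split(".")[0] == family:
                return name
    return None


if __name__ == "__main__":
    devices = list_openvino_devices()
    print("OpenVINO devices:", devices or "none found")
    print("Selected:", pick_device(devices))
```

If "NPU" does not appear in the reported list, the runtime cannot use it, whatever the hardware specification says. A missing driver is one common cause.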
Practical example
On an Apple M1 MacBook Air, the 16-core Neural Engine can run a quantized MobileNet at ~30 ms per inference while drawing under 1 W, compared to ~10 ms on the GPU at 5 W. The Neural Engine is slower per inference here, but it uses less energy for each one. For LLMs, however, the ANE is limited: llama.cpp does not use it, and Core ML is in practice suited only to models up to ~3B parameters. An operator running Llama 3.2 3B via MLX on an M1 will use the GPU, not the NPU, because MLX targets the GPU and CPU.
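The trade-off in these figures is easier to see as energy per inference, which is power multiplied by latency. The short calculation below uses only the numbers quoted above and treats "under 1 W" as an upper bound of 1 W; the function name `energy_mj` is a hypothetical helper.

```python
def energy_mj(latency_ms, power_w):
    """Energy per inference in millijoules (watts x milliseconds = millijoules)."""
    return latency_ms * power_w


# Figures quoted in the text; the NPU power is an upper bound ("under 1 W").
npu = energy_mj(latency_ms=30, power_w=1.0)
gpu = energy_mj(latency_ms=10, power_w=5.0)

print(f"NPU: at most {npu:.0f} mJ per inference")
print(f"GPU: about {gpu:.0f} mJ per inference")
print(f"GPU uses at least {gpu / npu:.2f}x the energy of the NPU")
```

On these numbers the GPU finishes three times sooner but spends about 50 mJ per inference against at most 30 mJ on the Neural Engine, which is why the NPU matters most when running on battery.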
Workflow example
When using LM Studio on a Windows laptop with an Intel Core Ultra (Meteor Lake) processor, the operator can enable OpenVINO as the backend where the installed version offers it. Under Settings > Engine, selecting OpenVINO routes inference to the integrated NPU for supported models (e.g., Intel-optimized ONNX models). The operator will see lower power consumption but may encounter unsupported ops, which fall back to the CPU. llama.cpp does not target this NPU, so operators using it should stick to GPU or CPU backends.
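The same fallback behavior can be illustrated outside LM Studio with ONNX Runtime, which takes an ordered list of execution providers and assigns any operator the earlier ones cannot handle to a later one. This is a sketch, assuming the optional `onnxruntime` package; `order_providers` and `available_providers` are hypothetical helpers, and the vendor comments reflect the usual pairing rather than anything stated in this article. Targeting the NPU specifically usually also needs provider-specific options, which are omitted here.

```python
# Execution providers commonly associated with NPUs, in an arbitrary order.
NPU_PROVIDERS = (
    "OpenVINOExecutionProvider",  # Intel
    "QNNExecutionProvider",       # Qualcomm
    "VitisAIExecutionProvider",   # AMD
)


def available_providers():
    """Return the providers ONNX Runtime was built with, or CPU only if it is missing."""
    try:
        import onnxruntime as ort  # optional dependency
    except ImportError:
        return ["CPUExecutionProvider"]
    return ort.get_available_providers()


def order_providers(available):
    """Put any NPU-capable provider first and always end with the CPU fallback."""
    chosen = [p for p in NPU_PROVIDERS if p in available]
    chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    providers = order_providers(available_providers())
    print("Provider order:", providers)
    # With a real model file (hypothetical path), the list would be passed as:
    #   onnxruntime.InferenceSession("model.onnx", providers=providers)
```

Keeping the CPU provider last means a model with unsupported ops still runs, only with part of the graph executed off the NPU.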
Reviewed by Fredoline Eruo. See our editorial policy.