Hardware & infrastructure

GPU

A GPU (Graphics Processing Unit) is a specialized processor designed for parallel computation, originally for graphics but now essential for running neural networks. In local AI, the GPU accelerates model inference and training by handling matrix operations in parallel. The key operator-relevant spec is VRAM (video memory), which determines which models fit: a 16 GB GPU can run Llama 3.1 8B at Q4 (~5 GB) with room for context, but a 70B model at Q4 (~40 GB) requires a 48 GB card or system-RAM offload. Offloading is slow: expect roughly 3-5 tokens/sec, compared with ~40 tokens/sec for a model that fits entirely in VRAM.
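As a rough check on the sizes quoted above, weight memory can be estimated as parameter count times bits per weight. The sketch below assumes an average of about 4.8 bits per weight for Q4-style GGUF files and a flat 2 GB allowance for KV cache and buffers. Both numbers and the function names are illustrative assumptions, not measured values.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus a flat allowance for KV cache and buffers fit."""
    return weights_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # 4.8 bits/weight is an assumed average for Q4-style quantization.
    for name, params in [("8B", 8.0e9), ("70B", 70.6e9)]:
        size = weights_gb(params, 4.8)
        ok = fits_in_vram(params, 4.8, vram_gb=16)
        print(f"{name}: ~{size:.1f} GB of weights, fits in 16 GB: {ok}")
```

Real GGUF files vary by quantization variant, so treat the result as a first filter and confirm against the actual file size before buying hardware.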

Deeper dive

GPUs contain thousands of small cores optimized for parallel floating-point operations, making them far faster than CPUs for the matrix multiplications in transformer models. For local AI, the GPU's VRAM is the primary constraint: it must hold the model weights, KV cache, and intermediate activations. Quantization reduces the memory footprint (e.g., Q4 uses roughly 4 bits per weight), allowing larger models to fit. Apple's M-series chips use unified memory, where the GPU and CPU share the same pool, so most of the system RAM is available to the GPU (macOS reserves a share for the system by default). This allows running models like Llama 3.1 70B Q4 on a 64 GB M-series Mac, though at lower tokens/sec than a dedicated GPU. On Windows/Linux, CUDA (NVIDIA) and ROCm (AMD) are the primary runtimes; llama.cpp has backends for both, plus Vulkan as a cross-vendor option and Metal on Apple hardware. Operators should monitor VRAM usage with tools like nvidia-smi (NVIDIA) or radeontop (AMD) to avoid out-of-memory errors.
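For scripted monitoring on NVIDIA hardware, nvidia-smi can emit machine-readable CSV. The following is a minimal sketch that parses that output. The helper names are hypothetical, the sample line is made up for illustration, and some devices may report non-numeric values that this parser does not handle.

```python
import subprocess

# Query flags provided by nvidia-smi; values are reported in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_vram(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines into one (used_mib, total_mib) pair per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def read_vram() -> list[tuple[int, int]]:
    """Run nvidia-smi and return per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(out.stdout)


if __name__ == "__main__":
    sample = "5321, 24576\n"  # made-up output for a single 24 GB card
    for used, total in parse_vram(sample):
        print(f"{used} / {total} MiB ({used / total:.0%} used)")
```

Calling read_vram() before and after loading a model gives a quick measure of how much headroom remains for context.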

Practical example

A rig with an RTX 3090 (24 GB VRAM) can run Llama 3.1 8B at Q4_K_M (~5 GB) with 32K context (about 4 GB of KV cache at FP16, or about 2 GB with an 8-bit cache), leaving room for batch processing. The same GPU cannot load Llama 3.1 70B Q4_K_M (~40 GB) without offloading to system RAM, which drops throughput to ~5 tok/s, far below the ~40 tok/s of a model held entirely in VRAM. An RX 7900 XTX (24 GB) behaves similarly under ROCm. An M2 Ultra with 192 GB unified memory can run the 70B model entirely in GPU-accessible memory, achieving around 15 tok/s.
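The KV cache share of that budget depends on context length and cache precision. A minimal sketch of the usual estimate follows, using the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). The function name is hypothetical, and real runtimes add some buffer overhead on top.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for n_tokens of context."""
    # The factor 2 covers one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for label, width in [("FP16", 2), ("8-bit", 1)]:
        size = kv_cache_bytes(**LLAMA_3_1_8B, n_tokens=32 * 1024,
                              bytes_per_elem=width)
        print(f"32K context, {label} cache: {size / 2**30:.1f} GiB")
```

At FP16 this works out to 128 KiB per token, so the cache grows linearly with context and can rival the weights themselves at long context lengths.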

Workflow example

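When running llama-cli -m model.gguf -ngl 99, the -ngl (--n-gpu-layers) flag sets how many layers to offload to the GPU; 99 exceeds the layer count of most models, so it requests full offload. If VRAM is insufficient, behavior depends on the llama.cpp version: the load may fail with an out-of-memory error, in which case lower -ngl until the model loads and the remaining layers run on the CPU. Check the load log for the offload summary; a line like 'offloaded 0/80 layers to GPU' means the model is running on the CPU only. In LM Studio, the 'GPU Offload' slider controls this. In Ollama, ollama run llama3.1:70b automatically uses the GPU if available; run ollama ps and read the PROCESSOR column to see how the loaded model is split between CPU and GPU.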
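To pick a starting value for -ngl on a card that cannot hold the whole model, one rough approach is to divide the VRAM budget by the average layer size. The sketch below assumes equal-sized layers and a 3 GB reserve for KV cache and buffers. Both are simplifying assumptions, and the function name is hypothetical.

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 3.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget_gb / per_layer_gb))


if __name__ == "__main__":
    # A ~40 GB, 80-layer model on a 24 GB card.
    ngl = layers_that_fit(model_gb=40, n_layers=80, vram_gb=24)
    print(f"llama-cli -m model.gguf -ngl {ngl}")
```

Treat the result as a first guess: if loading fails, step the value down, and if VRAM is left over, step it up.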

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work