Core concepts & fields

Inference (logical)

Inference is the process of running a trained model on input data to generate an output: the "forward pass" that produces predictions, text, or classifications. For local AI operators, inference is what happens when you send a prompt to a model and get a response. It contrasts with training, which updates model weights. Inference is the operational phase: the weights stay fixed, and the runtime (llama.cpp, Ollama, vLLM) loads them into memory, ideally VRAM, and computes the output token by token. The key metrics are latency (how long you wait for a response to start) and throughput (tokens/sec); both are constrained by VRAM capacity, memory bandwidth, and quantization level.
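As a rough illustration of the two metrics named above, the sketch below times any stream of tokens and reports time to first token and decode throughput. The function name measure_stream and the stand-in generator fake_runtime are hypothetical and not part of any runtime's API; a real setup would wrap the streaming output of the runtime you actually use.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report latency and throughput."""
    start = clock()
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for _ in tokens:
        last = clock()          # timestamp of the token that just arrived
        if first is None:
            first = last        # first token marks the end of the initial wait
        count += 1

    if count == 0:
        return {"tokens": 0, "ttft_s": None, "decode_tok_per_s": None}

    decode_span = last - first
    # Throughput is measured over the tokens that followed the first one.
    rate = (count - 1) / decode_span if count > 1 and decode_span > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tok_per_s": rate}


def fake_runtime(n: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a runtime that streams n tokens."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_runtime(5, 0.01)))
```

Separating the wait for the first token from the steady decode rate is a design choice: prompt processing and token generation stress the hardware differently, so a single averaged number can hide which one is slow.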

Practical example

On an RTX 4090 (24 GB VRAM), Llama 3.1 8B at Q4_K_M (~5 GB) can run at roughly 80 tok/s. The same model on an RTX 3060 (12 GB) might run at ~30 tok/s, mainly because of its lower memory bandwidth. If the model does not fit in VRAM, the runtime offloads some layers to system RAM, and speed can drop to ~3-5 tok/s; the operator experiences this as sluggish responses.
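The bandwidth argument above can be made concrete with a back-of-envelope ceiling. The sketch assumes that decoding is memory-bandwidth bound and that every generated token requires reading all model weights once, so the upper limit is bandwidth divided by model size. The bandwidth figures are the vendors' published peak numbers rather than values from this entry, and the function name is hypothetical. Real throughput lands well below the ceiling because of KV-cache reads, compute, and runtime overhead.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: published peak memory bandwidth in GB/s, not measured values.
GPU_BANDWIDTH_GB_S = {
    "RTX 4090": 1008.0,
    "RTX 3060 12GB": 360.0,
}

MODEL_SIZE_GB = 5.0  # approximate Q4_K_M size of Llama 3.1 8B quoted in the text

for name, bandwidth in GPU_BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_per_s(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: ceiling of about {ceiling:.0f} tok/s")
```

Under these assumptions the ratio between the two cards (about 2.8x) is close to the ratio between the observed speeds quoted above, which is consistent with bandwidth being the main limiter.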

Workflow example

When you run ollama run llama3.1:8b and type a prompt, Ollama loads the model into VRAM (if available) and performs inference: it tokenizes the input, runs the forward pass through the transformer layers, and decodes tokens one at a time. In llama.cpp, the --n-gpu-layers flag (short form -ngl) controls how many layers are offloaded to the GPU; setting it too high with limited VRAM causes out-of-memory errors.
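The tokenize, forward pass, decode sequence described above can be sketched with a toy model. Everything here is hypothetical and for illustration only: the five-word vocabulary, the whitespace tokenizer, and toy_forward, which stands in for the transformer and simply scores the next vocabulary entry highest. The loop picks the top-scoring token each step (greedy decoding), which is an assumption; real runtimes usually sample using settings such as temperature.

```python
from typing import Callable, Dict, List

VOCAB: List[str] = ["<eos>", "hello", "world", "local", "models"]
TOKEN_ID: Dict[str, int] = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text: str) -> List[int]:
    """Toy whitespace tokenizer; real runtimes use subword tokenizers."""
    return [TOKEN_ID[word] for word in text.lower().split()]


def toy_forward(ids: List[int]) -> List[float]:
    """Stand-in for the forward pass: one score per vocabulary entry."""
    next_id = (ids[-1] + 1) % len(VOCAB)  # favour the entry after the last token
    return [1.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def generate(
    prompt: str,
    forward: Callable[[List[int]], List[float]] = toy_forward,
    max_new_tokens: int = 8,
) -> str:
    ids = tokenize(prompt)
    if not ids:
        raise ValueError("prompt must contain at least one known token")
    generated: List[str] = []
    for _ in range(max_new_tokens):
        scores = forward(ids)                 # one forward pass per new token
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if next_id == TOKEN_ID["<eos>"]:
            break                             # end-of-sequence stops generation
        ids.append(next_id)                   # the new token becomes part of the input
        generated.append(VOCAB[next_id])
    return " ".join(generated)


print(generate("hello"))
```

Each loop iteration feeds the full sequence so far back into the model, which is why generation cost grows with output length and why throughput is reported per token.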

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work