Real-Time Inference

Real-time inference means the model processes input and returns output fast enough to feel instantaneous to a human user, which typically means the response begins within 200–500 milliseconds and then streams at a steady pace. For local AI, this is the difference between a chatbot that replies as you type and one that stalls for seconds. Achieving real-time inference on consumer hardware requires balancing model size, quantization level, context length, and token generation speed (tokens per second). Operators targeting real-time often use 7B–13B parameter models at 4-bit or 8-bit quantization, and keep context windows under 8K tokens so that the weights and the KV cache together stay within VRAM limits.
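The VRAM trade-off can be estimated before downloading anything. The sketch below is a rough calculation, not a measurement: the model shape (32 layers, 32 KV heads, head dimension 128, as in a Llama-2-7B-style architecture), the 4.5 bits per weight figure for a 4-bit quantization, the 16-bit KV cache, and the 1 GiB headroom are all assumptions chosen for illustration. Real runtimes add buffers and overhead that this ignores.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits_in_vram(total_bytes: float, vram_gib: float, headroom_gib: float = 1.0) -> bool:
    """True if the estimate fits while leaving headroom for runtime buffers."""
    return total_bytes <= (vram_gib - headroom_gib) * GIB


# Assumed: 7B parameters at ~4.5 bits per weight, 8K context, 16-bit KV cache.
weights = weight_bytes(7e9, 4.5)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=8192)
total = weights + kv

print(f"weights:  {weights / GIB:.2f} GiB")
print(f"KV cache: {kv / GIB:.2f} GiB")
print(f"total:    {total / GIB:.2f} GiB")
for vram in (8, 12, 24):
    print(f"fits in {vram} GiB card: {fits_in_vram(total, vram)}")
```

Under these assumptions the KV cache at 8K context is about as large as the weights themselves, which is why shrinking the context window is often as effective as picking a smaller model.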

Deeper dive

Real-time inference is not a fixed speed; the target depends on the use case. For voice assistants, latency must be under 300 ms to avoid awkward pauses. For code autocomplete, sub-100 ms per suggestion is expected. For chatbots, 10–20 tokens per second (tok/s) feels fluid. On local hardware, token generation is usually limited by memory bandwidth, because every new token requires reading the full set of weights, while prompt processing (prefill) is limited by compute. A 7B model at Q4_K_M on an RTX 4090 generates ~100 tok/s, well into real-time. The same model on an Apple M1 MacBook Air (7-core GPU) runs ~15 tok/s, which is acceptable for chat but not for rapid iteration. Operators must also account for prefill time, which adds to first-token latency and grows with prompt length. Techniques like speculative decoding, KV-cache quantization, and prompt caching help reduce latency with little or no loss in quality.
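The memory-bandwidth limit gives a quick ceiling on generation speed: if every generated token has to read all of the weights once, tok/s cannot exceed bandwidth divided by model size. The sketch below applies that rule of thumb. The bandwidth figures (about 1008 GB/s for the RTX 4090 and about 68 GB/s for the M1) are published peak specifications used here as assumptions, and the 4 GB model size is an approximation for a 7B model at 4-bit quantization. Measured speeds land below the ceiling, often well below.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes each generated token reads every weight exactly once,
    so tokens/s <= memory bandwidth / model size.
    """
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.0  # assumed size of a 7B model at roughly 4-bit quantization

# Assumed peak memory bandwidth in GB/s (vendor specifications, not measurements).
devices = {
    "RTX 4090": 1008.0,
    "Apple M1": 68.0,
}

for name, bandwidth in devices.items():
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

The ~100 tok/s and ~15 tok/s figures quoted above both sit under these ceilings, which is consistent with generation being bandwidth-limited rather than compute-limited.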

Practical example

A 13B model at Q4 on an RTX 3060 12GB generates ~15 tok/s, which is borderline for real-time chat. Dropping to a 7B model at Q4_K_M on the same card yields ~40 tok/s, which feels responsive. On an Apple M2 Max (38-core GPU), a 7B Q4 model runs ~30 tok/s, sufficient for real-time use. If the operator needs real-time code completion, a model in the 1B–2B range (e.g., DeepSeek-Coder 1.3B) at Q8 on an RTX 3060 can hit ~100 tok/s. At that rate each token takes about 10 ms, so a short completion of up to roughly 10 tokens fits the sub-100 ms requirement, provided prefill time is small.
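The relationship between tok/s and a latency budget is simple arithmetic, sketched below. The function names are hypothetical, and the 30 ms prefill value in the last line is an illustrative assumption rather than a measured figure; the tok/s values are the ones quoted in this section.

```python
def ms_per_token(tok_s: float) -> float:
    """Average time to generate one token, in milliseconds."""
    return 1000.0 / tok_s


def response_latency_ms(n_tokens: int, tok_s: float, prefill_ms: float = 0.0) -> float:
    """Total time for a response: prefill plus generation of n_tokens."""
    return prefill_ms + n_tokens * ms_per_token(tok_s)


def tokens_within_budget(budget_ms: float, tok_s: float, prefill_ms: float = 0.0) -> int:
    """How many whole tokens can be generated before the budget runs out."""
    remaining_ms = budget_ms - prefill_ms
    if remaining_ms <= 0:
        return 0
    return int(remaining_ms * tok_s / 1000.0)


for tok_s in (15, 40, 100):
    print(f"{tok_s} tok/s: {ms_per_token(tok_s):.1f} ms/token, "
          f"{tokens_within_budget(100, tok_s)} tokens in a 100 ms budget")

# With an assumed 30 ms of prefill, less of the budget is left for generation.
print(tokens_within_budget(100, 100, prefill_ms=30))
```

This makes the trade-off explicit: at 15 tok/s only a single token fits in 100 ms, so autocomplete needs either a much faster model or very short suggestions.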

Workflow example

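In LM Studio, an operator selects a model and watches the 'Inference Speed' indicator. If it shows below 10 tok/s, they switch to a smaller or more heavily quantized model. In llama.cpp, running ./main -m model.gguf -n 256 -t 8 (the binary is named llama-cli in newer builds) and watching the output crawl out a few characters at a time indicates the setup is not real-time. To speed it up, operators offload as many layers as fit onto the GPU (-ngl), reduce context size (-c 2048), or use --no-mmap to load the model fully into RAM instead of paging it in from disk on demand. Lowering -n (max tokens) caps the response length and therefore the total response time, but it does not change the per-token rate. In Ollama, the OLLAMA_NUM_PARALLEL environment variable can be set to 1 to prioritize single-request latency over throughput.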
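When a tool does not display a speed indicator, the same two numbers (time to first token and generation rate) can be measured from any token stream. The sketch below is a minimal, tool-agnostic helper: measure_stream and fake_stream are hypothetical names, and fake_stream only simulates a model with sleeps so the example runs without a server. In practice the iterable would be whatever streaming interface the runtime exposes.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report first-token latency and decode rate."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival of the first token ends the prefill phase
        last = now
        count += 1
    if first is None:
        return {"tokens": 0, "ttft_ms": None, "decode_tok_s": None}
    decode_s = last - first
    # The rate covers tokens after the first one, so prefill time is excluded.
    rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"tokens": count, "ttft_ms": (first - start) * 1000.0, "decode_tok_s": rate}


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Simulated model: waits for 'prefill', then emits tokens at a fixed pace."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(n_tokens=20, prefill_s=0.05, per_token_s=0.01))
print(stats)
if stats["decode_tok_s"] is not None and stats["decode_tok_s"] < 10:
    print("Below 10 tok/s: consider a smaller or more heavily quantized model.")
```

Separating first-token latency from the decode rate matters because they respond to different fixes: a long prompt inflates the former, while model size and quantization mostly drive the latter.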

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work