Generative AI

Autoregressive Models

Autoregressive models generate text one token at a time, with each new token conditioned on every token before it: the prompt plus everything generated so far. In practice, the model runs one forward pass per generated token, taking the growing sequence as input. Because of this sequential dependency, output tokens cannot be produced in parallel, and the time to generate a response scales roughly linearly with output length. For local AI operators, this shows up directly as tokens-per-second: a model that generates 50 tokens per second takes 10 seconds to produce a 500-token response.
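The loop described above can be sketched in a few lines of Python. This is an illustrative toy, not how any particular runtime implements decoding: toy_next_token is a hypothetical stand-in for a real model's forward pass, and the 50-id vocabulary is an assumption made only to keep the example small.

```python
from typing import Callable, List


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass.

    The returned id depends on every token seen so far, which is the
    defining property of autoregressive generation.
    """
    return (sum(tokens) + len(tokens)) % 50  # toy vocabulary of 50 ids


def generate(
    prompt: List[int],
    max_new_tokens: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
    eos_id: int = -1,
) -> List[int]:
    """Generate up to max_new_tokens ids, one per loop iteration."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one "forward pass" per generated token
        tokens.append(tok)        # the output becomes part of the next input
        if tok == eos_id:         # stop early on an end-of-sequence id
            break
    return tokens


out = generate([1, 2, 3], max_new_tokens=4)
print(out)  # [1, 2, 3, 9, 19, 39, 29]
```

Each iteration needs the result of the previous one, which is why the number of iterations, and therefore wall-clock time, grows with the number of output tokens.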

Deeper dive

Autoregressive models are the dominant architecture for text generation in local AI (e.g., the GPT, Llama, and Mistral families). During inference, the model receives the prompt, predicts the next token, appends it to the input, and repeats. This loop is called 'autoregressive decoding.' The key operator-relevant detail is that generation time is driven mainly by output length: the prompt is processed in a single parallel pass (prefill), which affects time to first token, while each output token costs its own sequential step. Techniques like KV caching (storing the attention keys and values already computed for earlier tokens) avoid recomputing the entire sequence at each step, speeding up generation by roughly 2-10x, with the gain growing as the sequence gets longer. However, KV cache size grows linearly with sequence length and consumes VRAM: a 4K context with Llama 3.1 8B uses about 0.5 GB of VRAM for an FP16 cache alone, and the figure scales with batch size as well. Operators must balance context length, batch size, and quantization to stay within VRAM limits.
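To see how the KV cache scales, a rough size estimate can be computed from the model's shape. The sketch below assumes the published Llama 3.1 8B configuration (32 layers, 8 key/value heads, head dimension 128) and 2 bytes per cached value (FP16); kv_cache_bytes is a hypothetical helper, and real runtimes may add overhead or use a quantized cache, so treat the result as an approximation.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # assumption: FP16 cache
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size: one key and one value vector per
    layer, per KV head, per token, per sequence in the batch."""
    keys_and_values = 2
    return (
        keys_and_values
        * n_layers
        * n_kv_heads
        * head_dim
        * context_len
        * bytes_per_value
        * batch_size
    )


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128.
size_4k = kv_cache_bytes(32, 8, 128, context_len=4096)
size_32k = kv_cache_bytes(32, 8, 128, context_len=32768)

print(f"4K context:  {size_4k / 2**30:.2f} GiB")   # 0.50 GiB
print(f"32K context: {size_32k / 2**30:.2f} GiB")  # 4.00 GiB
```

Under these assumptions the cache costs 128 KiB per token, so doubling the context length or the batch size doubles the VRAM it needs. A different cache precision or attention layout changes the figure.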

Practical example

When running Llama 3.1 8B at Q4_K_M on an RTX 4090 (24 GB VRAM), autoregressive generation yields ~80 tok/s for short outputs. Even at that rate, generating a 4096-token response takes ~50 seconds (4096 / 80 ≈ 51 s). If VRAM is tight (e.g., a 12 GB card), the KV cache for long contexts may force offloading to system RAM, dropping speed to ~10 tok/s. Operators often limit max output tokens or use smaller models to keep generation fast.
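The arithmetic behind these figures is simple enough to script when sizing an output cap. The helpers below are hypothetical and ignore prompt processing time, so they give a lower bound on total latency, not a benchmark.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent decoding, ignoring prompt processing (prefill)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def max_tokens_for_budget(budget_seconds: float, tokens_per_second: float) -> int:
    """Largest output cap that fits within a decoding time budget."""
    return int(budget_seconds * tokens_per_second)


on_gpu = generation_seconds(4096, 80)     # all layers and cache in VRAM
offloaded = generation_seconds(4096, 10)  # after spilling to system RAM

print(f"{on_gpu:.1f} s at 80 tok/s")      # 51.2 s at 80 tok/s
print(f"{offloaded:.1f} s at 10 tok/s")   # 409.6 s at 10 tok/s
print(max_tokens_for_budget(5, 80))       # 400
```

At the offloaded rate the same 4096-token response takes close to seven minutes, which is why capping output length is usually the first lever operators reach for.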

Workflow example

In llama.cpp, autoregressive generation is the default. When you run ./main -m model.gguf -p "Hello" -n 256 (the binary is named llama-cli in recent builds), the model generates up to 256 tokens one by one, stopping earlier if it emits an end-of-sequence token. You can observe the token-by-token output in real time. In Ollama, the num_predict parameter controls the maximum output length. In vLLM, continuous batching processes multiple autoregressive streams concurrently, but each stream still generates sequentially. Operators tuning for low latency often cap output length, for example by setting num_predict to 128 in Ollama or passing -n 128 to llama.cpp.
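As one way to apply an output cap programmatically, the sketch below builds a request for Ollama's /api/generate endpoint with num_predict set in the options object. It assumes an Ollama server on the default local port (11434) and a pulled model tagged llama3.1:8b; the helper names are hypothetical, and nothing is sent over the network unless generate is called explicitly.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, max_tokens: int) -> dict:
    """Build a non-streaming request body that caps output length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_predict": max_tokens},
    }


def generate(payload: dict, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload and return the generated text.

    Requires a running Ollama server; not called at import time.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Assumed model tag; substitute whichever model is installed locally.
payload = build_generate_payload("llama3.1:8b", "Hello", max_tokens=128)
print(json.dumps(payload, indent=2))
# To send it: text = generate(payload)
```

Setting stream to False trades the real-time token display for a single response, which is simpler to handle in scripts; the cap itself behaves the same either way.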

Reviewed by Fredoline Eruo. See our editorial policy.
