Text Generation
Text generation is the process by which a language model produces a coherent sequence of tokens (words or subword pieces) in response to a prompt. In local AI, the model runs on your own hardware: it reads the prompt, then autoregressively predicts one token at a time until it hits a stop condition (e.g., the max-token limit or an end-of-sequence token). The operator controls generation via parameters like temperature (randomness), top-p (nucleus sampling, which restricts sampling to the most likely tokens), and max_tokens (length limit). Speed matters: tokens per second (tok/s) depends mainly on model size, quantization, and memory bandwidth (VRAM bandwidth on a discrete GPU, unified-memory bandwidth on Apple silicon).
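To make the loop concrete, here is a minimal sketch of autoregressive sampling with temperature, top-p, and a max-token limit. It is not how any particular runtime implements sampling: toy_model is a hypothetical stand-in that returns logits for a four-token vocabulary, and all function names are illustrative.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution.

    Temperature must be > 0 here (real runtimes often treat 0 as greedy decoding).
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next(logits, temperature, top_p, rng):
    """Sample one token id from the temperature-scaled, top-p-filtered distribution."""
    candidates = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, max_tokens, eos_id,
             temperature=0.8, top_p=0.95, seed=0):
    """Predict one token at a time until EOS is sampled or max_tokens is reached."""
    rng = random.Random(seed)
    tokens = list(prompt_ids)
    for _ in range(max_tokens):
        token = sample_next(next_logits(tokens), temperature, top_p, rng)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens[len(prompt_ids):]


def toy_model(tokens):
    """Hypothetical model: 4-token vocabulary where the end-of-sequence
    token (id 3) becomes more likely as the sequence grows."""
    return [2.0, 1.0, 0.5, 0.3 * len(tokens)]


if __name__ == "__main__":
    print(generate(toy_model, prompt_ids=[0, 1], max_tokens=16, eos_id=3))
```

A real model replaces toy_model with a forward pass over the full vocabulary, but the control flow is the same: score, filter, sample, append, repeat.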
Practical example
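Running Llama 3.1 8B at Q4_K_M on an RTX 4090 (24 GB VRAM) yields roughly 80-100 tok/s for a 512-token generation. A comparable 4-bit build of the same model on an M1 Mac with 8 GB unified memory via MLX might do ~20-30 tok/s. A 70B model at Q4 needs ~40 GB for its weights alone; on a 24 GB card, the layers that do not fit must be offloaded to system RAM, which drops speed to ~5-10 tok/s. Treat these figures as approximate: they vary with context length, runtime version, and hardware configuration.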
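The sizing arithmetic behind these numbers can be sketched in a few lines. This is a back-of-the-envelope estimate, not a benchmark. It assumes Q4_K_M averages about 4.8 bits per weight, uses nominal parameter counts of 8 and 70 billion, takes the RTX 4090's memory bandwidth as roughly 1008 GB/s, and ignores the KV cache and runtime overhead. The function names are illustrative.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(model_gb, memory_gb):
    """True if the weights alone fit; KV cache and overhead need extra room."""
    return model_gb <= memory_gb


def bandwidth_ceiling_tok_s(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed: each generated token reads every weight
    once, so throughput cannot exceed bandwidth divided by model size."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    Q4_BITS = 4.8          # assumed average bits per weight for Q4_K_M
    RTX_4090_BW = 1008.0   # assumed memory bandwidth in GB/s

    small = weight_size_gb(8, Q4_BITS)
    large = weight_size_gb(70, Q4_BITS)
    print(f"8B weights:  {small:.1f} GB, fits in 24 GB: {fits_in_memory(small, 24)}")
    print(f"70B weights: {large:.1f} GB, fits in 24 GB: {fits_in_memory(large, 24)}")
    print(f"8B ceiling on the GPU: {bandwidth_ceiling_tok_s(RTX_4090_BW, small):.0f} tok/s")
```

Measured speeds land well below the ceiling because of compute, kernel launch overhead, and attention over the growing context. The estimate is still useful for ruling out configurations that cannot fit or cannot be fast.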
Workflow example
In llama.cpp, text generation is triggered by ./llama-cli -m model.gguf -p "Hello" -n 256 (the binary was called ./main in older builds). The runtime loads the model (into VRAM when GPU offload is enabled, e.g. with -ngl), processes the prompt, then autoregressively generates up to 256 tokens, stopping earlier if the model emits an end-of-sequence token. In Ollama, ollama run llama3.1:8b starts an interactive session; each prompt triggers generation until the model emits a stop token or reaches the token limit, not merely a newline. In LM Studio, you select a model, type a prompt, and click 'Generate'; the UI shows tok/s and progress.
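The same workflow can be scripted. The sketch below assumes an Ollama server is running locally on its default port (11434) with llama3.1:8b already pulled, and uses Ollama's /api/generate endpoint with the temperature, top_p, and num_predict options. The helper names are illustrative, and the tok/s figure is derived from the eval_count and eval_duration (nanoseconds) fields of the response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed)


def build_payload(prompt, model="llama3.1:8b", temperature=0.8,
                  top_p=0.95, max_tokens=256):
    """Build a non-streaming generate request; num_predict is Ollama's max-token option."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "top_p": top_p,
            "num_predict": max_tokens,
        },
    }


def tokens_per_second(response):
    """Decode speed from the response: generated tokens over decode time in seconds."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def generate(prompt, **kwargs):
    """Send the request to the local server and return the parsed JSON response."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())


# Usage (requires a running Ollama server):
#   result = generate("Hello", max_tokens=64)
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

Setting stream to False returns a single JSON object, which keeps the example short. Interactive tools stream tokens as they are produced instead.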
Reviewed by Fredoline Eruo. See our editorial policy.