Very slow first token / OOM only at long prompts
Cause
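Prefill (the prompt-processing phase) is compute-bound. The attention part of that work scales quadratically with prompt length whether or not Flash Attention is used. Flash Attention does not change the FLOP count, but it avoids materializing the full attention matrix, which cuts attention memory from quadratic to linear and usually makes prefill much faster in practice. A 64K prompt is 32× longer than a 2K prompt, but its prefill cost is anywhere from 32× to roughly 1000× higher (32² = 1024 for the attention term), depending on how much of the work is attention and how efficient the attention implementation is.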
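A rough back-of-envelope sketch of that scaling follows. It assumes a dense decoder with illustrative, hypothetical dimensions (8B parameters, 32 layers, hidden size 4096), about 2 FLOPs per parameter per token for the linear layers, and about 4·n²·d FLOPs per layer for attention. It ignores causal masking and kernel efficiency, so treat the result as an order-of-magnitude estimate, not a benchmark.

```python
def prefill_flops(n_tokens, n_params=8e9, n_layers=32, d_model=4096):
    """Rough prefill FLOPs for a dense decoder (all defaults are assumptions)."""
    # Linear layers (projections, MLP): about 2 FLOPs per parameter per token.
    linear = 2 * n_params * n_tokens
    # Attention scores (QK^T) plus weighted sum (AV): about 4 * n^2 * d per layer.
    attention = 4 * n_layers * d_model * n_tokens ** 2
    return linear + attention


short_len, long_len = 2 * 1024, 64 * 1024
ratio = prefill_flops(long_len) / prefill_flops(short_len)
print(f"length ratio: {long_len // short_len}x, estimated cost ratio: {ratio:.0f}x")
```

Under these assumptions the cost ratio lands between the purely linear bound (32×) and the purely quadratic bound (1024×); the longer the prompt, the more the quadratic attention term dominates.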
A second cause: the KV cache grows linearly with prompt length, so it may stop fitting in VRAM once the prompt passes a threshold. That triggers OOM on long prompts even though shorter prompts are fine.
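To see where that threshold might fall, the sketch below estimates KV cache size for a single sequence. The default dimensions (32 layers, 8 KV heads, head size 128, 2-byte fp16 elements) are illustrative assumptions resembling an 8B-class model with grouped-query attention; substitute the values from your model's config.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence (defaults are assumptions)."""
    # Factor of 2: one K tensor and one V tensor are stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


GIB = 1024 ** 3
for n in (2048, 16384, 65536):
    print(f"{n:>6} tokens: {kv_cache_bytes(n) / GIB:.2f} GiB")
```

With these assumed dimensions the cache grows from 0.25 GiB at 2K tokens to 8 GiB at 64K tokens, on top of the model weights, which is how a setup that is comfortable at short prompts can run out of memory at long ones.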
Solution
1. Enable Flash Attention (most runners support it, but many don't enable it by default on older GPUs):
# llama.cpp (older builds take a bare -fa with no value)
./llama-server -m model.gguf -fa on # or --flash-attn on
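# vLLM (picks a FlashAttention backend automatically on Ampere and newer; --enforce-eager only controls CUDA graphs and is unrelated)
vllm serve <model>
# some vLLM versions also let you force the backend with the VLLM_ATTENTION_BACKEND=FLASH_ATTN environment variable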
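# Transformers (needs the flash-attn package and fp16/bf16 weights)
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")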
2. Use prefix caching if the long context is repeated across requests (system prompt, RAG context):
vllm serve <model> --enable-prefix-caching
The first request pays the full prefill cost. Later requests that start with the same prefix reuse the cached KV blocks and only prefill the part that differs.
3. Quantize the KV cache to fit longer context in the same VRAM (in llama.cpp, a quantized V cache requires Flash Attention):
./llama-server -m model.gguf -c 65536 -fa on --cache-type-k q8_0 --cache-type-v q8_0
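4. Chunk the prompt. If you're feeding a 200K-token document, summarize segments first, or use a model architecture designed for long context. Check how the advertised window was reached: Llama 4 Scout's architecture targets very long context, with an advertised 10M-token window, whereas Llama 3.1 reaches 128K through RoPE scaling on top of a shorter base context and, in practice, degrades past about 32K.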
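5. Confirm that all of the model's layers are GPU-resident. A "tail" of CPU layers (-ngl 28 on a 32-layer model) still gives correct output but kills prefill speed, because every prompt token has to pass through the slow CPU layers. Raise -ngl until every layer is offloaded.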
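For solution 2, prefix caching can only reuse work when the repeated material really is a prefix, so the prompt layout matters. The sketch below uses hypothetical helper names (build_prompt, build_prompt_bad, shared_prefix_chars) to compare two layouts. It measures the shared prefix in characters as a simple proxy; real servers match on token blocks, so this is only an approximation of what the cache would see.

```python
import os


def build_prompt(shared_context: str, question: str) -> str:
    """Cache-friendly layout: reusable context first, varying question last."""
    return f"{shared_context}\n\nQuestion: {question}\n"


def build_prompt_bad(shared_context: str, question: str) -> str:
    """Cache-hostile layout: the varying question comes before the context."""
    return f"Question: {question}\n\n{shared_context}\n"


def shared_prefix_chars(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


context = "You are a support assistant.\n" + "Reference document text. " * 200
q1, q2 = "What is the refund policy?", "How do I reset my password?"

good = shared_prefix_chars(build_prompt(context, q1), build_prompt(context, q2))
bad = shared_prefix_chars(build_prompt_bad(context, q1), build_prompt_bad(context, q2))
print(f"context length: {len(context)}, shared (good layout): {good}, shared (bad layout): {bad}")
```

With the context first, the whole shared document is a common prefix and is a candidate for cache reuse. With the question first, the prompts diverge after a few characters and the long context would be prefilled again on every request.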
Did this fix it?
If your case was different, contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.