RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /How-to
  5. /How to optimize vLLM for throughput
HOW-TO · SET

How to optimize vLLM for throughput

advanced·20 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.xWindows 11 · Ollama 0.4.xmacOS 15 · Ollama 0.4.x
PREREQUISITES

vLLM deployed with a model, benchmarking tool

What this does

Adjusts batching, parallelism, and memory settings to maximize tokens per second for a given model and GPU configuration. The goal is to keep all GPU compute units active by feeding a full batch of requests continuously.

Steps

  1. Increase max concurrent sequences. The --max-num-batched-tokens parameter sets how many tokens across all active sequences can be processed in one forward pass.

    vllm serve <model> \
      --max-num-batched-tokens 8192 \
      --gpu-memory-utilization 0.85
    

    Expected output: larger batches processed per iteration.

  2. Enable prefix caching for repeated contexts. When requests share a common system prompt, caching avoids recomputing attention.

    vllm serve <model> \
      --enable-chunked-prefill \
      --max-num-batched-tokens 4096
    

    Expected output: subsequent requests with identical prefixes show reduced prefill latency.

  3. Set appropriate tensor-parallel degree.

    vllm serve <model> \
      --tensor-parallel-size 2 \
      --num-speculative-decoder-samples 4
    

    Expected output: throughput scales linearly up to the bottleneck.

  4. Tune block size for KV cache. Smaller blocks increase memory utilization but raise CPU overhead.

    vllm serve <model> \
      --block-size 32 \
      --gpu-memory-utilization 0.88
    

    Expected output: fewer CPU overhead events per batch.

  5. Benchmark with a synthetic load.

    ab -n 1000 -c 32 -p requests.json -T application/json \
      http://localhost:8000/v1/chat/completions
    

    Expected output: requests-per-second figure; compare across configuration changes.

Verification

curl -s http://localhost:8000/metrics | grep vllm:num_generate_tokens_total
# Expected: an increasing counter confirming tokens are being generated continuously

Common failures

  • GPU utilization spikes then drops to zero — Memory overflow triggers OOM eviction. Lower --gpu-memory-utilization to 0.75.
  • Throughput lower than single-sequence baseline — Over-parallel configuration. Reduce --tensor-parallel-size to 1.
  • Chunked prefill causing output stalls — Increase --max-num-batched-tokens to at least 2x the longest prefill length.
  • Low tokens/second despite high GPU utilization — The bottleneck is network I/O. Move the client closer to the server.
  • Speculative decoding degradation — Draft model quality poor for the target domain. Disable with --num-speculative-decoder-samples 0.

Related guides

  • How to enable tensor parallelism in vLLM
  • How to configure vLLM GPU memory allocation
  • Course Ollama Deep Dive
RELATED GUIDES
SET
How to enable tensor parallelism in vLLM
SET
How to configure vLLM GPU memory allocation
← All how-to guidesCourses →