HOW-TO · INF
How to run batch inference for processing multiple prompts
PREREQUISITES
vLLM or compatible batch inference runtime
What this does
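Batch inference sends many prompts to the model together, so each forward pass serves several sequences at once, which improves GPU utilization and throughput. This guide covers batched generation with vLLM, concurrent requests against Ollama, and a timing comparison of one-at-a-time versus batched runs.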
Steps
Run batch inference with vLLM.
from vllm import LLM, SamplingParams

# Load the model once; generate() accepts a list of prompts and batches them internally.
llm = LLM(model="meta-llama/Llama-3.2-3B")
prompts = [
    "Explain Python generators",
    "Write a sorting function",
    "What is recursion?",
    "Define a closure",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(prompts, params)
for i, o in enumerate(outputs):
    print(f"Prompt {i+1}: {o.outputs[0].text[:80]}...")
Batch via Ollama with concurrent requests.
# Using GNU Parallel (Linux/macOS)
parallel -j 4 curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"{}","stream":false}' \
  ::: "Prompt 1" "Prompt 2" "Prompt 3" "Prompt 4"
Measure batch throughput improvement.
import time

def measure_batch(llm, prompts, batch_size):
    # Reuse the model loaded in the first step; constructing a second LLM
    # on the same GPU can run out of memory.
    params = SamplingParams(max_tokens=64)
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        llm.generate(prompts[i:i+batch_size], params)
    return time.perf_counter() - start

single = measure_batch(llm, prompts * 4, 1)
batched = measure_batch(llm, prompts * 4, 4)
print(f"Single: {single:.2f}s, Batched: {batched:.2f}s, Speedup: {single/batched:.1f}x")
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
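The concurrent-request step for Ollama can also be sketched in Python using only the standard library, which avoids shell quoting when prompts contain quotes or spaces. This is a minimal sketch, not a definitive client: it assumes the same endpoint and model name as the curl command above, and the helper names (`build_payload`, `generate`, `generate_all`), the worker count, and the timeout are illustrative choices. Whether the server actually processes the requests in parallel depends on how it is configured.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Endpoint taken from the curl command in the steps above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3.2"):
    """Encode one non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(prompt):
    """POST one prompt to Ollama and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    # The 300-second timeout is an arbitrary illustrative value.
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]


def generate_all(prompts, workers=4):
    """Send prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate, prompts))


# Example call (needs a running Ollama server):
# texts = generate_all(["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"])
```

Building the body with `json.dumps` means each prompt is escaped correctly, and `pool.map` keeps the results aligned with the input list.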
Verification
python batch_inference.py
# Expected output: Speedup of 2-4x when using batched vs. sequential inference
# Example: Single: 12.5s, Batched: 4.2s, Speedup: 3.0x
Common failures
- OOM with large batches: Reduce batch_size or enable --enable-chunked-prefill. Each prompt in the batch consumes KV cache.
- Padding inefficiency: Variable-length prompts waste compute. Use a bucket-based batching strategy (group prompts of similar length).
- vLLM process hangs: Ensure enough shared memory: sudo mount -t tmpfs -o size=64G tmpfs /dev/shm.
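The bucket-based batching mentioned under padding inefficiency might look like the sketch below. It is one possible approach, not a vLLM feature: the function name `bucket_by_length`, the bucket width, and the use of whitespace word count as a stand-in for token count are all assumptions made for illustration. A real pipeline would more likely measure length with the model's tokenizer.

```python
from collections import defaultdict


def bucket_by_length(prompts, bucket_width=32, batch_size=4):
    """Group prompts into batches whose lengths fall in the same bucket."""
    buckets = defaultdict(list)
    for prompt in prompts:
        # Whitespace word count is a rough stand-in for token count.
        buckets[len(prompt.split()) // bucket_width].append(prompt)

    batches = []
    for key in sorted(buckets):
        group = buckets[key]
        # Split each bucket into batches of at most batch_size prompts.
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


short_and_long = ["a b", "c d e", " ".join(["w"] * 40), "x"]
batches = bucket_by_length(short_and_long, bucket_width=32, batch_size=2)
print(batches)
```

Batches come out ordered by bucket, not by input position, so keep an index alongside each prompt if the outputs need to be matched back to the original order.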
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
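One way to capture that checkpoint is a small script that collects the details into a JSON record. This is a sketch under stated assumptions: the field names and the helpers `package_version` and `build_run_record` are hypothetical, and the command and model values are copied from this guide. The output field is left as a placeholder to be filled in from the actual run.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(command, model, output, packages=("vllm",)):
    """Collect the details needed to reproduce a run later."""
    return {
        "command": command,
        "model": model,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "output": output,
    }


record = build_run_record(
    command="python batch_inference.py",
    model="meta-llama/Llama-3.2-3B",
    output="<paste the observed output line here>",
)
print(json.dumps(record, indent=2))
```

Redirecting the printed JSON to a file next to the script keeps the evidence with the code it describes. Hardware details such as GPU model are not collected here and would need to be added by hand.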
Related guides