What this does

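Batch inference submits many prompts together so that each forward pass serves several sequences at once, which raises GPU utilization and throughput. This guide covers batched generation with vLLM, concurrent requests against Ollama, and a timing comparison between one-at-a-time and batched submission.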

Steps

Run batch inference with vLLM.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B")
prompts = ["Explain Python generators", "Write a sorting function",
           "What is recursion?", "Define a closure"]
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(prompts, params)
for i, o in enumerate(outputs):
    print(f"Prompt {i+1}: {o.outputs[0].text[:80]}...")

Batch via Ollama with concurrent requests.

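# Using GNU Parallel (Linux/macOS, bash). A shell function keeps the JSON
# quoting intact; inline JSON would be re-parsed by the shell that parallel spawns.
gen() {
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"llama3.2\",\"prompt\":\"$1\",\"stream\":false}"
}
export -f gen
parallel -j 4 gen ::: "Prompt 1" "Prompt 2" "Prompt 3" "Prompt 4"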
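
The same concurrent-request pattern can be sketched in Python, which avoids shell quoting because json.dumps escapes the prompt. This is a minimal sketch, not an official client: it assumes the endpoint and model name from the command above, and the helper names (build_payload, send_prompt, run_concurrently) are made up for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Endpoint taken from the curl command above; adjust if Ollama listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3.2"):
    """Serialize one non-streaming generate request; json.dumps handles escaping."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def send_prompt(prompt):
    """POST one prompt to Ollama and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]


def run_concurrently(prompts, send=send_prompt, workers=4):
    """Send prompts from a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))


# Usage (requires a running Ollama server):
# for text in run_concurrently(["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"]):
#     print(text[:80])
```

The send parameter is injectable so the concurrency logic can be exercised without a server. How many requests the server actually processes in parallel depends on its own configuration, so the workers value is only an upper bound on the client side.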

Measure batch throughput improvement.

import time

def measure_batch(llm, prompts, batch_size):
    """Time generation over prompts, submitting batch_size prompts per call."""
    params = SamplingParams(max_tokens=64)
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        llm.generate(prompts[i:i+batch_size], params)
    return time.perf_counter() - start

# Reuse the llm and prompts objects from the first step. Loading the model
# inside the function would allocate it twice and can exhaust GPU memory.
single = measure_batch(llm, prompts * 4, 1)
batched = measure_batch(llm, prompts * 4, 4)
print(f"Single: {single:.2f}s, Batched: {batched:.2f}s, Speedup: {single/batched:.1f}x")
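
Before spending GPU time, the chunking and timing logic can be dry-run with a stand-in for llm.generate. The sketch below is illustrative only: fake_generate is a hypothetical placeholder whose sleep-based cost model is an assumption, not a measurement of vLLM.

```python
import time


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def time_batches(generate, prompts, batch_size):
    """Return (elapsed_seconds, number_of_generate_calls) for one pass."""
    calls = 0
    start = time.perf_counter()
    for batch in chunked(prompts, batch_size):
        generate(batch)
        calls += 1
    return time.perf_counter() - start, calls


def fake_generate(batch):
    """Hypothetical stand-in: fixed cost per call plus a small cost per prompt."""
    time.sleep(0.01 + 0.001 * len(batch))


# Usage with a real model (assumes llm and params exist as in the steps above):
# elapsed, calls = time_batches(lambda b: llm.generate(b, params), prompts * 4, 4)
```

Counting the calls alongside the elapsed time makes it easy to confirm that a batch size of 4 over 16 prompts really issues 4 calls rather than 16.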

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
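
One way to capture that evidence is a small JSON record written next to the script. This is a sketch under assumptions: the field names and the run_record.json file name are illustrative choices, and the vllm entry is simply None when the package is not installed.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(command, model, output):
    """Collect the command, versions, model name, and observed output."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "model": model,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "vllm": package_version("vllm"),
        "observed_output": output,
    }


def save_run_record(record, path="run_record.json"):
    """Write the record as indented JSON (the file name is an arbitrary choice)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)


# Usage:
# record = build_run_record("python batch_inference.py",
#                           "meta-llama/Llama-3.2-3B",
#                           "paste the printed Single/Batched/Speedup line here")
# save_run_record(record)
```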

Verification

python batch_inference.py
# Expected output: Speedup of 2-4x when using batched vs. sequential inference
# (the exact figure depends on GPU, model, and prompt lengths)
# Illustrative example: Single: 12.5s, Batched: 4.2s, Speedup: 3.0x

Common failures

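OOM with large batches: Reduce batch_size or enable chunked prefill (enable_chunked_prefill=True on the LLM constructor, --enable-chunked-prefill on the vLLM server). Each sequence in the batch holds its own KV cache, so memory grows with batch size and sequence length.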
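Padding inefficiency: In backends that pad every prompt to the longest one in the batch, variable-length prompts waste compute. Use a bucket-based batching strategy (group prompts of similar length). vLLM's continuous batching schedules sequences individually, so this mainly affects static-batching setups.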
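vLLM process hangs: Ensure enough shared memory: sudo mount -t tmpfs -o size=64G tmpfs /dev/shm. Pick a size that fits in available RAM; inside Docker, pass --shm-size or --ipc=host to the container instead.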
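
The bucket-based strategy mentioned above can be sketched as follows. This is a minimal illustration, assuming character count is an acceptable proxy for prompt length (a tokenizer-based count would be more accurate); the function names are hypothetical.

```python
def bucket_by_length(prompts, batch_size, length=len):
    """Sort prompts by length and cut them into batches of similar length.

    Returns (batches, index_batches); index_batches records each prompt's
    original position so results can be put back in order afterwards.
    """
    order = sorted(range(len(prompts)), key=lambda i: length(prompts[i]))
    index_batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    batches = [[prompts[i] for i in indices] for indices in index_batches]
    return batches, index_batches


def restore_order(results_per_batch, index_batches, total):
    """Place per-batch results back at their original prompt positions."""
    restored = [None] * total
    for results, indices in zip(results_per_batch, index_batches):
        for result, original_index in zip(results, indices):
            restored[original_index] = result
    return restored


# Usage: run each batch through the backend, then undo the sort.
# batches, index_batches = bucket_by_length(prompts, batch_size=4)
# results = [backend_generate(batch) for batch in batches]  # backend_generate is a placeholder
# ordered = restore_order(results, index_batches, len(prompts))
```

Keeping the index lists alongside the batches means callers still receive outputs in the order they submitted the prompts, even though the backend saw them sorted.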

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

How to run batch inference for processing multiple prompts
