RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to run batch inference for processing multiple prompts

intermediate·15 min·By Fredoline Eruo
PREREQUISITES

vLLM or compatible batch inference runtime

What this does

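Batch inference submits many prompts to the model together, so each forward pass serves several sequences at once instead of one. This raises GPU utilization and throughput. This guide covers batched generation with vLLM, concurrent requests against Ollama, and a timing comparison between one-at-a-time and batched runs.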

Steps

  1. Run batch inference with vLLM.

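    from vllm import LLM, SamplingParams
    
    # Load the model once; every prompt passed to generate() is scheduled together.
    llm = LLM(model="meta-llama/Llama-3.2-3B")
    prompts = ["Explain Python generators", "Write a sorting function",
               "What is recursion?", "Define a closure"]
    params = SamplingParams(temperature=0.7, max_tokens=128)
    # Outputs come back in the same order as the input prompts.
    outputs = llm.generate(prompts, params)
    for i, o in enumerate(outputs):
        # o.outputs[0] is the first (here, only) completion for this prompt.
        print(f"Prompt {i+1}: {o.outputs[0].text[:80]}...")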
    
  2. Batch via Ollama with concurrent requests.

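    # Using GNU Parallel (Linux/macOS). -q quotes the command so the shell that
    # Parallel spawns does not strip the JSON quotes or brace-expand the body.
    # Ollama runs these concurrently only up to its OLLAMA_NUM_PARALLEL setting.
    # Prompts containing double quotes or backslashes must be JSON-escaped first.
    parallel -j 4 -q curl -s http://localhost:11434/api/generate \
      -d '{"model":"llama3.2","prompt":"{}","stream":false}' \
      ::: "Prompt 1" "Prompt 2" "Prompt 3" "Prompt 4"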
    
  3. Measure batch throughput improvement.

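    import time
    from vllm import LLM, SamplingParams
    
    # Load the model once and reuse it for both runs. Building a second LLM in
    # the same process can exhaust GPU memory.
    llm = LLM(model="meta-llama/Llama-3.2-3B")
    params = SamplingParams(max_tokens=64)
    prompts = ["Explain Python generators", "Write a sorting function",
               "What is recursion?", "Define a closure"]
    
    def measure_batch(llm, prompts, batch_size):
        """Return wall-clock seconds to generate all prompts in chunks of batch_size."""
        start = time.perf_counter()
        for i in range(0, len(prompts), batch_size):
            llm.generate(prompts[i:i+batch_size], params)
        return time.perf_counter() - start
    
    single = measure_batch(llm, prompts * 4, 1)   # 16 prompts, one per call
    batched = measure_batch(llm, prompts * 4, 4)  # 16 prompts, four per call
    print(f"Single: {single:.2f}s, Batched: {batched:.2f}s, Speedup: {single/batched:.1f}x")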
    
  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
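
The concurrent requests in step 2 can also be issued from Python when GNU Parallel is not available. The sketch below uses only the standard library and a thread pool. The helper names (`build_payload`, `ask_ollama`, `run_concurrent`) and the timeout are illustrative assumptions, not part of Ollama. It assumes an Ollama server on localhost:11434 with `llama3.2` pulled, and that the non-streaming reply carries the generated text in its `response` field.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint used in step 2


def build_payload(prompt, model="llama3.2"):
    """Encode one request body; json.dumps escapes quotes and newlines in the prompt."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def ask_ollama(prompt, model="llama3.2", timeout=300):
    """Send one blocking, non-streaming generate request and return the text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]


def run_concurrent(prompts, send=ask_ollama, workers=4):
    """Keep up to `workers` requests in flight; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

Calling `run_concurrent(["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"])` mirrors the `-j 4` invocation. Building the body with `json.dumps` sidesteps the shell-quoting problems that hand-written JSON runs into.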

Verification

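# Save the code from step 3 as batch_inference.py, then run:
python batch_inference.py
# Expected output: a speedup of roughly 2-4x for batched vs. one-at-a-time inference
# (the exact figure depends on GPU, model, and prompt lengths)
# Example: Single: 12.5s, Batched: 4.2s, Speedup: 3.0x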
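
Wall-clock time alone does not show how many tokens each run produced. The arithmetic behind the verification line can be sketched as below, assuming you have collected the elapsed seconds and the generated-token counts yourself (with vLLM, `len(o.outputs[0].token_ids)` per output is one way to count). The function names are illustrative.

```python
def speedup(single_seconds, batched_seconds):
    """Ratio of one-at-a-time time to batched time; above 1.0 means batching won."""
    if batched_seconds <= 0:
        raise ValueError("batched_seconds must be positive")
    return single_seconds / batched_seconds


def tokens_per_second(token_counts, elapsed_seconds):
    """Aggregate generation throughput across every prompt in the run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sum(token_counts) / elapsed_seconds


# Timings taken from the example line: 12.5 s one-at-a-time, 4.2 s batched.
print(f"Speedup: {speedup(12.5, 4.2):.1f}x")  # prints "Speedup: 3.0x"
```

Reporting tokens per second alongside the speedup makes runs with different `max_tokens` settings comparable.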

Common failures

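  • OOM with large batches: Each prompt in the batch consumes KV cache. Reduce batch_size, or turn on chunked prefill (--enable-chunked-prefill on the vLLM command line, enable_chunked_prefill=True in the Python API).
  • Padding inefficiency: In runtimes that pad every prompt to the longest one in the batch, variable-length prompts waste compute. Use a bucket-based batching strategy (group prompts of similar length). vLLM's continuous batching schedules sequences without this padding, so the issue mainly affects static-batching runtimes.
  • vLLM process hangs: Ensure enough shared memory: sudo mount -t tmpfs -o size=64G tmpfs /dev/shm (choose a size that fits your RAM). In Docker, pass --ipc=host or a larger --shm-size instead.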
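
The bucket-based strategy mentioned for padding inefficiency can be sketched as follows. This is a minimal illustration rather than a vLLM feature: `bucket_by_length`, `bucket_width`, and the word-count length measure are all assumptions made for the example, and a real workload would measure length with the model's tokenizer.

```python
from collections import defaultdict


def bucket_by_length(prompts, bucket_width=32, batch_size=4):
    """Group prompts of similar length, then split each group into batches.

    Length is counted in whitespace-separated words as a rough stand-in
    for token count (an assumption made for this sketch).
    Returns a list of batches, each a list of (original_index, prompt).
    """
    buckets = defaultdict(list)
    for index, prompt in enumerate(prompts):
        length = len(prompt.split())
        buckets[length // bucket_width].append((index, prompt))

    batches = []
    for key in sorted(buckets):  # shortest bucket first
        group = buckets[key]
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches
```

Each entry keeps its original index so that results can be put back in input order after the batches are generated.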

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
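
One way to capture that checkpoint is a small JSON record written next to the script. The sketch below uses only the standard library. The function names, the field names, and the default `run_record.json` path are assumptions for illustration. It does not detect the GPU, so add the hardware/backend entry by hand.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(command, model, output, packages=("vllm", "torch")):
    """Collect the details the checkpoint asks for into one dictionary."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "model": model,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "verification_output": output,
    }


def save_run_record(record, path="run_record.json"):
    """Write the record as indented JSON so later runs can be diffed against it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

For example, `save_run_record(build_run_record("python batch_inference.py", "meta-llama/Llama-3.2-3B", "Single: 12.5s, Batched: 4.2s, Speedup: 3.0x"))` stores the command, model, versions, and observed output together.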

Related guides

  • How to optimize batch inference throughput
  • How to configure batch size limits to prevent memory overflow