What this does

Batching embedding requests improves throughput by processing multiple texts in parallel. This guide shows how to find the batch size that maximizes throughput without causing out-of-memory errors.

Steps

Create a batch embedding test script and save it as tune_batch_size.py.

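import time

import requests

def batch_embed(texts, batch_size):
    """Embed texts in chunks of batch_size; return (embeddings, average seconds per text)."""
    embeddings = []
    total_time = 0.0
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        start = time.perf_counter()
        # /api/embeddings takes a single prompt, so each text in the batch is its own request.
        for text in batch:
            r = requests.post("http://localhost:11434/api/embeddings",
                json={"model": "all-minilm", "prompt": text})
            r.raise_for_status()
            embeddings.append(r.json()["embedding"])
        # Accumulate across all batches; timing only the last batch understates the cost.
        total_time += time.perf_counter() - start
    return embeddings, total_time / len(texts)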
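The loop above issues one HTTP request per text, so batch_size only changes how those requests are grouped. A minimal sketch of sending each batch in a single request follows. It assumes your Ollama version exposes the /api/embed endpoint, which accepts a list under "input" and returns a list under "embeddings"; check your server's API documentation before relying on it. The names post_json, batch_embed_single_request, and the send parameter are illustrative and not part of the original script.

```python
import json
import time
import urllib.request

# Assumption: the server exposes a batch-capable endpoint at this path.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def batch_embed_single_request(texts, batch_size, model="all-minilm", send=post_json):
    """Send each batch as one request; return (embeddings, average seconds per text)."""
    embeddings = []
    total_time = 0.0
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        start = time.perf_counter()
        data = send(OLLAMA_EMBED_URL, {"model": model, "input": batch})
        total_time += time.perf_counter() - start
        # Assumed response shape: one embedding per input text, in order.
        embeddings.extend(data["embeddings"])
    return embeddings, total_time / max(len(texts), 1)
```

Swapping batch_embed_single_request(texts, batch_size) for batch_embed(texts, batch_size) in the measurement loop should make the batch size affect the work done per request. The send parameter exists only so the function can be exercised without a running server.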

Test multiple batch sizes and measure throughput.

texts = ["Sample text"] * 100  # 100 identical texts for measurement
for batch_size in [1, 2, 4, 8, 16, 32]:
    _, avg_time = batch_embed(texts, batch_size)
    throughput = 1 / avg_time if avg_time > 0 else 0
    print(f"Batch {batch_size:2d}: {avg_time*1000:.2f} ms/sample, {throughput:.0f} samples/sec")
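A single pass per batch size can be noisy, especially on the first call when the model is still loading. One possible refinement, sketched here under the assumption that a warm-up run and a median over a few repeats are acceptable for your setup, is shown below. The helper name measure_throughput and its parameters are hypothetical.

```python
import statistics
import time


def measure_throughput(run_once, n_samples, repeats=3, warmup=1):
    """Return samples/sec based on the median duration of several runs."""
    # Warm-up runs are executed but not timed (model load, cache fill).
    for _ in range(warmup):
        run_once()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    median = statistics.median(durations)
    return n_samples / median if median > 0 else 0.0
```

With the script's own function, a call might look like measure_throughput(lambda: batch_embed(texts, 8), len(texts)).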

Monitor memory at each batch size.

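import subprocess

def get_gpu_mem():
    """Return used GPU memory in MiB for the first GPU reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    # Output looks like "1234 MiB"; keep only the number.
    return int(result.stdout.strip().split()[0])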
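The 90% threshold used later needs the GPU's total memory as well as the used amount. A small sketch that reads both in one nvidia-smi call is given below. It assumes the memory.used and memory.total query fields with the csv,noheader,nounits format, which print one "used, total" line per GPU in MiB. The function names parse_gpu_memory and get_gpu_memory are illustrative.

```python
import subprocess


def parse_gpu_memory(csv_text, gpu_index=0):
    """Parse 'used, total' lines (MiB, one per GPU) and return one GPU's pair."""
    lines = [ln for ln in csv_text.strip().splitlines() if ln.strip()]
    used, total = (int(field.strip()) for field in lines[gpu_index].split(","))
    return used, total


def get_gpu_memory(gpu_index=0):
    """Query nvidia-smi and return (used_mib, total_mib) for the chosen GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout, gpu_index)
```

Keeping the parsing separate from the subprocess call makes it possible to check the logic on a machine without a GPU.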

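Select the optimal batch size. The ideal point is where throughput stops improving while GPU memory use stays below 90% of total VRAM. Then run a 2-minute stability test at that batch size:

batch_size = 16     # set to the batch size where your throughput plateaued
vram_total = 8192   # placeholder: total VRAM in MiB, adjust to your GPU
deadline = time.monotonic() + 120  # run for about 2 minutes
while time.monotonic() < deadline:
    batch_embed(texts[:batch_size], batch_size)
    mem = get_gpu_mem()
    if mem > vram_total * 0.9:
        print(f"WARNING: Memory exceeded at batch {batch_size}")
        break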
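The selection rule described above (throughput plateau, memory under 90%) can also be applied programmatically. The sketch below is one way to do it, assuming you have recorded a (batch_size, throughput, memory used in MiB) tuple per run. The function pick_batch_size, the 5% plateau tolerance, and the memory figures in the example are assumptions for illustration; the throughput figures are the ones from the example output in this guide.

```python
def pick_batch_size(results, vram_total, tolerance=0.05, vram_limit=0.9):
    """Return the smallest batch size within `tolerance` of the best throughput,
    considering only runs whose memory use stayed under the VRAM limit."""
    safe = [r for r in results if r[2] <= vram_total * vram_limit]
    if not safe:
        return None  # every run exceeded the memory limit
    best = max(throughput for _, throughput, _ in safe)
    for batch_size, throughput, _ in sorted(safe):
        if throughput >= best * (1 - tolerance):
            return batch_size
    return None


# (batch_size, samples/sec, used MiB); memory values are made up for illustration.
measurements = [(1, 22, 1000), (16, 78, 3000), (32, 76, 3500)]
print(pick_batch_size(measurements, vram_total=8192))
```

Preferring the smallest batch size on the plateau leaves more memory headroom for longer inputs.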

Verification

python tune_batch_size.py
# Expected output: Throughput increasing with batch size until plateau or OOM
# Example: Batch  1: 45.2 ms/sample, 22 samples/sec
# Example: Batch 16: 12.8 ms/sample, 78 samples/sec
# Example: Batch 32: 13.1 ms/sample, 76 samples/sec (throughput has plateaued)

Common failures

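No throughput gain at larger batches: The script sends one HTTP request per text, and Ollama's embedding endpoint may process them sequentially. Use sentence-transformers natively for true batching.
Memory leak in long tests: Free GPU memory between runs by unloading the model (ollama stop all-minilm) or by restarting the Ollama server process before starting the next run.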
Contention with other GPU workloads: Run batch tuning in isolation for accurate measurements.

How to fine-tune embedding batch sizes for your hardware
