HOW-TO · INF
How to tune embedding batch sizes for your hardware
PREREQUISITES
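An embedding model served by Ollama at localhost:11434 (the examples use all-minilm), Python with requests, and an NVIDIA GPU with nvidia-smi for the memory checks. Install sentence-transformers if you want native batching.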
What this does
Batching embedding requests improves throughput by processing multiple texts in parallel. This guide shows how to find the batch size that gives the highest throughput without causing out-of-memory errors.
Steps
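Create a batch embedding test script and save it as tune_batch_size.py.
import time

import requests

def batch_embed(texts, batch_size):
    """Embed texts in chunks of batch_size; return (embeddings, average seconds per text)."""
    embeddings = []
    total_time = 0.0
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        start = time.perf_counter()
        # /api/embeddings takes one prompt per request, so each text is sent separately.
        for text in batch:
            r = requests.post("http://localhost:11434/api/embeddings",
                              json={"model": "all-minilm", "prompt": text})
            r.raise_for_status()
            embeddings.append(r.json()["embedding"])
        # Accumulate the time of every batch, not only the last one.
        total_time += time.perf_counter() - start
    return embeddings, total_time / len(texts)
Test multiple batch sizes and measure throughput.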
texts = ["Sample text"] * 100  # 100 identical texts for measurement
for batch_size in [1, 2, 4, 8, 16, 32]:
    _, avg_time = batch_embed(texts, batch_size)
    throughput = 1 / avg_time if avg_time > 0 else 0
    print(f"Batch {batch_size:2d}: {avg_time*1000:.2f} ms/sample, {throughput:.0f} samples/sec")
Monitor memory at each batch size.
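import subprocess

def get_gpu_mem():
    """Return used memory in MiB for the first GPU reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True)
    # Output looks like "1234 MiB"; keep only the number.
    return int(result.stdout.strip().split()[0])
Select the optimal batch size. The ideal point is where throughput stops improving while VRAM use stays below 90% of capacity. Confirm the candidate with a 2-minute stability test: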
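batch_size = 16      # assumed candidate; use the plateau point from your own run
vram_total = 8192    # assumed total VRAM in MiB; set this to your GPU's capacity
deadline = time.monotonic() + 120  # run for about 2 minutes
while time.monotonic() < deadline:
    batch_embed(texts[:batch_size], batch_size)
    mem = get_gpu_mem()
    if mem > vram_total * 0.9:
        print(f"WARNING: Memory exceeded 90% of VRAM at batch {batch_size}")
        break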
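The selection rule above can be written down as a small function. This is a minimal sketch, not part of the original script: the name pick_batch_size, the 5% plateau tolerance, and the sample measurements are all assumptions made for illustration.

```python
def pick_batch_size(measurements, vram_total_mib, tolerance=0.05, vram_fraction=0.9):
    """Return the smallest batch size on the throughput plateau that fits in VRAM.

    measurements: list of (batch_size, samples_per_sec, vram_used_mib) tuples.
    Returns None if no run stayed under the VRAM budget.
    """
    # Keep only runs that stayed under the VRAM budget (90% by default).
    budget = vram_total_mib * vram_fraction
    safe = [m for m in measurements if m[2] <= budget]
    if not safe:
        return None
    best = max(m[1] for m in safe)
    # The plateau is every run within `tolerance` of the best throughput.
    on_plateau = [m[0] for m in safe if m[1] >= best * (1 - tolerance)]
    return min(on_plateau)


# Hypothetical measurements: (batch size, samples/sec, VRAM used in MiB).
runs = [(1, 22, 900), (8, 61, 1400), (16, 78, 2100), (32, 76, 3900), (64, 79, 7800)]
print(pick_batch_size(runs, vram_total_mib=8192))  # prints 16
```

Picking the smallest batch on the plateau, rather than the single fastest run, leaves memory headroom when two sizes differ only by measurement noise.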
Verification
python tune_batch_size.py
# Expected output: Throughput increasing with batch size until plateau or OOM
# Example: Batch 1: 45.2 ms/sample, 22 samples/sec
# Example: Batch 16: 12.8 ms/sample, 78 samples/sec
# Example: Batch 32: 13.1 ms/sample, 76 samples/sec → plateau detected
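When the model runs in-process (for example through PyTorch), a sweep can stop cleanly at the first out-of-memory failure instead of crashing. The sketch below assumes a hypothetical measure(batch_size) callable that returns samples/sec; sweep and fake_measure are illustrative names. With Ollama over HTTP, a memory failure would more likely surface as a failed request, so the caught exception types would need adjusting.

```python
def sweep(measure, batch_sizes):
    """Call measure(batch_size) for each size and stop at the first memory failure.

    Returns (results, failed_at): results is a list of (batch_size, samples_per_sec),
    failed_at is the batch size that failed, or None if every size succeeded.
    """
    results = []
    for bs in batch_sizes:
        try:
            results.append((bs, measure(bs)))
        except (RuntimeError, MemoryError) as exc:
            # PyTorch raises CUDA out-of-memory errors as a RuntimeError subclass.
            print(f"Batch {bs}: failed ({exc}); stopping sweep")
            return results, bs
    return results, None


# Stand-in for a real measurement: pretends any batch above 16 runs out of memory.
def fake_measure(bs):
    if bs > 16:
        raise RuntimeError("CUDA out of memory")
    return 20.0 * bs ** 0.5


results, failed_at = sweep(fake_measure, [1, 4, 16, 32, 64])
print(results, failed_at)
```

The largest batch size that completed is then an upper bound for the stability test.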
Common failures
- No throughput gain at larger batches: Ollama's embedding endpoint may process sequentially. Use sentence-transformers natively for true batching.
- Memory leak in long tests: Restart the Ollama service between runs: ollama serve after ollama stop.
- Contention with other GPU workloads: Run batch tuning in isolation for accurate measurements.
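For the native-batching route mentioned in the first failure above, throughput can be timed around any encoder that accepts a list of texts and a batch_size argument. This is a sketch under assumptions: measure_throughput and fake_encode are hypothetical names, and the stand-in encoder only simulates a fixed cost per batch so the example runs without a model.

```python
import time


def measure_throughput(encode, texts, batch_size, warmup=1):
    """Return samples/sec for one timed pass of encode(texts, batch_size=...)."""
    for _ in range(warmup):
        # Warm-up pass so one-time setup cost does not skew the timing.
        encode(texts[:batch_size], batch_size=batch_size)
    start = time.perf_counter()
    vectors = encode(texts, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    assert len(vectors) == len(texts)
    return len(texts) / elapsed


# Stand-in encoder (assumption): sleeps about 1 ms per batch, returns dummy vectors.
def fake_encode(texts, batch_size=32):
    n_batches = -(-len(texts) // batch_size)  # ceiling division
    time.sleep(0.001 * n_batches)
    return [[0.0] * 4 for _ in texts]


sample_texts = ["Sample text"] * 64
for bs in [1, 8, 32]:
    rate = measure_throughput(fake_encode, sample_texts, bs)
    print(f"Batch {bs:2d}: {rate:.0f} samples/sec")
```

With sentence-transformers installed, the encode method of a loaded SentenceTransformer model can be passed in place of fake_encode, since it takes a list of sentences and a batch_size argument. The model to load (for example all-MiniLM-L6-v2) is your choice and is not specified by this guide.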
Related guides