RUNLOCALAIv38
© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to fine-tune embedding batch sizes for your hardware

intermediate·15 min·By Fredoline Eruo
PREREQUISITES

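An embedding model served by Ollama (the examples use all-minilm on localhost:11434), Python with requests, and optionally sentence-transformers for native batching. The memory checks assume an NVIDIA GPU with nvidia-smi available.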

What this does

Batching hands several texts to the embedding model in one pass, which usually raises throughput until the GPU is saturated or memory runs out. This guide measures throughput across a range of batch sizes to find the one that gives the most gain without causing out-of-memory errors.

Steps

  1. Create a batch embedding test script and save it as tune_batch_size.py.

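    import time
    import requests
    
    OLLAMA_URL = "http://localhost:11434/api/embeddings"
    
    def batch_embed(texts, batch_size, model="all-minilm"):
        """Embed texts in chunks of batch_size; return (embeddings, seconds per text)."""
        embeddings = []
        total_time = 0.0
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            start = time.perf_counter()
            # This endpoint takes one prompt per request, so each chunk is sent text by text.
            for text in batch:
                r = requests.post(OLLAMA_URL, json={"model": model, "prompt": text})
                r.raise_for_status()
                embeddings.append(r.json()["embedding"])
            # Accumulate across all chunks, not just the last one.
            total_time += time.perf_counter() - start
        return embeddings, total_time / max(len(texts), 1)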
    
  2. Test multiple batch sizes and measure throughput.

    texts = ["Sample text"] * 100  # 100 identical texts for measurement
    for batch_size in [1, 2, 4, 8, 16, 32]:
        _, avg_time = batch_embed(texts, batch_size)
        throughput = 1 / avg_time if avg_time > 0 else 0
        print(f"Batch {batch_size:2d}: {avg_time*1000:.2f} ms/sample, {throughput:.0f} samples/sec")
    
  3. Monitor memory at each batch size.

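    import subprocess
    
    def get_gpu_mem():
        """Return VRAM currently in use on the first GPU, in MiB."""
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
        # One line per GPU; take the first value.
        return int(result.stdout.strip().split()[0])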
    
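  4. Select the optimal batch size. Pick the smallest batch size at which throughput stops improving, provided VRAM use stays below 90% of the total. Then run a 2-minute stability test at that size:

    batch_size = 16  # candidate chosen from the step 2 results
    vram_total = int(subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True).stdout.split()[0])  # MiB, first GPU
    deadline = time.monotonic() + 120  # keep embedding for about 2 minutes
    while time.monotonic() < deadline:
        batch_embed(texts[:batch_size], batch_size)
        mem = get_gpu_mem()
        if mem > vram_total * 0.9:
            print(f"WARNING: VRAM above 90% at batch {batch_size} ({mem}/{vram_total} MiB)")
            break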
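
The plateau rule in step 4 can be made mechanical. The sketch below is one possible reading of it, not the guide's official method: it returns the smallest batch size whose throughput is within a tolerance of the best measured value. The function name `pick_batch_size` and the 5% default tolerance are assumptions chosen for illustration, and the VRAM limit still has to be checked separately.

```python
def pick_batch_size(throughput_by_batch, tolerance=0.05):
    """Return the smallest batch size whose throughput is within
    `tolerance` (a fraction) of the best measured throughput.

    throughput_by_batch maps batch size -> samples per second.
    The 5% default is an illustrative assumption, not a recommendation.
    """
    if not throughput_by_batch:
        raise ValueError("no measurements supplied")
    best = max(throughput_by_batch.values())
    threshold = best * (1.0 - tolerance)
    # Walk batch sizes from small to large and stop at the first one
    # that is already "close enough" to the peak.
    for batch_size in sorted(throughput_by_batch):
        if throughput_by_batch[batch_size] >= threshold:
            return batch_size


# Figures taken from the example output in the Verification section.
measured = {1: 22, 16: 78, 32: 76}
chosen = pick_batch_size(measured)
print(f"Chosen batch size: {chosen}")
```

Preferring the smallest size on the plateau keeps memory use lower for the same throughput; a looser tolerance trades a little speed for more VRAM headroom.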
    

Verification

python tune_batch_size.py
# Expected output: Throughput increasing with batch size until plateau or OOM
# Example: Batch  1: 45.2 ms/sample, 22 samples/sec
# Example: Batch 16: 12.8 ms/sample, 78 samples/sec
# Example: Batch 32: 13.1 ms/sample, 76 samples/sec (throughput has plateaued)
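
To compare runs, the printed lines can be read back into numbers. This is a minimal sketch that assumes the output keeps the exact format produced by the print statement in step 2; the helper name `parse_results` is hypothetical.

```python
import re

# Matches lines such as "Batch 16: 12.8 ms/sample, 78 samples/sec".
LINE_RE = re.compile(r"Batch\s+(\d+):\s+([\d.]+) ms/sample,\s+(\d+) samples/sec")


def parse_results(output):
    """Map batch size -> (ms per sample, samples per second)."""
    results = {}
    for match in LINE_RE.finditer(output):
        batch_size = int(match.group(1))
        results[batch_size] = (float(match.group(2)), int(match.group(3)))
    return results


# The example figures from the expected output above.
sample = """Batch  1: 45.2 ms/sample, 22 samples/sec
Batch 16: 12.8 ms/sample, 78 samples/sec
Batch 32: 13.1 ms/sample, 76 samples/sec"""

parsed = parse_results(sample)
speedup = parsed[16][1] / parsed[1][1]  # 78 / 22
print(f"Batch 16 vs batch 1: {speedup:.1f}x")
```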

Common failures

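  • No throughput gain at larger batches: The test script sends one HTTP request per text, and Ollama's embedding endpoint may process requests sequentially, so batch size has little effect. Use sentence-transformers natively for true batching.
  • Memory leak in long tests: Unload the model between runs with ollama stop all-minilm (ollama stop takes a model name and does not stop the server), or restart the Ollama service itself, for example with systemctl restart ollama on a systemd-based Linux install.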
  • Contention with other GPU workloads: Run batch tuning in isolation for accurate measurements.
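
For the first failure above, the measurement only means something if each chunk reaches the backend in a single call. The sketch below shows one way to structure such a sweep; `sweep_batch_sizes` and `fake_embed` are hypothetical names, and `embed_fn` is assumed to be any callable that takes a list of texts and returns one vector per text. The stand-in backend exists only so the example runs without a model, so its timings are not meaningful.

```python
import time


def sweep_batch_sizes(embed_fn, texts, batch_sizes):
    """Time embed_fn over texts for each batch size.

    Returns {batch_size: texts per second}. embed_fn receives a whole
    chunk at once, so the backend is free to batch it internally.
    """
    results = {}
    for batch_size in batch_sizes:
        vectors = []
        start = time.perf_counter()
        for i in range(0, len(texts), batch_size):
            # One call per chunk, not one call per text.
            vectors.extend(embed_fn(texts[i:i + batch_size]))
        elapsed = time.perf_counter() - start
        if len(vectors) != len(texts):
            raise RuntimeError("backend returned the wrong number of vectors")
        results[batch_size] = len(texts) / elapsed if elapsed > 0 else float("inf")
    return results


# Stand-in backend so the sketch runs anywhere: one-dimensional "embeddings".
def fake_embed(batch):
    return [[float(len(text))] for text in batch]


results = sweep_batch_sizes(fake_embed, ["Sample text"] * 100, [1, 8, 32])
for batch_size, rate in results.items():
    print(f"Batch {batch_size:2d}: {rate:.0f} samples/sec")
```

With sentence-transformers installed, `embed_fn` could wrap a loaded model, for instance `lambda batch: model.encode(batch, batch_size=len(batch))`, where `model` is a `SentenceTransformer` instance you have created yourself.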

Related guides

  • How to run embedding models for semantic search
  • How to compare embedding model performance for your use case