Batch Inference
Definition pending
We've cataloged "Batch Inference" but haven't written a full definition yet. Definitions are hand-curated rather than auto-generated, so it takes time to cover every term.
Practical example
Batch inference processes many inputs together offline, for example running 10,000 product descriptions through a summarization model overnight. It is usually the most cost-efficient mode: larger batches keep the GPU busier, which raises throughput, and because nothing is waiting on an immediate response you can run the job on cheaper spot instances. Trade-off: results aren't available in real time.
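To make the idea concrete, here is a minimal sketch of grouping inputs into fixed-size batches so the model is invoked once per batch rather than once per input. The names `chunked`, `run_batched`, and `fake_summarize_batch` are hypothetical, the batch size of 64 is an arbitrary assumption, and the "model" is a stand-in that keeps the first five words of each text.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_summarize_batch(texts: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for a model call: keep the first five words."""
    return [" ".join(text.split()[:5]) for text in texts]


def run_batched(
    items: Sequence[str],
    batch_size: int,
    model_fn: Callable[[Sequence[str]], List[str]],
) -> Tuple[List[str], int]:
    """Run model_fn over items batch by batch; return outputs and call count."""
    outputs: List[str] = []
    calls = 0
    for batch in chunked(items, batch_size):
        outputs.extend(model_fn(batch))
        calls += 1
    return outputs, calls


descriptions = [
    f"Product {i} is a durable everyday item with many uses"
    for i in range(10_000)
]
summaries, calls = run_batched(descriptions, 64, fake_summarize_batch)
print(len(summaries), calls)  # 10000 157
```

With 10,000 inputs and a batch size of 64, the model function is called 157 times instead of 10,000. A real model would replace the stand-in, and the right batch size would depend on the model and available GPU memory.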
Workflow example
Batch inference setup:
(1) Collect inputs in a queue, such as an S3 bucket or a database table.
(2) Run the batch job: load the model, feed all inputs through it, and write the outputs.
(3) For LLMs, use vLLM's offline batched mode, which can give 10–100× higher throughput than sending requests one at a time to a real-time API; the actual gain depends on the model, hardware, and prompt lengths.
(4) Schedule the job for off-peak hours (nights, weekends), when compute is cheapest.
(5) Prefer batch inference wherever results aren't needed instantly, such as catalog enrichment or daily summaries.
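Steps (1) and (2) can be sketched as a small job that reads queued inputs from a JSONL file, runs them through a model in batches, and writes one result per input. This is only an illustration under stated assumptions: the local files stand in for an S3 bucket or database table, the record fields `id` and `text` are invented for the example, and `echo_upper` is a hypothetical placeholder for a real model call.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


def run_batch_job(
    input_path: Path,
    output_path: Path,
    generate: Callable[[Sequence[str]], List[str]],
    batch_size: int = 32,
) -> int:
    """Read JSONL inputs, run them in batches, write JSONL outputs.

    Returns the number of records written.
    """
    lines = input_path.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    written = 0
    with output_path.open("w", encoding="utf-8") as out:
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            outputs = generate([record["text"] for record in batch])
            # Keep each output paired with the id of the input that produced it.
            for record, output in zip(batch, outputs):
                out.write(json.dumps({"id": record["id"], "output": output}) + "\n")
                written += 1
    return written


def echo_upper(prompts: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for a model: upper-cases each prompt."""
    return [prompt.upper() for prompt in prompts]


# Demo: the "queue" is a temporary local file (assumption for illustration).
workdir = Path(tempfile.mkdtemp())
inputs_file = workdir / "inputs.jsonl"
outputs_file = workdir / "outputs.jsonl"
inputs_file.write_text(
    "\n".join(json.dumps({"id": i, "text": f"item {i}"}) for i in range(100)) + "\n",
    encoding="utf-8",
)

count = run_batch_job(inputs_file, outputs_file, echo_upper, batch_size=32)
results = [
    json.loads(line)
    for line in outputs_file.read_text(encoding="utf-8").splitlines()
]
print(count, results[0])
```

For step (3), the `generate` argument could wrap vLLM's offline interface, where an `LLM` object's `generate` method takes a list of prompts and sampling parameters. Scheduling the script for off-peak hours (step 4) is left to whatever scheduler is already in use, such as cron.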