RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Training & optimization

Batch Size

Batch size is the number of training samples processed together in one forward and backward pass. In local AI training, it largely determines how much VRAM is needed beyond the model itself: each sample's activations must be held in memory until the backward pass, so activation memory grows roughly linearly with batch size. Larger batch sizes improve GPU utilization and give smoother gradient estimates but require more VRAM; smaller batch sizes fit on consumer hardware but produce noisier gradients and can lengthen training. Batch size is a key hyperparameter that trades off memory, speed, and model convergence.

Deeper dive

During training, the model computes gradients from a batch of samples, then updates weights. With a batch size of 1 (pure stochastic gradient descent), gradients are noisy but memory use is minimal. Larger batches (e.g., 32, 64) smooth gradients and allow matrix operations to saturate GPU cores, but each sample's intermediate values (activations) occupy VRAM. For a model like Llama 3.1 8B, the weights alone take ~16 GB in 16-bit precision (FP16/BF16) and ~32 GB in FP32, before any activations, gradients, or optimizer states; doubling the batch size roughly doubles activation memory while the weight footprint stays fixed. Operators finetuning on a 24 GB RTX 4090 often use batch sizes of 1-4 with gradient accumulation to simulate larger batches without exceeding VRAM. In inference, batch size is the number of prompts processed concurrently; vLLM and llama.cpp support continuous (dynamic) batching to maximize throughput.
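The weight footprint behind figures like these is simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate only. It assumes a round 8e9 parameters for an "8B" model and decimal gigabytes, and the function names and the per-sample activation figure are hypothetical placeholders, since real activation memory depends on sequence length, architecture, and checkpointing settings.

```python
# Rough VRAM arithmetic for model weights and activations.
# Assumptions: 8e9 parameters as a round figure for an "8B" model,
# 1 GB = 1e9 bytes, and a made-up per-sample activation cost.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight memory in decimal gigabytes for a given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def activation_memory_gb(per_sample_gb: float, batch_size: int) -> float:
    """Activation memory, assuming it scales linearly with batch size."""
    return per_sample_gb * batch_size


if __name__ == "__main__":
    for dtype in ("fp32", "bf16", "int8"):
        print(dtype, weight_memory_gb(8e9, dtype), "GB for weights")
    # Hypothetical 1.5 GB of activations per sample:
    for batch_size in (1, 2, 4):
        print(batch_size, activation_memory_gb(1.5, batch_size), "GB for activations")
```

The weight term does not change with batch size; only the activation term does, which is why reducing batch size is the first lever to pull when a run is close to the VRAM limit.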

Practical example

Finetuning Llama 3.1 8B with LoRA on an RTX 4090 (24 GB VRAM): a batch size of 1 consumes ~18 GB VRAM (16-bit model weights + activations + LoRA gradients and optimizer states). Increasing batch size to 4 multiplies activation memory by about four and, depending on sequence length, can exceed 24 GB, causing out-of-memory errors. Operators instead set batch size=1 and gradient_accumulation_steps=4, which processes 4 samples sequentially and accumulates their gradients before a single weight update. This gives the gradient of batch size=4 while only one sample's activations are in memory at a time.
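To see why gradient accumulation reproduces the larger batch, here is a minimal sketch in plain Python. It uses a toy one-parameter model (y_hat = w * x) with mean squared error rather than a real LLM, and the function names and data are illustrative assumptions. Each micro-batch gradient is scaled by the number of accumulation steps before being summed, which mirrors the loss scaling that training frameworks typically apply.

```python
# Toy demonstration: accumulating gradients over micro-batches matches
# the gradient of one large batch. Model: y_hat = w * x, loss: MSE.

def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Process micro-batches one at a time, summing their scaled gradients."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("data length must be a multiple of micro_batch_size")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Divide by the step count so the sum is a mean over all samples.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mse_grad(w, xs, ys)                               # batch size 4
accum = accumulated_grad(w, xs, ys, micro_batch_size=1)  # 4 steps of size 1
print(full, accum)
```

The two values agree because the batch gradient is just the average of per-sample gradients. What accumulation does not reproduce is the speed of a true large batch: the four samples are processed one after another, so each optimizer step takes roughly four times as long.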

Workflow example

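In Hugging Face Transformers training scripts, batch size is set via per_device_train_batch_size. For example, python train.py --per_device_train_batch_size 2 --gradient_accumulation_steps 8 processes 2 samples per forward/backward pass on each GPU and accumulates over 8 passes, for an effective batch size of 16 on a single GPU (multiply by the number of GPUs in multi-GPU runs). In llama.cpp's finetune command, --batch-size 4 controls the number of samples per iteration; options vary between llama.cpp releases, so confirm against --help for your build. In vLLM inference, --max-num-batched-tokens caps the total tokens processed per scheduler step and --max-num-seqs caps the number of concurrent sequences; together they determine how many requests are batched.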
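The effective batch size arithmetic can be sketched as a pair of small helpers. These are hypothetical functions written for illustration, not part of Hugging Face Transformers; the argument names only echo the per_device_train_batch_size and gradient_accumulation_steps settings.

```python
import math


def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Samples contributing to each optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def accumulation_steps_for(target_batch_size, per_device_batch_size, num_devices=1):
    """Smallest accumulation step count that reaches the target batch size."""
    per_pass = per_device_batch_size * num_devices
    return math.ceil(target_batch_size / per_pass)


if __name__ == "__main__":
    # The single-GPU example from the text: 2 samples x 8 accumulation steps.
    print(effective_batch_size(2, 8))
    # How many accumulation steps does a batch-size-1 run need to reach 16?
    print(accumulation_steps_for(16, 1))
```

A common approach is to pick the largest per-device batch size that fits in VRAM first, then use the second helper to choose the accumulation steps needed to reach the batch size a recipe calls for.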

Reviewed by Fredoline Eruo. See our editorial policy.
