Batch Size
Batch size is the number of training samples processed together in one forward and backward pass. In local AI training, it largely determines VRAM usage beyond the model itself: the activations for every sample in the batch must be held in memory at once until the backward pass has used them. Larger batch sizes improve GPU utilization and give smoother gradient estimates but require more VRAM; smaller batch sizes fit on consumer hardware but produce noisier gradients and can lengthen training. Batch size is a key hyperparameter that trades off memory, speed, and model convergence.
Deeper dive
During training, the model computes gradients from a batch of samples, then updates weights. With a batch size of 1 (classic stochastic gradient descent), gradients are noisy but memory use is minimal. Larger batches (e.g., 32, 64) smooth gradients and allow matrix operations to saturate GPU cores, but each sample's intermediate values (activations) occupy VRAM. For a model like Llama 3.1 8B, the weights alone occupy roughly 16 GB in 16-bit precision (about 32 GB in 32-bit) before any activations are stored; activation memory then grows roughly linearly with batch size, so doubling the batch size roughly doubles it. Operators finetuning on a 24 GB RTX 4090 often use batch sizes of 1-4 with gradient accumulation to simulate larger batches without exceeding VRAM. In inference, batch size is the number of prompts processed concurrently; vLLM and llama.cpp support dynamic (continuous) batching to maximize throughput.
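The memory arithmetic above can be sketched in a few lines. This is a rough estimate, not a profiler: the 8e9 parameter count is a round figure, and the per-sample activation cost (activation_gb_per_sample) is a made-up placeholder, since the real value depends on sequence length, precision, and checkpointing settings.

```python
# Rough VRAM estimate: a fixed cost for the weights plus activation memory
# that grows linearly with batch size. All constants are illustrative.

GB = 1e9  # decimal gigabytes


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the model weights alone."""
    return num_params * bytes_per_param / GB


def estimate_vram_gb(num_params, bytes_per_param, batch_size,
                     activation_gb_per_sample):
    """Weights are a fixed cost; activations scale with the batch size."""
    weights = weight_memory_gb(num_params, bytes_per_param)
    activations = batch_size * activation_gb_per_sample
    return weights + activations


params = 8e9  # assumed round figure for an 8B-parameter model

print("16-bit weights (GB):", weight_memory_gb(params, 2))
print("32-bit weights (GB):", weight_memory_gb(params, 4))

for batch_size in (1, 2, 4):
    # 1.5 GB per sample is a hypothetical placeholder, not a measurement.
    total = estimate_vram_gb(params, 2, batch_size, 1.5)
    print(f"batch size {batch_size}: ~{total} GB")
```

The estimate ignores gradients, optimizer states, and framework overhead, so treat it as a lower bound when deciding whether a configuration will fit.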
Practical example
Finetuning Llama 3.1 8B with LoRA on an RTX 4090 (24 GB VRAM): using a batch size of 1 consumes ~18 GB VRAM (model weights + activations + optimizer states for the LoRA adapters). Increasing batch size to 4 can push usage past 24 GB, depending on sequence length, causing out-of-memory errors. Operators instead set batch size=1 and gradient_accumulation_steps=4, which processes 4 samples sequentially and accumulates their gradients before updating the weights. This achieves the effect of batch size=4 while holding only one sample's activations in VRAM at a time.
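To see why accumulation matches a larger batch, here is a minimal sketch on a toy one-parameter model (y = w * x with mean-squared-error loss). The data, the function names grad_mse and accumulated_grad, and the starting weight are all hypothetical; real frameworks do the same thing by summing scaled gradients in the parameter buffers before the optimizer step.

```python
# Gradient accumulation on a toy model y = w * x with MSE loss.
# Shows that four micro-batches of size 1, each scaled by 1/4, produce the
# same gradient as one batch of size 4.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, samples, accumulation_steps):
    """Process one sample at a time, scaling each gradient by 1/steps."""
    total = 0.0
    for sample in samples[:accumulation_steps]:
        total += grad_mse(w, [sample]) / accumulation_steps
    return total


samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]  # toy (x, y) pairs
w = 0.5

full_batch = grad_mse(w, samples)              # batch size 4, one pass
accumulated = accumulated_grad(w, samples, 4)  # batch size 1, four passes

print("full batch gradient: ", full_batch)
print("accumulated gradient:", accumulated)
```

The two values agree because the batch loss is a mean over samples. Only one sample's intermediate values are live at a time in the accumulated version, which is where the memory saving comes from; the cost is four sequential passes instead of one parallel pass.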
Workflow example
In Hugging Face Transformers training scripts, batch size is set via per_device_train_batch_size. For example, python train.py --per_device_train_batch_size 2 --gradient_accumulation_steps 8 processes 2 samples per forward/backward pass on each GPU and accumulates over 8 passes, for an effective batch size of 16 on a single GPU. In llama.cpp's finetune command, --batch-size 4 controls the number of samples per iteration; flag names have changed between llama.cpp versions, so confirm with the tool's --help output. In vLLM inference, --max-num-batched-tokens and --max-num-seqs together determine how many requests are batched.
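The effective batch size arithmetic can be written out as a small helper. This is a sketch: the first two parameter names mirror the Hugging Face argument names for readability, while num_devices and the function accumulation_steps_for are hypothetical additions for illustration and are not part of any library.

```python
import math


def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_devices=1):
    """Samples contributing to each optimizer update."""
    return (per_device_train_batch_size
            * gradient_accumulation_steps
            * num_devices)


def accumulation_steps_for(target_batch_size,
                           per_device_train_batch_size,
                           num_devices=1):
    """Smallest accumulation count that reaches the target effective batch."""
    per_pass = per_device_train_batch_size * num_devices
    return math.ceil(target_batch_size / per_pass)


# The command-line example from the text: 2 samples per pass, 8 passes.
print(effective_batch_size(2, 8))

# Working backwards: the largest micro-batch that fits is 1, the target is 16.
print(accumulation_steps_for(16, 1))
```

A common workflow is to find the largest per-device batch size that fits in VRAM first, then choose the accumulation count to reach the effective batch size the training recipe calls for.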
Reviewed by Fredoline Eruo. See our editorial policy.