Training & optimization

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used during model training to minimize the loss function. Unlike full-batch gradient descent, which computes the gradient over the entire dataset, SGD updates model weights using a single random training example (or a small batch) per iteration. Each gradient estimate is noisier, but each update is much cheaper, and the noise can help the optimizer move out of poor local minima. For local AI operators, SGD is available in most training frameworks but is rarely the default for fine-tuning: Hugging Face Transformers, for example, defaults to AdamW. Adaptive variants like AdamW are usually preferred because SGD requires careful learning rate tuning and often converges more slowly on modern architectures.

Deeper dive

SGD updates weights as θ ← θ - η ∇L(θ; x_i, y_i), where η is the learning rate and ∇L(θ; x_i, y_i) is the gradient of the loss computed on a single sample (x_i, y_i). The 'stochastic' aspect comes from selecting that sample at random each iteration, which introduces variance in the gradient estimates. This variance can help the optimizer escape sharp minima but may also cause the loss to oscillate. In practice, mini-batch SGD (batch sizes of roughly 8 to 128) averages the gradient over several samples to balance noise and stability. For local fine-tuning, operators often use AdamW because it adapts the learning rate per parameter and includes decoupled weight decay, which reduces sensitivity to hyperparameters. However, SGD with momentum (SGD+M) remains common for computer vision models and when training from scratch on large datasets. The key operator concern: SGD often needs more epochs to converge than adaptive methods, so training time on consumer GPUs can be significantly longer.
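The update rule can be made concrete with a minimal sketch in plain Python. It fits a line to toy data by mini-batch SGD, with an optional momentum term. The function name, the toy data, and the hyperparameter values are illustrative assumptions, not taken from any particular training script.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=8, epochs=200, momentum=0.0, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on mean squared error (illustrative sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0  # momentum buffers (velocity), unused when momentum=0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # the "stochastic" part: random sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over this mini-batch only
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # With momentum=0 this reduces to theta <- theta - lr * grad
            vw = momentum * vw + gw
            vb = momentum * vb + gb
            w -= lr * vw
            b -= lr * vb
    return w, b


# Toy data drawn from y = 3x + 1 with a little noise (hypothetical example data)
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + 1 + data_rng.gauss(0, 0.05) for x in xs]

w, b = sgd_linear_fit(xs, ys)
print(f"plain SGD:     w={w:.2f}, b={b:.2f}")

w_m, b_m = sgd_linear_fit(xs, ys, momentum=0.9)
print(f"SGD+momentum:  w={w_m:.2f}, b={b_m:.2f}")
```

Both runs should land close to w = 3 and b = 1. Setting batch_size=1 gives the single-sample form of the rule, and setting it to the dataset size recovers full-batch gradient descent.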

Practical example

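As an illustrative scenario, fine-tuning a 7B parameter model on an RTX 4090 (24 GB VRAM) with SGD might use a batch size of 1 (due to memory limits) and a learning rate of 1e-5. Note that 16-bit weights plus gradients for 7B parameters already exceed 24 GB, so a run like this only fits with memory-saving techniques such as parameter-efficient fine-tuning. Training for 3 epochs on a 10k sample dataset could take 12 hours, whereas AdamW might reach a similar final loss in 1 epoch (4 hours). The operator must monitor loss curves to catch divergence early, because SGD is more sensitive to the choice of learning rate.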
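The arithmetic behind these figures is worth spelling out. The sketch below converts the scenario into optimizer steps and seconds per step. It assumes the AdamW run also uses a batch size of 1, which the scenario does not state; the helper name is hypothetical.

```python
def seconds_per_step(num_samples, batch_size, epochs, total_hours):
    """Return (total optimizer steps, wall-clock seconds per step)."""
    steps_per_epoch = -(-num_samples // batch_size)  # ceiling division
    total_steps = steps_per_epoch * epochs
    return total_steps, total_hours * 3600 / total_steps


# SGD: 10k samples, batch size 1, 3 epochs, 12 hours
sgd_steps, sgd_rate = seconds_per_step(10_000, 1, 3, 12)
# AdamW: same data, 1 epoch, 4 hours (batch size 1 is an assumption)
adamw_steps, adamw_rate = seconds_per_step(10_000, 1, 1, 4)

print(sgd_steps, round(sgd_rate, 2))      # 30000 1.44
print(adamw_steps, round(adamw_rate, 2))  # 10000 1.44
```

Under these assumptions both optimizers cost the same per step, so the 8-hour gap in the scenario comes entirely from the number of epochs needed, not from the speed of each update.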

Workflow example

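In Hugging Face Transformers, setting optim='sgd' in TrainingArguments selects SGD in place of the default AdamW. For example: TrainingArguments(per_device_train_batch_size=1, learning_rate=1e-5, optim='sgd', num_train_epochs=3). During training, the per-step loss will typically fluctuate more than with AdamW. TrainingArguments does not expose a dedicated momentum setting, so to add momentum operators usually construct torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9) themselves and pass it to Trainer through its optimizers argument. llama.cpp is primarily an inference runtime, so fine-tuning is normally done in a separate PyTorch-based script, where SGD is one of the available optimizers.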
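A hedged sketch of this workflow follows. It assumes recent versions of transformers and torch are installed; the output directory name and both helper functions are hypothetical, and model and train_dataset must be supplied by the caller. The imports are deferred so the settings can be inspected without loading either library.

```python
def sgd_training_kwargs():
    """Keyword arguments for transformers.TrainingArguments, mirroring the example above."""
    return {
        "output_dir": "sgd-finetune-out",  # hypothetical path, not from the source
        "per_device_train_batch_size": 1,
        "learning_rate": 1e-5,
        "num_train_epochs": 3,
        "optim": "sgd",  # plain SGD, no momentum
    }


def build_trainer(model, train_dataset, momentum=0.9):
    """Build a Trainer that uses SGD with momentum.

    `model` and `train_dataset` are assumed to be an already loaded
    transformers model and a tokenized dataset.
    """
    # Imported lazily so the kwargs above can be inspected without the libraries.
    import torch
    from transformers import Trainer, TrainingArguments

    kwargs = sgd_training_kwargs()
    args = TrainingArguments(**kwargs)
    optimizer = torch.optim.SGD(
        model.parameters(), lr=kwargs["learning_rate"], momentum=momentum
    )
    # An explicitly passed optimizer is used instead of one built from `optim`;
    # None lets Trainer create its default learning rate scheduler.
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        optimizers=(optimizer, None),
    )


settings = sgd_training_kwargs()
print(settings["optim"], settings["learning_rate"])
```

Calling build_trainer(model, train_dataset).train() would then start the run. Passing the optimizer explicitly is one way to set momentum; other routes may exist depending on the library version.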

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work