RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Training & optimization

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used during model training to minimize the loss function. Unlike full-batch gradient descent, which computes the gradient over the entire dataset, SGD updates model weights using a single random training example (or a small batch) per iteration. This introduces noise into each update but makes every update much cheaper, and the noise can help the optimizer move out of poor local minima. For local AI operators, SGD is available in most training frameworks, but fine-tuning scripts (e.g., Hugging Face Transformers) usually default to adaptive variants like AdamW, because SGD requires careful learning rate tuning and may converge more slowly on modern architectures.

Deeper dive

SGD updates weights as θ ← θ - η * ∇L(θ; x_i, y_i), where θ is the vector of model weights, η is the learning rate, and ∇L is the gradient of the loss computed on a single sample (x_i, y_i). The 'stochastic' aspect comes from random sample selection each iteration, which introduces variance in the gradient estimates. This variance can help the optimizer escape sharp minima but may also cause the loss to oscillate. In practice, mini-batch SGD (batch size 8-128) averages the gradient over several samples to balance noise and stability. For local fine-tuning, operators often use AdamW because it adapts learning rates per parameter and includes weight decay, reducing sensitivity to hyperparameters. However, SGD with momentum (SGD+M) remains common for computer vision models and when training from scratch on large datasets. The key operator concern: SGD often requires more epochs to converge than adaptive methods, so training time on consumer GPUs can be significantly longer.
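The update rule above can be sketched in plain Python on a toy problem. This is a minimal illustration, not a training recipe: the function name `sgd_fit`, the toy data, and all hyperparameter values are assumptions chosen for the example, and the momentum form (velocity accumulates raw gradients, then the step is scaled by the learning rate) follows the convention used by PyTorch's SGD.

```python
import random

def sgd_fit(xs, ys, lr=0.05, batch_size=8, epochs=200, momentum=0.0, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD on mean squared error (toy sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # parameters (theta)
    vw, vb = 0.0, 0.0    # momentum buffers; they equal the gradient when momentum == 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # the "stochastic" part: random sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Accumulate a running direction, then step along it: theta <- theta - lr * v.
            vw = momentum * vw + gw
            vb = momentum * vb + gb
            w -= lr * vw
            b -= lr * vb
    return w, b

# Hypothetical noise-free toy data drawn from y = 3x + 1.
xs = [i / 10 for i in range(-20, 21)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_fit(xs, ys)
print(w, b)
```

Setting `batch_size=1` gives the single-sample rule from the formula, and passing `momentum=0.9` turns the same loop into SGD+M.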

Practical example

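Fine-tuning a 7B-parameter model on an RTX 4090 (24 GB VRAM) with SGD might use a batch size of 1 (due to memory limits) and a learning rate of 1e-5. As an illustrative, unmeasured comparison: training for 3 epochs on a 10k-sample dataset could take 12 hours with SGD, whereas AdamW might reach a similar final loss in 1 epoch (4 hours). Plain SGD keeps no per-parameter optimizer state, while AdamW stores two extra values per parameter, so SGD has the smaller memory footprint. Even so, 16-bit weights and gradients for 7B parameters already total about 28 GB, so a full fine-tune on a 24 GB card needs memory-saving techniques such as parameter-efficient fine-tuning. The operator must monitor loss curves to catch divergence, because SGD is more sensitive to the choice of learning rate.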
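The memory side of this trade-off can be estimated with simple arithmetic. The sketch below is a rough lower bound under stated assumptions: every value (weights, gradients, optimizer buffers) is stored at the same width, 2 bytes by default, and activations, framework overhead, and any full-precision master copies are ignored. The function name `training_memory_gib` is hypothetical.

```python
def training_memory_gib(n_params, bytes_per_value=2, optimizer="sgd"):
    """Rough memory for weights + gradients + optimizer state, in GiB.

    Assumption: all values share one storage width. Activations and
    framework overhead are ignored, so real usage will be higher.
    """
    # Extra per-parameter buffers each optimizer keeps between steps:
    # plain SGD keeps none, momentum keeps one, AdamW keeps two moment estimates.
    state_buffers = {"sgd": 0, "sgd_momentum": 1, "adamw": 2}[optimizer]
    values = n_params * (1 + 1 + state_buffers)  # weights + gradients + state
    return values * bytes_per_value / 2**30

for name in ("sgd", "sgd_momentum", "adamw"):
    print(name, round(training_memory_gib(7e9, optimizer=name), 1), "GiB")
```

Under these assumptions even plain SGD exceeds 24 GiB for a 7B model, and AdamW needs roughly twice as much as plain SGD.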

Workflow example

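In Hugging Face Transformers, setting optim='sgd' in TrainingArguments selects SGD in place of the default AdamW. For example: TrainingArguments(per_device_train_batch_size=1, learning_rate=1e-5, optim='sgd', num_train_epochs=3). During training, the loss per step will typically fluctuate more than with AdamW. TrainingArguments does not expose a momentum setting, so to add momentum, operators can build the optimizer themselves, e.g. torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9), and pass it to Trainer through its optimizers argument. llama.cpp is primarily an inference runtime, so training and fine-tuning are normally done in a framework such as PyTorch, where SGD is available, and the resulting weights are converted for inference afterward.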
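A minimal configuration sketch of the momentum setup follows, assuming `torch` and `transformers` are installed. The helper name `build_sgd_trainer` and the `output_dir` value are hypothetical, and `model` and `train_dataset` are placeholders the caller supplies. The hyperparameters mirror the example above and are not tuned recommendations.

```python
import torch
from transformers import Trainer, TrainingArguments

def build_sgd_trainer(model, train_dataset, output_dir="sgd-run"):
    """Wire SGD with momentum into a Hugging Face Trainer (sketch).

    `model` and `train_dataset` are placeholders supplied by the caller.
    """
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,
        learning_rate=1e-5,
        num_train_epochs=3,
    )
    # Assumption: momentum is not a TrainingArguments field, so the
    # optimizer is constructed directly with PyTorch.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=args.learning_rate, momentum=0.9
    )
    # Passing None for the scheduler lets Trainer create its default schedule.
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        optimizers=(optimizer, None),
    )
```

Calling `build_sgd_trainer(model, train_dataset).train()` would then run the fine-tune. When an optimizer is passed this way, it is used instead of whatever the `optim` setting would have created.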

Reviewed by Fredoline Eruo. See our editorial policy.
