RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Training & optimization

AdamW

AdamW is an optimization algorithm used when training or fine-tuning neural networks, including LLMs. It modifies the standard Adam optimizer by decoupling weight decay from the gradient-based update: the decay is applied directly to the parameters instead of being added to the gradient, where Adam's adaptive per-parameter scaling would distort it. This tends to improve generalization and training stability, especially for large models. Operators encounter AdamW when fine-tuning models with libraries like Hugging Face Transformers or Unsloth, where it is the default optimizer for most LLM training scripts.

Deeper dive

AdamW (Adam with Decoupled Weight Decay) was introduced by Loshchilov & Hutter in 2017 to fix a flaw in how weight decay was commonly combined with Adam. In that setup, weight decay is implemented as L2 regularization: the decay term is added to the gradient, so it passes through Adam's adaptive scaling and parameters with large gradient history end up regularized less than intended. AdamW separates the two: the gradient update uses Adam's adaptive moments, while weight decay is applied directly to the weights at a fixed rate (scaled only by the learning rate), outside the adaptive step. This decoupling makes hyperparameter tuning more predictable and often yields better validation loss.

In practice, AdamW is the standard optimizer for training transformer-based models, including LLMs. Operators fine-tuning models with Hugging Face's Trainer or Unsloth will see AdamW as the default, with key hyperparameters like learning rate (e.g., 1e-5 to 5e-5) and weight decay (e.g., 0.01 to 0.1). The optimizer stores two moment values per trainable parameter, so its memory footprint is roughly twice the size of the trainable weights when the moments use the same precision, and about four times when 32-bit moments are kept alongside 16-bit weights. This matters for VRAM-constrained rigs.
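The difference between the two update rules can be sketched in a few lines of plain Python. This is a minimal single-parameter illustration, not any library's implementation; the function names `adam_l2_step` and `adamw_step` are hypothetical, and the decay term is scaled by the learning rate, which is the convention PyTorch's `torch.optim.AdamW` follows.

```python
import math

def adam_l2_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    g = g + wd * w                         # decay enters the gradient...
    m = b1 * m + (1 - b1) * g              # ...and therefore both moments
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW: moments see only the raw gradient; decay acts on the weight directly."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# One step from w = 1.0 with a zero loss gradient, so only the decay acts.
w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
w_aw, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
print(f"Adam + L2: {w_l2:.6f}")
print(f"AdamW:     {w_aw:.6f}")
```

With a zero gradient, the L2 variant still moves the weight by roughly a full learning-rate step, because the adaptive normalization rescales the decay term. AdamW shrinks the weight by only `lr * wd * w`, which is the behavior the weight-decay hyperparameter is meant to control.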

Practical example

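Take Llama 3.1 8B on a single RTX 4090 (24 GB VRAM) with Hugging Face Transformers, where the Trainer defaults to AdamW. At 16-bit precision the weights alone take ~16 GB and the gradients another ~16 GB. The two moment buffers add ~32 GB if kept in 16-bit, or ~64 GB in 32-bit, so full fine-tuning does not fit on this card even with batch size 1 and gradient accumulation. Switching to 8-bit AdamW (bitsandbytes) shrinks the optimizer states to ~16 GB (about 1 byte per parameter per moment), half of the 16-bit figure, but weights and gradients still exceed 24 GB. On a 24 GB rig, operators therefore typically pair 8-bit AdamW with a parameter-efficient method such as LoRA or QLoRA, so that AdamW only stores moments for the small set of trainable weights and the freed VRAM goes to larger batch sizes or longer sequences.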
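A back-of-envelope estimate for full fine-tuning can be written as a small helper. This is a rough sketch under stated assumptions: the parameter count is rounded to 8e9, 1 GB is taken as 1e9 bytes, activations and framework overhead are ignored, and the function name `adamw_memory_gb` is hypothetical.

```python
def adamw_memory_gb(n_params, weight_bytes=2, grad_bytes=2, moment_bytes=4):
    """Estimate VRAM (GB) for weights, gradients, and AdamW's two moment buffers."""
    gb = 1e9
    return {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer": 2 * n_params * moment_bytes / gb,  # first + second moment
    }

N_PARAMS = 8e9  # assumption: ~8B parameters, rounded

fp32_states = adamw_memory_gb(N_PARAMS, moment_bytes=4)
bf16_states = adamw_memory_gb(N_PARAMS, moment_bytes=2)
int8_states = adamw_memory_gb(N_PARAMS, moment_bytes=1)

for label, est in [("32-bit moments", fp32_states),
                   ("16-bit moments", bf16_states),
                   ("8-bit moments", int8_states)]:
    total = sum(est.values())
    print(f"{label}: optimizer {est['optimizer']:.0f} GB, total {total:.0f} GB")
```

Even with 8-bit moments the total stays well above 24 GB, which is why the optimizer choice alone does not make full fine-tuning of an 8B model fit on a single consumer card.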

Workflow example

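In a typical fine-tuning script using Hugging Face Transformers, the optimizer is configured via TrainingArguments: the optim argument selects the AdamW variant, and adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8, weight_decay=0.01 set its hyperparameters. Note that weight_decay defaults to 0.0 in TrainingArguments, so it must be set explicitly to get any decay. The Trainer then instantiates AdamW internally. Operators using Unsloth can set optim='adamw_8bit' to use 8-bit AdamW (this requires bitsandbytes), reducing VRAM usage. llama.cpp's experimental training examples also use an Adam-family optimizer with similar defaults, but the command-line flags for selecting and tuning it have changed between versions, so check the --help output of your build.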
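The settings above might be collected as follows. This is a minimal sketch, assuming `transformers` (and `bitsandbytes` for the 8-bit variant) is installed; the output path, learning rate, and batch settings are illustrative placeholders rather than recommendations, and `build_training_args` is a hypothetical helper name.

```python
# Keyword arguments for transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "out/adamw-demo",      # hypothetical path
    "optim": "adamw_8bit",               # 8-bit AdamW; needs bitsandbytes at train time
    "learning_rate": 2e-5,               # illustrative value
    "weight_decay": 0.01,                # TrainingArguments defaults this to 0.0
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "per_device_train_batch_size": 1,    # illustrative value
    "gradient_accumulation_steps": 8,    # illustrative value
}

def build_training_args():
    """Build TrainingArguments lazily so this file imports without transformers."""
    from transformers import TrainingArguments
    return TrainingArguments(**training_kwargs)
```

The resulting object is passed to `Trainer(args=...)`, which creates the optimizer itself. Keeping the values in a plain dict makes it easy to log or diff them between runs.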

Reviewed by Fredoline Eruo. See our editorial policy.
