Training & optimization

AdamW

AdamW is an optimization algorithm used when training or fine-tuning neural networks, including LLMs. It modifies the standard Adam optimizer by decoupling weight decay from the gradient-based update: decay is applied directly to the parameters instead of being added to the gradient, where Adam's adaptive scaling would distort it. This tends to improve generalization and makes weight decay easier to tune, especially for large models. Operators encounter AdamW when fine-tuning models with libraries like Hugging Face Transformers or Unsloth, where it is the default optimizer in most LLM training scripts.

Deeper dive

AdamW (Adam with Decoupled Weight Decay) was introduced by Loshchilov & Hutter in 2017 to fix a flaw in how weight decay is commonly combined with Adam. With Adam, weight decay is usually implemented as L2 regularization: the decay term is added to the gradient, so it gets rescaled by the adaptive per-parameter step size, and parameters with large gradient history end up regularized less than intended. AdamW separates the two: the gradient update uses Adam's adaptive moments, while weight decay shrinks the weights directly in the same step, outside the adaptive scaling. This decoupling makes hyperparameter tuning more predictable and often yields better validation loss.

In practice, AdamW is the standard optimizer for training transformer-based models, including LLMs. Operators fine-tuning models with Hugging Face's Trainer or Unsloth will see AdamW as the default, with key hyperparameters like learning rate (e.g., 1e-5 to 5e-5) and weight decay (e.g., 0.01 to 0.1). AdamW stores two moment buffers per trainable parameter, so its state is roughly twice the size of the trainable weights when kept at the same precision (and about four times the size of 16-bit weights when the moments are kept in 32-bit), which matters for VRAM-constrained rigs.
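
The difference between the two schemes can be sketched on a single scalar weight. This is a minimal illustration, not any library's implementation: the function names `adam_l2_step` and `adamw_step` are hypothetical, and the decay term is scaled by the learning rate, which is an assumption that mirrors the convention PyTorch uses.

```python
import math


def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
    """One Adam step with L2 regularization: decay is folded into the gradient."""
    g = grad + weight_decay * w          # decay term enters the moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)         # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: moments see only the raw gradient; decay acts on w directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * weight_decay * w        # assumption: decay scaled by lr
    return w - adaptive - decay, m, v


# A zero gradient isolates the effect of weight decay on the weight.
w_l2, _, _ = adam_l2_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1)
w_adamw, _, _ = adamw_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(w_l2, w_adamw)
```

With a zero gradient, the AdamW step shrinks the weight by lr × weight_decay, while the L2 variant moves it by roughly the full learning rate, because the adaptive normalization cancels the magnitude of the decay term. That is the interaction the decoupling removes.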

Practical example

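When fine-tuning Llama 3.1 8B on a single RTX 4090 (24 GB VRAM) using Hugging Face Transformers, the Trainer defaults to AdamW. For full fine-tuning, the optimizer states (two moment buffers per parameter) take about 32 GB at 16-bit precision (2× the ~16 GB of 16-bit model weights) and about 64 GB at 32-bit, so they exceed the card's VRAM before weights, gradients, and activations are counted. Switching to 8-bit AdamW (bitsandbytes) shrinks the states to roughly 16 GB, half the 16-bit figure, which still does not fit alongside ~16 GB of weights. On a 24 GB card, operators therefore usually train only a small subset of parameters (for example LoRA adapters), where the optimizer state is proportionally small, and use 8-bit AdamW to free further VRAM for larger batch sizes or longer sequences.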
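
A back-of-the-envelope estimate of AdamW state size can be sketched as follows. It assumes roughly 8 billion trainable parameters and exactly two moment buffers per parameter, and it ignores quantization constants, gradients, weights, and activations. The helper name `optimizer_state_gb` is hypothetical.

```python
def optimizer_state_gb(num_params, bytes_per_value, num_buffers=2):
    """Approximate optimizer state size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_value * num_buffers / 1e9


NUM_PARAMS = 8_000_000_000  # assumption: ~8B trainable parameters (full fine-tuning)

state_fp32 = optimizer_state_gb(NUM_PARAMS, 4)   # moments stored in 32-bit
state_16bit = optimizer_state_gb(NUM_PARAMS, 2)  # moments stored in 16-bit
state_8bit = optimizer_state_gb(NUM_PARAMS, 1)   # moments stored in 8-bit

for label, gb in [("32-bit", state_fp32), ("16-bit", state_16bit), ("8-bit", state_8bit)]:
    print(f"{label} AdamW state: ~{gb:.0f} GB")
```

Passing a much smaller `num_params` (for instance, only the adapter weights in a parameter-efficient setup) shows why optimizer memory stops being the bottleneck when few parameters are trainable.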

Workflow example

In a typical fine-tuning script using Hugging Face Transformers, the optimizer is configured through TrainingArguments, for example adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8, weight_decay=0.01. The Trainer then instantiates AdamW internally, and the optim argument selects which implementation is used. Operators using Unsloth can set optim='adamw_8bit' to use 8-bit AdamW and reduce VRAM usage. llama.cpp's experimental training mode also uses AdamW with similar defaults; its optimizer options have varied between versions, so confirm the exact flag (such as --adamw) against the --help output of your build.
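
The settings above could be collected as keyword arguments like this. It is a sketch under stated assumptions: the output_dir path and the learning rate are illustrative values, not recommendations from this article, and the Transformers import is left as a comment so the snippet runs without the library installed.

```python
# Hyperparameters named in the text, gathered as TrainingArguments keyword arguments.
training_kwargs = {
    "output_dir": "./adamw-finetune",  # assumption: hypothetical output path
    "optim": "adamw_torch",            # PyTorch AdamW implementation
    "learning_rate": 2e-5,             # assumption: within the 1e-5 to 5e-5 range
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}

# Same settings, but selecting 8-bit AdamW (needs bitsandbytes at training time).
eight_bit_kwargs = {**training_kwargs, "optim": "adamw_8bit"}

# With Transformers installed, the dict can be unpacked into the config object:
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```

Keeping the values in a plain dict makes it easy to switch between the standard and 8-bit optimizer without touching the rest of the script.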

Reviewed by Fredoline Eruo. See our editorial policy.
