Training & optimization

Adam Optimizer

Adam (Adaptive Moment Estimation) is an optimizer that adjusts the learning rate for each parameter during training. It combines momentum (a moving average of gradients) with RMSProp-style scaling (a moving average of squared gradients), which helps with sparse gradients and noisy data. In local AI, operators encounter Adam when fine-tuning models with LoRA or full fine-tuning scripts: its AdamW variant is the default optimizer in the Hugging Face Transformers Trainer, and Adam is a common choice in MLX fine-tuning scripts. Adam stores two state buffers per trainable parameter (the first and second moment estimates), roughly doubling the memory needed for weights and gradients compared with plain SGD, which matters on consumer GPUs with limited VRAM.

Deeper dive

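Adam computes adaptive learning rates from estimates of the first and second moments of the gradients. The update rule is θ_{t+1} = θ_t - α * m̂_t / (√v̂_t + ε), where m̂_t = m_t / (1 - β1^t) and v̂_t = v_t / (1 - β2^t) are the bias-corrected moving averages of the gradient and the squared gradient. Key hyperparameters: β1 (default 0.9) controls the decay of the gradient average, β2 (default 0.999) controls the decay of the squared-gradient average, and ε (default 1e-8) prevents division by zero. Variants like AdamW decouple weight decay from the adaptive update, which often improves generalization.

For operators, Adam's memory footprint is significant: each trainable parameter needs two additional buffers (m and v), normally stored in float32. A 7B model in full precision has 28 GB of weights, so its optimizer states alone take about 56 GB (7B × 2 buffers × 4 bytes). Standard mixed-precision training usually keeps these states in float32, so it does not shrink them on its own. 8-bit Adam (bitsandbytes) stores each state in roughly one byte per parameter, about 14 GB for a 7B model. Even that leaves no room for full fine-tuning of a 7B model on a 24 GB card once weights and gradients are counted, which is why parameter-efficient methods such as LoRA, where the optimizer only tracks the small set of adapter weights, are the usual route on consumer GPUs.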
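The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular library implements it: the function name adam_step, the list-of-floats representation, and the toy objective f(x) = sum of x_i^2 are all chosen here for clarity.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place. `t` is the 1-based step count.

    `params`, `grads`, `m`, and `v` are equal-length lists of floats
    (an illustrative stand-in for tensors).
    """
    for i, g in enumerate(grads):
        # Moving averages of the gradient and the squared gradient.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        # Per-parameter step: large recent gradients shrink the effective rate.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Toy problem (assumed for illustration): minimize f(x) = sum(x_i ** 2).
params = [1.0, -2.0]
m = [0.0, 0.0]  # first-moment buffer, one entry per parameter
v = [0.0, 0.0]  # second-moment buffer, one entry per parameter
grads = [2 * p for p in params]  # gradient of x ** 2 is 2x
adam_step(params, grads, m, v, t=1, lr=0.1)
print(params)
```

On the first step the bias-corrected ratio m̂ / √v̂ is close to the sign of the gradient, so each parameter moves by roughly lr regardless of gradient magnitude. The two lists m and v are the extra per-parameter state that drives Adam's memory cost.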

Practical example

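Fine-tuning Llama 3.1 8B on an RTX 4090 (24 GB VRAM). Full fine-tuning with AdamW and float32 optimizer states would need ~64 GB for those states alone (8B × 2 buffers × 4 bytes), which is impossible on this card. Switching to 8-bit Adam via bitsandbytes cuts that to ~16 GB, but this still does not fit alongside ~16 GB of bf16 weights plus gradients and activations. LoRA is what makes the run fit: the base weights are frozen, so the optimizer only tracks the adapter parameters, typically tens of millions rather than 8 billion, and its states shrink to well under 1 GB. The bf16 base weights still occupy ~16 GB, leaving limited headroom for activations, so small batch sizes, gradient checkpointing, or a 4-bit quantized base model (QLoRA) are common. With the Hugging Face Trainer, 8-bit AdamW is selected by setting optim="adamw_8bit" in TrainingArguments; with optimizer states this small it saves little for LoRA and matters most for full fine-tuning.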
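The arithmetic behind these figures is simple enough to check by hand or in a few lines of Python. The sketch below is a rough estimate only: it assumes decimal gigabytes, exactly two state buffers per trainable parameter, and no quantization overhead. The helper name adam_state_gb and the 42M adapter size are illustrative assumptions, not values from a specific run.

```python
def adam_state_gb(trainable_params: float, bytes_per_state: float = 4.0) -> float:
    """Approximate memory for Adam's two state buffers (m and v), in decimal GB."""
    num_buffers = 2  # first moment m and second moment v
    return trainable_params * num_buffers * bytes_per_state / 1e9


# Full fine-tuning of an 8B model: every parameter is trainable.
full_fp32 = adam_state_gb(8e9, bytes_per_state=4)  # float32 states
full_8bit = adam_state_gb(8e9, bytes_per_state=1)  # 8-bit states, overhead ignored

# LoRA: only the adapter is trainable (42M parameters is an assumed size).
lora_fp32 = adam_state_gb(42e6, bytes_per_state=4)

print(f"full fine-tune, fp32 states : {full_fp32:.1f} GB")
print(f"full fine-tune, 8-bit states: {full_8bit:.1f} GB")
print(f"LoRA adapter, fp32 states   : {lora_fp32:.3f} GB")
```

The estimate covers optimizer states only. Weights, gradients, and activations come on top and depend on precision, batch size, and sequence length.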

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work