RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
Glossary / Training & optimization / Adam Optimizer
Training & optimization

Adam Optimizer

Adam (Adaptive Moment Estimation) is an optimizer that adjusts learning rates per parameter during training. It combines momentum (moving average of gradients) with RMSProp (moving average of squared gradients) to handle sparse gradients and noisy data. In local AI, operators encounter Adam when fine-tuning models with LoRA or full fine-tuning scripts—it's the default optimizer in Hugging Face Transformers and MLX. Adam requires storing two momentum buffers per parameter, roughly doubling VRAM usage compared to SGD, which matters for consumer GPUs with limited memory.

Deeper dive

Adam computes adaptive learning rates using estimates of first and second moments of gradients. The update rule: θ_t+1 = θ_t - α * m_t / (√v_t + ε), where m_t and v_t are bias-corrected moving averages. Key hyperparameters: β1 (default 0.9) controls momentum decay, β2 (default 0.999) controls squared gradient decay, and ε (1e-8) prevents division by zero. Variants like AdamW decouple weight decay from the adaptive updates, improving generalization. For operators, Adam's memory footprint is significant: each parameter requires two additional float32 buffers (m and v), so a 7B model in full precision (28 GB) needs ~84 GB for optimizer states alone. Mixed precision or 8-bit Adam (bitsandbytes) reduces this to ~42 GB or ~21 GB respectively, making fine-tuning feasible on 24 GB cards.

Practical example

Fine-tuning Llama 3.1 8B with LoRA on an RTX 4090 (24 GB VRAM). Using Hugging Face Transformers with AdamW in full precision, the optimizer states for the 8B model would require ~64 GB—impossible. Switching to 8-bit Adam via bitsandbytes reduces optimizer memory to ~16 GB, fitting within 24 GB alongside model weights and activations. The training script includes optimizer='adamw_8bit'.

Workflow example


Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →