RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
Training & optimization

Gradient Clipping

Gradient clipping is a technique used during neural network training to prevent exploding gradients. It caps the magnitude of the gradients at a maximum threshold before the optimizer updates the model weights. In practice, operators training models on local hardware (e.g., fine-tuning Llama 3.1 8B with LoRA) rely on gradient clipping to stabilize training when gradients grow large because of high learning rates or noisy data. The most common variant is norm-based clipping: if the L2 norm of the gradient vector exceeds a set threshold (e.g., 1.0), the whole vector is scaled down so that its norm equals the threshold. This prevents weight updates from becoming excessively large, which can cause loss spikes or divergence.
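The norm-based rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's implementation: the function name clip_by_norm and the flat-list representation of the gradient are assumptions made for the example.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Every component is multiplied by the same factor, so only the length changes.
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A gradient with L2 norm 5.0 is scaled down to norm 1.0 (roughly [0.6, 0.8]).
clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)

# A gradient already under the threshold is returned unchanged.
small, small_norm = clip_by_norm([0.3, 0.4], max_norm=1.0)
```

Returning the pre-clipping norm alongside the gradients makes it easy to log how often, and by how much, the threshold is exceeded.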

Deeper dive

Gradient clipping addresses the exploding gradient problem in deep networks, especially recurrent architectures and transformers trained on long sequences. During backpropagation, gradients are multiplied layer by layer, so they can grow exponentially with depth and lead to numerical overflow or unstable training. Clipping can be applied element-wise (value clipping) or by norm. Norm clipping is usually preferred because it preserves the gradient's direction and only rescales its magnitude, whereas value clipping caps each component independently and can change the direction. The threshold is a hyperparameter: set too low, it slows convergence; set too high, it may not prevent explosions. In local fine-tuning with reduced precision (e.g., QLoRA on an RTX 3090), gradient clipping is often enabled by default in frameworks like Hugging Face Transformers (whose Trainer defaults max_grad_norm to 1.0) or Axolotl. Operators may adjust the threshold based on loss behavior; a typical starting point is 1.0. Gradient clipping affects only training, not inference.
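To make the difference between value clipping and norm clipping concrete, the sketch below applies both to the same gradient and compares each result's direction with the original using cosine similarity. The helper names and the example gradient are illustrative assumptions, not taken from any library.

```python
import math

def clip_by_value(grads, limit):
    """Element-wise clipping: cap each component to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Norm clipping: rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

def cosine_similarity(a, b):
    """1.0 means the two vectors point in exactly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

grad = [10.0, 0.5]  # one dominant component, one small one

value_clipped = clip_by_value(grad, 1.0)  # only the large component is capped
norm_clipped = clip_by_norm(grad, 1.0)    # both components shrink by the same factor

cos_value = cosine_similarity(grad, value_clipped)  # noticeably below 1.0
cos_norm = cosine_similarity(grad, norm_clipped)    # 1.0 up to rounding error
```

In this example value clipping turns [10.0, 0.5] into [1.0, 0.5], which gives the small component far more relative weight than it had in the original gradient. Norm clipping keeps the ratio between components intact.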

Practical example

When fine-tuning Llama 3.1 8B on an RTX 3090 (24 GB VRAM) using QLoRA, the training script might include a clipping flag such as --max_grad_norm 1.0 (the Hugging Face Trainer argument; PyTorch Lightning names the equivalent setting gradient_clip_val). If the loss still spikes after a few steps, the operator may lower the clip value to 0.5 or increase the batch size to reduce gradient noise. Without clipping, gradients from a batch with rare tokens could cause a weight update that pushes the model into a high-loss region, requiring a restart.

Workflow example

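In Hugging Face Transformers' Trainer, gradient clipping is set through the max_grad_norm training argument (e.g., max_grad_norm=1.0 in TrainingArguments). In Axolotl configs, the corresponding key is max_grad_norm: 1.0 (gradient_clipping is the name used in DeepSpeed JSON configs). During training, the trainer computes gradients, clips them, then applies the optimizer step. The Trainer does not emit a warning when clipping occurs; recent versions of Transformers instead report a grad_norm value in the training logs, and a value above the threshold indicates that clipping was applied at that step. If training diverges, checking gradient norms (e.g., via the return value of torch.nn.utils.clip_grad_norm_, which is the total norm before clipping) helps diagnose the need for clipping.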
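The compute, clip, step order and the value of watching gradient norms can be illustrated with a toy loop in plain Python. This is a sketch under stated assumptions, not the Trainer's code: the one-parameter model y = w * x, the hand-derived gradient, the plain SGD update, and all names (train, batches, w0) are hypothetical and chosen only to show an outlier batch with and without clipping.

```python
import math

def train(batches, max_grad_norm, w0=1.5, lr=0.1):
    """Toy SGD loop for y = w * x with loss 0.5 * (w * x - y) ** 2.

    Returns the final weight, the pre-clipping gradient norm of each step,
    and the number of steps on which clipping was applied.
    """
    w = w0
    norms = []
    clipped_steps = 0
    for x, y in batches:
        # 1. Compute gradients (derivative of the loss with respect to w).
        grads = [(w * x - y) * x]
        # 2. Clip by L2 norm, recording the norm before clipping for monitoring.
        norm = math.sqrt(sum(g * g for g in grads))
        norms.append(norm)
        if norm > max_grad_norm:
            grads = [g * max_grad_norm / norm for g in grads]
            clipped_steps += 1
        # 3. Apply the optimizer step.
        w -= lr * grads[0]
    return w, norms, clipped_steps

# Three ordinary batches and one outlier with a much larger input scale.
batches = [(1.0, 2.0), (1.0, 2.0), (50.0, 100.0), (1.0, 2.0)]

w_clipped, norms, clipped_steps = train(batches, max_grad_norm=1.0)
w_unclipped, _, _ = train(batches, max_grad_norm=float("inf"))

print("pre-clip norms:", norms)
print("clipped steps:", clipped_steps)
print("final w with clipping:", w_clipped)
print("final w without clipping:", w_unclipped)
```

With clipping, the outlier batch moves the weight by at most lr * max_grad_norm, so w keeps approaching the target value of 2.0. Without it, the same batch throws w far past the target. The recorded norms show the spike on the third step, which is the kind of signal to look for when deciding whether a threshold is needed.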

Reviewed by Fredoline Eruo. See our editorial policy.
