Gradient Clipping

Gradient clipping is a technique used during neural network training to prevent exploding gradients. It limits the magnitude of the gradients to a set threshold before the optimizer updates the model weights. In practice, operators training models on local hardware (e.g., fine-tuning Llama 3.1 8B with LoRA) use gradient clipping to stabilize training when gradients grow large due to high learning rates or noisy data. The most common variant is norm-based clipping, where the whole gradient vector is scaled down if its L2 norm exceeds a set threshold (e.g., 1.0). This prevents weight updates from becoming excessively large, which can cause loss spikes or divergence.
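As a rough illustration of norm-based clipping, the sketch below rescales a plain list of gradient values so that its L2 norm never exceeds the threshold. The function name clip_by_norm and the example values are illustrative assumptions, not any framework's API; real libraries apply the same idea across all parameter tensors at once.

```python
import math

def clip_by_norm(grads, max_norm):
    """Return (clipped gradients, L2 norm before clipping).

    Hypothetical helper for illustration only: grads is a flat list of floats.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the threshold: leave the gradient untouched.
        return list(grads), total_norm
    # Shrink every component by the same factor so the new norm equals max_norm.
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# A gradient with L2 norm 5.0, clipped to a threshold of 1.0.
clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Because every component is multiplied by the same factor, the clipped vector points in the same direction as the original and only its length changes.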

Deeper dive

Gradient clipping addresses the exploding gradient problem in deep networks, especially recurrent architectures or transformers with long sequences. During backpropagation, gradients can grow exponentially with depth, leading to numerical overflow or unstable training. Clipping can be applied element-wise (value clipping) or by norm. Norm clipping is usually preferred because it preserves the gradient's direction and scales only its magnitude, whereas value clipping caps each component independently and can change the direction. The threshold is a hyperparameter: set too low, it clips almost every step and slows convergence; set too high, it may not prevent explosions. In local fine-tuning with limited precision (e.g., QLoRA on an RTX 3090), gradient clipping is often enabled by default in frameworks like Hugging Face Transformers (whose Trainer defaults to a maximum norm of 1.0) or Axolotl. Operators may adjust the threshold based on loss behavior; a typical starting point is 1.0. Gradient clipping affects only training, not inference.
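The difference between value clipping and norm clipping can be seen on a small example. The sketch below uses hypothetical helper functions on a two-component gradient and compares the direction of each clipped result with the original using cosine similarity; the numbers are made up for illustration.

```python
import math

def clip_by_value(grads, limit):
    """Element-wise clipping: cap each component to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Norm clipping: rescale the whole vector if its L2 norm is too large."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

grads = [10.0, 0.5]                     # one dominant component
by_value = clip_by_value(grads, 1.0)    # caps only the large component
by_norm = clip_by_norm(grads, 1.0)      # shrinks both components equally

print("value clipping cosine:", cosine(grads, by_value))
print("norm clipping cosine: ", cosine(grads, by_norm))
```

In this example value clipping shrinks the dominant component twentyfold while leaving the small one alone, so the update direction shifts; norm clipping keeps the cosine similarity at 1.0 up to floating-point error.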

Practical example

When fine-tuning Llama 3.1 8B on an RTX 3090 (24 GB VRAM) using QLoRA, the training command might include a flag such as --max_grad_norm 1.0 (the Hugging Face Trainer argument; other frameworks use different names). If the loss spikes after a few steps, the operator may lower the clip value to 0.5 or increase the batch size so that each gradient averages over more examples and is less noisy. Without clipping, gradients from a batch with rare tokens could cause a weight update that pushes the model into a high-loss region, requiring a restart.
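To make the effect of the clip value concrete, the following sketch computes how far a single update moves the weights for an outlier gradient, with no clipping and with thresholds of 1.0 and 0.5. It assumes plain SGD, where the step length is the learning rate times the gradient norm; adaptive optimizers such as AdamW rescale gradients further, so the numbers are illustrative only. The function name and values are hypothetical.

```python
import math

def sgd_step_length(grads, lr, max_norm=None):
    """Length of one plain-SGD weight update, optionally with norm clipping."""
    norm = math.sqrt(sum(g * g for g in grads))
    if max_norm is not None and norm > max_norm:
        # After norm clipping the gradient has length exactly max_norm.
        norm = max_norm
    return lr * norm

outlier = [30.0, -40.0]   # gradient from an unusual batch, L2 norm 50
lr = 0.1

unclipped = sgd_step_length(outlier, lr)
clip_1 = sgd_step_length(outlier, lr, max_norm=1.0)
clip_05 = sgd_step_length(outlier, lr, max_norm=0.5)
print(unclipped, clip_1, clip_05)
```

Under these assumptions the clipped step can never be longer than lr times the threshold, so halving the threshold halves the worst-case update.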

Workflow example

In Hugging Face Transformers' Trainer, gradient clipping is set through the max_grad_norm training argument, e.g., TrainingArguments(max_grad_norm=1.0). Axolotl configs expose the same setting as max_grad_norm: 1.0. During training, the trainer computes gradients, clips them, then applies the optimizer step. Recent versions of Trainer report a grad_norm value alongside the loss in the training logs, which operators can watch for sudden jumps. If training diverges in a custom loop, checking gradient norms helps diagnose the need for clipping: torch.nn.utils.clip_grad_norm_ returns the total norm measured before clipping, so its return value can be logged directly.
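The order of operations described above (compute gradients, clip, optimizer step) and the idea of logging the pre-clip norm can be sketched with a toy model in plain Python. This is not Trainer's or PyTorch's implementation: the one-feature linear model, the plain SGD update, the sample data, and the train function are all assumptions chosen to keep the example self-contained.

```python
import math

def train(samples, lr=0.1, max_norm=1.0):
    """Fit y = w*x + b one sample at a time, clipping gradients by L2 norm."""
    w, b = 0.0, 0.0
    history = []  # (pre-clip gradient norm, whether clipping was applied)
    for x, y in samples:
        # 1. Compute gradients of the squared error (w*x + b - y)**2.
        err = w * x + b - y
        grad_w, grad_b = 2 * err * x, 2 * err
        # 2. Measure the global L2 norm and derive the clipping factor.
        norm = math.hypot(grad_w, grad_b)
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        # 3. Optimizer step (plain SGD) using the clipped gradients.
        w -= lr * grad_w * scale
        b -= lr * grad_b * scale
        history.append((norm, scale < 1.0))
    return w, b, history

# The third sample is a deliberate outlier that produces a huge gradient.
samples = [(1.0, 2.0), (0.5, 1.0), (100.0, -500.0), (1.0, 2.0)]
w, b, history = train(samples)
for step, (norm, clipped) in enumerate(history):
    print(f"step {step}: grad_norm={norm:.2f} clipped={clipped}")
```

Logging the norm before clipping, rather than after, is what makes the outlier step visible: after clipping, every oversized gradient has the same norm as the threshold.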

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work