Training & optimization

Exploding Gradient

An exploding gradient occurs when the gradients used to update model weights grow exponentially during backpropagation, causing unstable updates and numerical overflow. This typically happens in deep networks, where the gradient is multiplied by one factor per layer, and is more likely with unbounded activation functions like ReLU when normalization is missing. In practice, operators encounter this during fine-tuning: if the loss spikes, overflows to infinity, or turns into NaN and training diverges, exploding gradients are a likely cause. Techniques like gradient clipping (rescaling gradients so their norm stays under a cap) and architectures with residual connections (e.g., Transformers) mitigate the issue.
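The two ideas above, exponential growth with depth and clipping by norm, can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the per-layer factor, the depth, and the gradient values are hypothetical numbers chosen for the example, and the helper names global_norm and clip_by_global_norm are made up here.

```python
import math

# Backpropagating through `depth` layers multiplies the gradient by one
# factor per layer; a factor only slightly above 1 compounds exponentially.
per_layer_factor = 1.5   # hypothetical value for illustration
depth = 50               # hypothetical network depth
grad_scale = per_layer_factor ** depth   # hundreds of millions


def global_norm(grads):
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def clip_by_global_norm(grads, max_norm):
    """Rescale every gradient so the global L2 norm is at most max_norm."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return grads, norm          # already within the cap; unchanged
    scale = max_norm / norm         # same factor for every value
    clipped = [[g * scale for g in layer] for layer in grads]
    return clipped, norm


# Hypothetical gradients for two layers; the second one has blown up.
grads = [[0.3, -0.4], [300.0, 400.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = global_norm(clipped)
```

Because every value is scaled by the same factor, clipping shrinks the size of the update but keeps its direction.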

Practical example

When fine-tuning a Llama 3.1 8B model with Hugging Face Transformers on a single RTX 4090, you might set max_grad_norm=1.0 in TrainingArguments. This value is also the library default, so clipping is normally active unless it has been disabled or the threshold has been raised. Without clipping, gradients can explode as they propagate back toward the early layers, causing the loss to jump to NaN within a few steps. With gradient clipping, training can remain stable at a learning rate of 2e-5.
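The setup described above might be configured roughly as follows. This is a configuration sketch only: output_dir and logging_steps are hypothetical values added for illustration, and the model, dataset, and Trainer wiring are left out.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama31-8b-finetune",  # hypothetical output path
    learning_rate=2e-5,                # learning rate from the example above
    max_grad_norm=1.0,                 # cap on the global gradient norm
    logging_steps=10,                  # hypothetical; log often to catch spikes early
)
```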

Workflow example

In a fine-tuning script using transformers.Trainer, you add args = TrainingArguments(..., max_grad_norm=1.0). The trainer clips the gradient norm after backpropagation and before the optimizer step. If you see loss = nan in the logs, check whether the gradient norm was growing in the steps before the failure. If it was, reduce the learning rate or lower the clipping threshold; raising the threshold weakens clipping and makes the problem worse.
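The ordering described above (backward pass, clip, optimizer step) can be sketched on a toy problem in plain Python. This is an illustrative stand-in, not the Trainer's code: the loss w^4, the starting weight, the learning rate, and the function name train are all assumptions made for this example.

```python
import math


def train(clip_threshold, lr=0.1, steps=200):
    """Minimize the toy loss w**4; return (step where loss went non-finite, final w)."""
    w = 3.0  # hypothetical starting weight
    for step in range(steps):
        loss = w * w * w * w
        if not math.isfinite(loss):
            return step, w                  # diverged: loss is inf or NaN
        grad = 4.0 * w * w * w              # "backward pass": d(loss)/dw
        grad_norm = abs(grad)
        # Clip after the backward pass and before the optimizer step.
        if clip_threshold is not None and grad_norm > clip_threshold:
            grad *= clip_threshold / grad_norm
        w -= lr * grad                      # "optimizer step" (plain SGD)
    return None, w


diverged_at, _ = train(clip_threshold=None)   # no clipping
stable_at, w_final = train(clip_threshold=1.0)  # clipped to norm 1.0
```

In this sketch the unclipped run overshoots further on each step until the loss stops being finite, while the clipped run takes bounded steps and settles toward the minimum. Note that clipping cannot repair a gradient that is already infinite or NaN, which is why the threshold and learning rate need to be set before the blow-up happens.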

Reviewed by Fredoline Eruo. See our editorial policy.
