Training & optimization

Learning Rate

The learning rate is a hyperparameter that controls how much the model's weights are adjusted at each training step. A high learning rate produces large weight updates, which can speed up training but risks overshooting the optimal values. A low learning rate produces smaller updates, which is more stable but may require many more steps. In practice, operators fine-tuning models with Hugging Face Transformers or LoRA adapters set this value (e.g., 1e-4 or 5e-5) and often use a scheduler to reduce it over time. The right learning rate balances convergence speed against final model quality.

Deeper dive

The learning rate is a scalar that scales the gradient in the optimizer's update step, after backpropagation has computed that gradient. In stochastic gradient descent (SGD), the weight update is: w_new = w_old - lr * gradient. If lr is too high, the loss may diverge; if it is too low, training stalls. Common values range from 1e-3 (for Adam on small tasks) to 1e-5 (for fine-tuning large models). Schedulers such as cosine annealing or linear decay adjust the rate during training. Operators fine-tuning Llama or Mistral with LoRA often start with lr=2e-4 and a cosine schedule. The learning rate is one of the most impactful knobs: a wrong choice can waste hours of GPU time.
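The update rule above can be tried on a toy problem. The following is a minimal sketch, assuming a one-dimensional loss f(w) = w**2 (gradient 2*w) chosen only for illustration; the function names are hypothetical and the learning rates are scaled for this toy loss, not for real models.

```python
def sgd_step(w, grad, lr):
    """One plain SGD update: w_new = w_old - lr * gradient."""
    return w - lr * grad


def minimize_quadratic(lr, steps=50, w0=1.0):
    """Run SGD on the toy loss f(w) = w**2, whose gradient is 2*w.

    The minimum is at w = 0, so the distance of the result from 0
    shows how well a given learning rate worked.
    """
    w = w0
    for _ in range(steps):
        w = sgd_step(w, 2.0 * w, lr)
    return w


if __name__ == "__main__":
    # Too high (diverges), reasonable (converges), too low (barely moves).
    for lr in (1.1, 0.1, 0.001):
        print(f"lr={lr}: w after 50 steps = {minimize_quadratic(lr):.4g}")
```

On this toy loss each step multiplies w by (1 - 2 * lr), so any lr above 1.0 makes the magnitude of w grow instead of shrink, which is the divergence described above.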

Practical example

When fine-tuning Llama 3.1 8B with LoRA on a single RTX 4090 (24 GB VRAM), a typical learning rate is 2e-4 with a cosine scheduler. If you set it to 1e-3, the loss may spike and training can become unstable. If you set it to 1e-6, the model barely changes after 1,000 steps. The right rate depends on batch size, model size, and training method: for full fine-tuning of a 7B model, rates around 1e-5 are common.
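To make the cosine schedule concrete, here is a minimal sketch of the standard cosine decay formula in plain Python. It assumes no warmup phase and a floor of zero, and the function name and the 1,000-step horizon are hypothetical; library schedulers usually add warmup and other options.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumption: no warmup; steps past total_steps stay at min_lr.
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the rate at the start, middle, and end of a 1,000-step run.
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, 1000):.2e}")
```

The rate starts at the peak value, passes through half of it at the midpoint, and flattens out near zero at the end, so late updates are much smaller than early ones.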

Workflow example

In Hugging Face Transformers, you set the learning rate in TrainingArguments: TrainingArguments(learning_rate=2e-4, lr_scheduler_type='cosine'). Unsloth's LoRA fine-tuning examples commonly use 2e-4. In llama.cpp's experimental training tools, the rate is passed on the command line (e.g., --learning-rate 1e-4); tool names and flags have changed between releases, so confirm them with --help. If the loss plateaus, operators may lower the rate manually or rely on a scheduler that decays it automatically.
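Lowering the rate when the loss plateaus can also be automated. The following is a minimal sketch of that idea in plain Python; the class name, the halving factor, and the patience value are illustrative assumptions, and this is not the Hugging Face or llama.cpp implementation. PyTorch provides a production version of the same idea as torch.optim.lr_scheduler.ReduceLROnPlateau.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience`
    consecutive evaluations without a meaningful improvement in loss."""

    def __init__(self, lr, factor=0.5, patience=3, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta  # smallest drop that counts as progress
        self.best = float("inf")
        self.stale = 0  # evaluations since the last improvement

    def update(self, loss):
        """Record one evaluation loss and return the (possibly lowered) rate."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


if __name__ == "__main__":
    sched = PlateauDecay(lr=2e-4)
    # The loss improves twice, then stays flat for three evaluations.
    for loss in (1.0, 0.8, 0.8, 0.8, 0.8):
        print(f"loss={loss}: lr={sched.update(loss):.1e}")
```

Calling update once per evaluation keeps the decision tied to a measured loss instead of a fixed step count, which is the main design difference from the cosine or linear schedules mentioned earlier.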

Reviewed by Fredoline Eruo. See our editorial policy.