Learning Rate Schedule
A learning rate schedule adjusts the step size (learning rate) over the course of training to improve convergence and model quality. In local AI, operators fine-tuning models with Hugging Face Transformers or Unsloth commonly combine a linear warmup with a decay such as cosine: the warmup avoids large, unstable updates early in training, and the decay shrinks the updates later so the weights can be refined. A schedule is defined by a peak (or starting) rate, a decay function, and an optional number of warmup steps. The choice matters because a fixed rate that is too low can stall and one that is too high can diverge, wasting GPU hours on consumer hardware.
Deeper dive
The learning rate controls how far the weights move on each optimizer update. A schedule changes this rate as a function of the training step. Common schedules include constant (rarely optimal), step decay (the rate drops at fixed intervals), exponential decay (the rate shrinks by a constant factor each step), and cosine decay (the rate falls smoothly along a cosine curve). Linear warmup is not a decay schedule on its own: it raises the rate from zero to the peak rate over the first steps and is then followed by one of the decay schedules. Warmup is often important for large models, where full-size updates at the start of training can cause instability. Operators fine-tuning Llama 3.1 8B on an RTX 4090 might use a cosine schedule with 100 warmup steps and a peak rate of 2e-5. The schedule is set in the training script (e.g., with Transformers' get_cosine_schedule_with_warmup).
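The schedule families named above can be compared as plain functions of the step count. This is a minimal sketch, not any library's implementation: the base rate of 2e-5 comes from the example above, while the total step count, drop interval, decay factor, and gamma are hypothetical values chosen for illustration.

```python
import math

BASE_LR = 2e-5       # peak rate from the example above
TOTAL_STEPS = 2000   # assumed training length (hypothetical)

def constant(step):
    # Same rate at every step.
    return BASE_LR

def step_decay(step, drop_every=500, factor=0.5):
    # Multiply the rate by `factor` once every `drop_every` steps.
    return BASE_LR * factor ** (step // drop_every)

def exponential_decay(step, gamma=0.999):
    # Shrink the rate by a constant ratio `gamma` at every step.
    return BASE_LR * gamma ** step

def cosine_decay(step, total_steps=TOTAL_STEPS):
    # Follow half a cosine wave from BASE_LR down to 0.
    progress = min(1.0, step / total_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 1000, 2000):
    print(step, constant(step), step_decay(step),
          exponential_decay(step), cosine_decay(step))
```

Printing a few steps side by side shows the qualitative difference: step decay changes abruptly at each interval, exponential decay falls fastest at the start, and cosine decay stays near the base rate early before dropping toward zero at the end.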
Practical example
Consider fine-tuning a 7B model on a single RTX 3090 (24 GB VRAM) with batch size 1 and gradient accumulation 4, for an effective batch size of 4. A constant learning rate of 2e-5 may cause loss spikes after a few hundred steps. Switching to a cosine schedule with 10% warmup (e.g., 200 warmup steps out of 2000 total) can smooth training and reach a lower final perplexity. The schedule is defined in the training arguments: lr_scheduler_type='cosine', warmup_ratio=0.1.
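The numbers in this example can be checked with a little arithmetic. The sketch below assumes a single GPU, that the 2000 total steps count optimizer updates rather than micro-batches, and that the warmup step count is the ratio times the total rounded up; the variable names are illustrative, not trainer internals.

```python
import math

per_device_batch_size = 1
gradient_accumulation_steps = 4
total_steps = 2000      # assumed to count optimizer updates
warmup_ratio = 0.1

# Samples contributing to each optimizer update (single GPU assumed).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Warmup length implied by the ratio.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Training samples consumed over the whole run.
samples_seen = total_steps * effective_batch_size

print(effective_batch_size, warmup_steps, samples_seen)
```

Under these assumptions the run performs 200 warmup updates and processes 8000 samples in total, so the peak rate of 2e-5 is reached after the first 800 samples.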
Workflow example
In Hugging Face Transformers, the schedule is set via TrainingArguments when using Trainer. For example: TrainingArguments(lr_scheduler_type='cosine', warmup_steps=100, learning_rate=2e-5). In Unsloth, the same arguments go to the trainer (typically TRL's SFTTrainer) rather than to get_peft_model, which only configures the LoRA adapters. Operators monitor loss curves in TensorBoard; a well-chosen schedule shows a steady decrease without long plateaus or spikes. In MLX, the schedule is passed to the optimizer: optimizer = optim.AdamW(learning_rate=lr_schedule), where lr_schedule is a callable returning the rate at each step.
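To make the idea of a callable returning the rate at each step concrete, here is a minimal pure-Python sketch that joins a linear warmup to a cosine decay. The helper names (linear_warmup, cosine_decay, join) are hypothetical stand-ins written for this example, not the API of MLX or Transformers, and the 1000-step training length is an assumption.

```python
import math

def linear_warmup(peak_lr, warmup_steps):
    # Rate rises linearly from 0 to peak_lr over `warmup_steps`.
    def schedule(step):
        return peak_lr * min(1.0, step / max(1, warmup_steps))
    return schedule

def cosine_decay(peak_lr, decay_steps):
    # Rate falls from peak_lr to 0 along half a cosine wave.
    def schedule(step):
        progress = min(1.0, step / max(1, decay_steps))
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    return schedule

def join(first, second, boundary):
    # Use `first` before `boundary`, then `second` with its step counter reset.
    def schedule(step):
        return first(step) if step < boundary else second(step - boundary)
    return schedule

# 100 warmup steps to a peak of 2e-5, then 900 decay steps (1000 total, assumed).
lr_schedule = join(linear_warmup(2e-5, 100), cosine_decay(2e-5, 900), boundary=100)

for step in (0, 50, 100, 550, 1000):
    print(step, lr_schedule(step))
```

Building the schedule from small composable pieces keeps each phase easy to test on its own, and the resulting lr_schedule has the shape an optimizer accepting a callable learning rate would expect: one argument (the step) in, one float out.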
Reviewed by Fredoline Eruo. See our editorial policy.