RUNLOCALAI v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Fine-Tuning with LoRA and QLoRA

15. Monitoring Training

Chapter 15 of 24 · 20 min
KEY INSIGHT

Training a fine-tuned model without monitoring is like flying blind. Detailed monitoring catches divergence before hours of training time are wasted.

Key metrics to track: training loss, validation loss, learning rate schedule, gradient norms, and token-level accuracy. Loss alone is insufficient: the curve does not show whether the model is learning the right patterns, and a plateauing loss can mask a distribution shift in generated outputs.

```python
from torch.utils.tensorboard import SummaryWriter


class TrainingMonitor:
    def __init__(self, log_dir="./logs"):
        self.writer = SummaryWriter(log_dir)
        self.history = {"train_loss": [], "val_loss": [], "grad_norm": []}

    def log_step(self, step, metrics):
        for key, value in metrics.items():
            self.writer.add_scalar(key, value, step)
            # setdefault lets callers log keys beyond the three defaults
            self.history.setdefault(key, []).append(value)

    def check_health(self, step):
        """Detect common training pathologies."""
        issues = []
        grad_norms = self.history["grad_norm"]
        val_losses = self.history["val_loss"]

        # Gradient explosion
        if grad_norms and grad_norms[-1] > 100:
            issues.append("Exploding gradients detected")

        # Loss divergence: the last five validation losses rise strictly
        if len(val_losses) >= 5:
            recent = val_losses[-5:]
            if all(recent[i] < recent[i + 1] for i in range(len(recent) - 1)):
                issues.append("Validation loss diverging")

        # Vanishing gradients
        if grad_norms and grad_norms[-1] < 0.1:
            issues.append("Gradients vanishing - LR may be too low")

        return issues
```

**Gradient norm tracking** reveals training stability. As a rule of thumb, stable transformer training produces gradient norms between roughly 0.5 and 5.0. Values far outside this range often indicate a configuration error.

```python
# Log the global L2 gradient norm; call after backward() and before optimizer.step()
def log_gradients(model, step, writer):
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.detach().norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    writer.add_scalar("grad_norm", total_norm, step)
    return total_norm
```

**Failure mode**: NaN losses often emerge from numerical overflow in mixed-precision training. The fix is typically reducing the learning rate or enabling loss scaling:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

# In training loop
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    outputs = model(**batch)
    loss = outputs.loss

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # unscale so clipping sees the true gradient values
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)  # skips the update if gradients contain inf or NaN
scaler.update()
```
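A NaN loss is cheapest to handle when it is caught on the step where it first appears. The sketch below is one possible plain-Python check for that, not part of any library: `LossGuard`, its `window` size, and its `spike_factor` threshold are hypothetical names and values chosen for illustration. It flags non-finite losses and losses that jump well above the recent average.

```python
import math
from collections import deque


class LossGuard:
    """Hypothetical helper: flags non-finite losses and sudden spikes."""

    def __init__(self, window=20, spike_factor=3.0):
        self.recent = deque(maxlen=window)  # rolling window of healthy losses
        self.spike_factor = spike_factor    # assumed threshold, tune per run

    def check(self, loss_value):
        """Return 'nan', 'spike', or 'ok' for a scalar loss value."""
        if not math.isfinite(loss_value):
            return "nan"  # NaN or inf: stop and inspect before continuing
        if self.recent:
            mean = sum(self.recent) / len(self.recent)
            if loss_value > self.spike_factor * mean:
                return "spike"  # kept out of the window so the baseline stays clean
        self.recent.append(loss_value)
        return "ok"


guard = LossGuard(window=5, spike_factor=3.0)
statuses = [guard.check(x) for x in [2.0, 1.9, 1.8, 9.0, float("nan"), 1.7]]
print(statuses)  # ['ok', 'ok', 'ok', 'spike', 'nan', 'ok']
```

In a training loop you would pass `loss.item()` to `check` each step and pause or save a checkpoint on anything other than `"ok"`.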

EXERCISE: Implement Custom Metrics

Create a monitoring system that tracks per-token accuracy alongside loss:

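def compute_token_accuracy(logits, labels, pad_token_id=0):
    """Fraction of non-padding positions where the argmax prediction matches the label."""
    # For causal LMs, shift first (logits[:, :-1] vs labels[:, 1:]) so each
    # position is scored against the token it predicts.
    predictions = logits.argmax(dim=-1)
    mask = labels != pad_token_id  # use -100 here if your labels are masked that way
    correct = (predictions == labels) & mask
    total = mask.sum().item()
    # Avoid dividing by zero on a batch that is all padding
    return correct.sum().item() / total if total > 0 else 0.0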

Integrate this into your training loop and visualize accuracy curves alongside loss curves in TensorBoard.
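One detail to watch when integrating: batches contain different numbers of non-padding tokens, so averaging per-batch accuracies can misstate the overall figure. A minimal sketch of a token-weighted accumulator follows. `AccuracyMeter` is a hypothetical helper, and it assumes you collect the correct and total token counts from each batch rather than only the ratio.

```python
class AccuracyMeter:
    """Hypothetical accumulator: token-weighted accuracy over many batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, num_correct, num_tokens):
        # Pool raw counts so long batches weigh more than short ones
        self.correct += num_correct
        self.total += num_tokens

    def value(self):
        return self.correct / self.total if self.total else 0.0


meter = AccuracyMeter()
meter.update(num_correct=90, num_tokens=100)  # long batch, 90% correct
meter.update(num_correct=1, num_tokens=10)    # short batch, 10% correct
pooled = meter.value()
naive = (90 / 100 + 1 / 10) / 2               # unweighted mean of batch ratios
print(f"pooled={pooled:.3f} naive={naive:.3f}")  # pooled=0.827 naive=0.500
```

The pooled value is the one to log per evaluation pass, for example with `writer.add_scalar("val_token_accuracy", meter.value(), step)`.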
