RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Fine-Tuning with LoRA and QLoRA

13. Gradient Checkpointing

Chapter 13 of 24 · 20 min
KEY INSIGHT

Fine-tuning large models on consumer hardware often hits a memory wall during backpropagation. Storing the intermediate activations of every layer consumes gigabytes: depending on sequence length, a 7B-parameter model can require 20 GB or more just for activations at batch size 1 in fp16. Gradient checkpointing breaks this bottleneck by selectively discarding activations and recomputing them during the backward pass.

The technique divides the model into segments. Instead of caching every activation, it stores only the input to each segment. During backpropagation, it reruns the forward pass for each segment to rebuild the activations that segment's gradients need. This trades approximately 20-30% extra compute for a memory reduction of roughly 50-70%.

```python
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Verify it's active (is_gradient_checkpointing is a property, not a method)
print(f"Gradient checkpointing: {model.is_gradient_checkpointing}")
```

The `gradient_checkpointing_enable()` method switches the model's supported layers to run through PyTorch's checkpointing utility, so their internal activations are recomputed rather than stored. For custom models, use `torch.utils.checkpoint.checkpoint()` explicitly around layer groups.

**Failure mode**: Activations are recomputed on every backward pass, so each training step is slower. If the model already fits in memory at the batch size you need, checkpointing costs time and buys nothing. Profile before committing.

**When it matters most**: QLoRA training with 4-bit quantized base models. The frozen base weights need no gradients or optimizer state, so activation memory dominates the budget. Gradient checkpointing becomes essential when the batch size you want no longer fits under the memory ceiling.
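The segment trade-off described earlier in this section can be estimated with a rough counting model. This is a sketch, not a measurement: it assumes every layer produces an activation of the same size and ignores weights, optimizer state, and attention workspace. The function name `checkpoint_cost` and its parameters are hypothetical, introduced only for illustration.

```python
import math

def checkpoint_cost(num_layers: int, num_segments: int) -> dict:
    """Count activation "units" held in memory under segment checkpointing.

    Assumption: each layer emits one activation of equal size (one unit).
    """
    if not 1 <= num_segments <= num_layers:
        raise ValueError("num_segments must be between 1 and num_layers")
    layers_per_segment = math.ceil(num_layers / num_segments)
    # Held for the whole step: one stored input per segment.
    # Held temporarily: the activations of the one segment being recomputed.
    peak_units = num_segments + layers_per_segment
    return {
        "baseline_units": num_layers,  # no checkpointing: every activation cached
        "peak_units": peak_units,
        "memory_saved": 1 - peak_units / num_layers,
    }

# Compare a few segment counts for a hypothetical 32-layer model.
for segments in (2, 4, 8, 16):
    cost = checkpoint_cost(num_layers=32, num_segments=segments)
    print(segments, cost["peak_units"], f"{cost['memory_saved']:.1%}")
```

Under these assumptions the saving is largest when the number of segments is near the square root of the layer count. Too few segments leave a large segment to recompute at once, and too many mean storing almost as many inputs as there were activations.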
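```python
# For custom training loops, wrap forward passes
from torch.utils.checkpoint import checkpoint

def forward_with_checkpoint(module, *args, **kwargs):
    # Only the inputs are kept; the module's activations are recomputed in backward.
    return checkpoint(module, *args, use_reentrant=False, **kwargs)
```

The `use_reentrant=False` parameter selects the non-reentrant implementation, which PyTorch recommends. It accepts keyword arguments (the wrapper above relies on this) and avoids problems in the older reentrant variant that can cause silent failures in distributed training.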
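To check that checkpointing layer groups changes only memory behavior and not the result, the sketch below runs a toy model with and without `checkpoint` and compares the gradients. `TinyStack`, its sizes, and the group size of 2 are made up for illustration; only `torch.utils.checkpoint.checkpoint` and `use_reentrant=False` come from the text.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class TinyStack(nn.Module):
    """Toy model (hypothetical): 8 small blocks, checkpointed in groups of 2."""

    def __init__(self, width: int = 64, depth: int = 8, group_size: int = 2):
        super().__init__()
        blocks = [nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)]
        # Each group is one checkpoint segment: only its input is kept.
        self.groups = nn.ModuleList(
            [nn.Sequential(*blocks[i:i + group_size]) for i in range(0, depth, group_size)]
        )
        self.use_checkpointing = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for group in self.groups:
            if self.use_checkpointing and self.training:
                x = checkpoint(group, x, use_reentrant=False)
            else:
                x = group(x)
        return x

def grads_for(model: TinyStack, x: torch.Tensor, use_checkpointing: bool):
    """Run one forward/backward pass and return a copy of every gradient."""
    model.use_checkpointing = use_checkpointing
    model.zero_grad(set_to_none=True)
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]

torch.manual_seed(0)
model = TinyStack().train()
x = torch.randn(4, 64)
plain = grads_for(model, x, use_checkpointing=False)
ckpt = grads_for(model, x, use_checkpointing=True)
grads_match = all(torch.allclose(a, b) for a, b in zip(plain, ckpt))
print(f"Gradients match: {grads_match}")
```

The input here does not require gradients, which is the situation you get when the first layers of a model are frozen. The non-reentrant variant handles that case without extra setup.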

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
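One way to capture that record is a small helper like the one below. It is only a sketch: the function name `environment_record`, the default package list, and the JSON layout are assumptions, not a format this course prescribes. Packages that are not installed are reported as such rather than raising an error.

```python
import json
import platform
from importlib import metadata

def environment_record(packages=("torch", "transformers"), data_path=None) -> dict:
    """Collect the details worth saving next to an exercise result."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "data_path": data_path,    # fill in if the exercise reads local data
        "observed_output": None,   # paste the printed result here
    }

record = environment_record()
print(json.dumps(record, indent=2))
```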

EXERCISE: Measure Memory Impact

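import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "gpt2"  # swap in "gpt2-medium" for the second run

def peak_memory_gb(use_checkpointing):
    # Checkpointing only affects training, so measure one forward/backward pass.
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).cuda()
    model.train()
    if use_checkpointing:
        model.gradient_checkpointing_enable()
    input_ids = torch.randint(0, model.config.vocab_size, (4, 512), device="cuda")
    torch.cuda.reset_peak_memory_stats()
    model(input_ids=input_ids, labels=input_ids, use_cache=False).loss.backward()
    return torch.cuda.max_memory_allocated() / 1e9

# get_memory_footprint() counts only parameters and buffers, so it reports the
# same number either way; peak allocated memory captures the activations.
print(f"Without checkpointing: {peak_memory_gb(False):.2f} GB")
print(f"With checkpointing: {peak_memory_gb(True):.2f} GB")

Run this on a CUDA GPU with gpt2, then change the model name to gpt2-medium and compare the difference in peak memory. The savings scale with model size.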
