13. Gradient Checkpointing

Chapter 13 of 24 · 20 min

KEY INSIGHT

Fine-tuning large models on consumer hardware often hits a memory wall during backpropagation. Storing all intermediate activations for every layer consumes gigabytes: a 7B-parameter model can require 20GB+ just for activations at batch size 1 in fp16, depending on sequence length. Gradient checkpointing breaks this bottleneck by selectively discarding activations and recomputing them during the backward pass.

The technique divides the model into segments. Instead of caching every activation, only the input to each segment is stored. During backpropagation, the forward pass for each segment is rerun to rebuild the activations needed for its gradients. This trades approximately 20-30% extra compute for roughly 50-70% memory reduction.

```python
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Verify it's active (is_gradient_checkpointing is a property, not a method)
print(f"Gradient checkpointing: {model.is_gradient_checkpointing}")
```

The `gradient_checkpointing_enable()` method switches the model's supported layers to run through `torch.utils.checkpoint` during training, so their internal activations are recomputed rather than stored. For custom models, use `torch.utils.checkpoint.checkpoint()` explicitly around layer groups.

**Failure mode**: Activations are recomputed on every backward pass, so each training step becomes slower. If the model already fits in memory at the batch size you need, checkpointing costs time and buys nothing. Profile before committing.

**When it matters most**: QLoRA training with 4-bit quantized base models. The frozen base weights need no gradients or optimizer state, so activation memory dominates the budget. Gradient checkpointing becomes essential when the per-device batch size or sequence length pushes activations toward the memory ceiling.
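A back-of-the-envelope model can make the segment trade-off concrete before any profiling. The sketch below is a deliberate simplification, not a measurement: the function names and the abstract memory "units" are illustrative assumptions, and real savings depend on architecture, sequence length, and kernel implementations. It assumes every layer would normally keep `per_layer` units of activations, that each stored segment input costs `boundary` units, and that only one segment is rematerialized at a time during the backward pass.

```python
import math


def baseline_activation_units(num_layers, per_layer=1.0):
    """Activation memory when every layer keeps all of its activations."""
    return num_layers * per_layer


def checkpointed_activation_units(num_layers, segment_size, per_layer=1.0, boundary=1.0):
    """Rough peak activation memory with one checkpoint per segment.

    Hypothetical cost model: stored segment inputs plus the activations
    of the single segment currently being recomputed.
    """
    num_segments = math.ceil(num_layers / segment_size)
    stored_inputs = num_segments * boundary
    live_segment = min(segment_size, num_layers) * per_layer
    return stored_inputs + live_segment


layers = 32
base = baseline_activation_units(layers, per_layer=10.0)
for k in (1, 2, 4, 8, 16, 32):
    used = checkpointed_activation_units(layers, k, per_layer=10.0, boundary=1.0)
    print(f"segment_size={k:2d}  units={used:6.1f}  saving={1 - used / base:.0%}")
```

Under these assumptions, very small segments store many boundary inputs while very large segments keep most activations alive at once, so the lowest memory use falls somewhere in between. Treat the numbers as a way to reason about segment size, then confirm with a profiler.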
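```python
# For custom training loops, wrap forward passes
from torch.utils.checkpoint import checkpoint


def forward_with_checkpoint(module, *args, **kwargs):
    # Keyword arguments are only supported by the non-reentrant implementation
    return checkpoint(module, *args, use_reentrant=False, **kwargs)
```

The `use_reentrant=False` parameter selects the non-reentrant implementation, which PyTorch recommends. It avoids several pitfalls of the older reentrant variant, which yields no gradients (with only a warning) when none of the checkpointed inputs require them and has been a source of hard-to-diagnose problems in distributed training.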
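To see checkpointing applied around layer groups in a custom model, the following sketch builds a toy stack of linear layers and checks that gradients are the same with and without checkpointing. The class `CheckpointedMLP`, the helper `grads_for`, and the sizes used are hypothetical choices for illustration; only `torch.utils.checkpoint.checkpoint` and standard `torch.nn` modules come from PyTorch. It runs on CPU, so it demonstrates correctness rather than memory savings.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Toy layer stack, checkpointed in groups of `group_size` layers."""

    def __init__(self, width=64, depth=8, group_size=2):
        super().__init__()
        blocks = [nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)]
        # Each group is one checkpoint segment: only its input is kept.
        self.groups = nn.ModuleList(
            [nn.Sequential(*blocks[i:i + group_size]) for i in range(0, depth, group_size)]
        )
        self.use_checkpoint = True

    def forward(self, x):
        for group in self.groups:
            if self.use_checkpoint and self.training:
                # Activations inside the group are recomputed during backward.
                x = checkpoint(group, x, use_reentrant=False)
            else:
                x = group(x)
        return x


def grads_for(model, x):
    """Run one forward and backward pass and return copies of all gradients."""
    model.zero_grad(set_to_none=True)
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


torch.manual_seed(0)
model = CheckpointedMLP().train()
x = torch.randn(4, 64)

model.use_checkpoint = True
with_ckpt = grads_for(model, x)
model.use_checkpoint = False
without_ckpt = grads_for(model, x)

grads_match = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(with_ckpt, without_ckpt))
print("gradients match:", grads_match)
```

Checkpointing is skipped outside training mode in this sketch because there is no backward pass to save memory for during evaluation.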

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
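One way to capture the details listed above is a small script that writes the environment out alongside your results. This is a minimal sketch using only the standard library; the helper names `package_version` and `collect_run_record` and the default package list are assumptions made for illustration. It records versions only, so GPU model, available memory, and data paths would still need to be noted by hand or added for your setup.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_run_record(packages=("torch", "transformers")):
    """Gather interpreter, OS, and package versions for a reproducibility note."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }


record = collect_run_record()
print(json.dumps(record, indent=2))
```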

EXERCISE: Measure Memory Impact

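```python
import torch
from transformers import AutoModelForCausalLM


def peak_memory_gb(model, input_ids):
    """Peak GPU memory for one forward and backward pass."""
    torch.cuda.reset_peak_memory_stats()
    model(input_ids=input_ids, labels=input_ids).loss.backward()
    model.zero_grad(set_to_none=True)
    return torch.cuda.max_memory_allocated() / 1e9


# get_memory_footprint() only counts weights and buffers, which checkpointing
# does not change, so measure peak usage during a training step instead.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")
model.train()  # checkpointing only takes effect in training mode
input_ids = torch.randint(0, model.config.vocab_size, (4, 512), device="cuda")

# Baseline without checkpointing
print(f"Without checkpointing: {peak_memory_gb(model, input_ids):.2f} GB")

# With checkpointing
model.gradient_checkpointing_enable()
print(f"With checkpointing: {peak_memory_gb(model, input_ids):.2f} GB")
```

Run this on a CUDA GPU with gpt2, then repeat with gpt2-medium and compare the peak memory with and without checkpointing. The savings grow with model depth, batch size, and sequence length.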