Checkpoint

Definition pending

We've cataloged "Checkpoint" but haven't written a full definition yet. Definitions are hand-curated rather than auto-generated, so it takes time to cover every term.

Want this one prioritized? Email us and we'll bump it.

Practical example

A checkpoint is a snapshot of model weights, optimizer state, and training metadata saved during or after training. It's your insurance policy: if training crashes at step 95,000 of 100,000, you resume from the last checkpoint, not from scratch. Checkpoints also enable model selection — pick the checkpoint with best validation score, not necessarily the last one.

Workflow example

Checkpoint strategy: (1) save every N steps (typically every 500–1000 steps for LLM fine-tuning), (2) keep top-K checkpoints by validation loss (discard the rest to save disk), (3) after training, run full eval on top-3 checkpoints and pick the best, (4) store checkpoints in cloud object storage with versioning, (5) checkpoint size for 7B model: ~14 GB (FP16), ~4 GB (LoRA adapter only). Budget 100–500 GB for a full training run's checkpoints.

Buyer guides

When it doesn't work