11. Hugging Face Trainer

Chapter 11 of 24 · 20 min

The Hugging Face Trainer class provides a complete training loop for fine-tuning models on Hugging Face datasets. It handles gradient accumulation, checkpoint saving, logging, evaluation, and device management with minimal configuration.

Instantiating a Trainer requires three core components: a model (here, one with LoRA adapters applied), a TrainingArguments configuration, and a tokenized dataset. Optional components include an evaluation dataset, a compute_metrics function, and callbacks for custom behavior.

The model passed to Trainer should already have LoRA adapters configured. The PEFT library provides get_peft_model, which wraps a base model according to a LoRA configuration, freezes the base weights, and returns a model ready for Trainer. Because Trainer updates only parameters that require gradients, training touches the LoRA parameters alone; everything else stays frozen.
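To sanity-check the trainable parameter count a LoRA configuration should produce, the arithmetic can be sketched in a few lines. A LoRA adapter on a linear layer adds two low-rank matrices holding rank × (in_features + out_features) parameters in total. The dimensions used below (hidden size 4096, 32 layers, square q_proj and v_proj) are illustrative assumptions, not values from this chapter; models with grouped-query attention have a smaller v_proj, so their count would differ.

```python
def lora_params_per_module(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


def expected_trainable_params(hidden_size: int, num_layers: int, rank: int,
                              modules_per_layer: int = 2) -> int:
    """Expected LoRA parameter count when every targeted module is hidden x hidden."""
    per_module = lora_params_per_module(hidden_size, hidden_size, rank)
    return per_module * modules_per_layer * num_layers


# Hypothetical dimensions chosen for illustration only
expected = expected_trainable_params(hidden_size=4096, num_layers=32, rank=8)
print(f"Expected trainable parameters: {expected:,}")
```

Comparing a number like this against the output of print_trainable_parameters() is a quick way to confirm that the adapters landed on the intended modules.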

Dataset preparation requires tokenizing the input text and computing labels. For instruction tuning, the labels typically mask the non-response tokens so that loss is calculated only on the target response. This concentrates the training signal on the behavior being taught rather than on reproducing the prompt, which limits unintended changes to the model.
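The masking described above can be sketched without any library code. The function below is a minimal illustration, assuming the prompt and response have already been tokenized separately; the name build_masked_example and the token ids are made up for the example. Labels stay aligned with input_ids because the causal LM models in transformers shift them internally, and -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_masked_example(prompt_ids, response_ids, max_length=512):
    """Concatenate prompt and response ids; mask prompt positions in the labels."""
    input_ids = (list(prompt_ids) + list(response_ids))[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_length]
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token ids standing in for a tokenized prompt and response
example = build_masked_example(prompt_ids=[101, 7, 8, 9], response_ids=[42, 43, 2])
print(example["labels"])
```

Note that DataCollatorForLanguageModeling, as used in the exercise script, builds its own labels from input_ids, so precomputed labels like these need a collator that preserves them, such as DataCollatorForSeq2Seq, which pads labels with -100.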

Gradient checkpointing reduces memory consumption at the cost of additional compute. When enabled, most intermediate activations are discarded after the forward pass and recomputed during the backward pass. The memory saved can be spent on larger batch sizes or longer sequences.
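The store-versus-recompute trade can be seen in a toy setting. The sketch below is not how Trainer or PyTorch implement checkpointing; it is a hand-written chain of scalar tanh layers, with the function names and the segment size chosen for illustration. It keeps one saved input per segment, recomputes the rest during the backward pass, and reports a rough count of stored values alongside the gradient.

```python
import math


def layer(w: float, x: float) -> float:
    return math.tanh(w * x)


def layer_grad(w: float, x: float) -> float:
    """Derivative of tanh(w * x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x):
    """Standard backprop: keep every layer input from the forward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)


def grad_checkpointed(weights, x, segment):
    """Keep one input per segment; recompute the rest during the backward pass."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        x_in, inputs = checkpoints[start], []
        for w in seg_weights:  # second forward pass over this segment only
            inputs.append(x_in)
            x_in = layer(w, x_in)
        # Rough count: all checkpoints plus the inputs recomputed for this segment
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, xi in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, xi)
    return grad, peak


weights = [0.5 + 0.1 * i for i in range(16)]
full_grad, full_stored = grad_store_all(weights, 0.3)
ckpt_grad, ckpt_stored = grad_checkpointed(weights, 0.3, segment=4)
print(full_stored, ckpt_stored)
print(math.isclose(full_grad, ckpt_grad))
```

Both functions return the same gradient; the checkpointed version holds fewer values at once but runs each layer's forward computation a second time. With Trainer, the equivalent switch is the gradient_checkpointing option of TrainingArguments.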

Mixed precision training (fp16 or bf16) reduces memory use in the forward and backward passes while remaining numerically stable for most fine-tuning tasks. bf16 has the same exponent range as fp32 and therefore a much wider dynamic range than fp16, at the cost of fewer mantissa bits; this makes it the more stable choice on hardware that supports it. The Trainer handles device placement and precision conversion automatically.
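The range-versus-precision difference between the two 16-bit formats can be checked with the standard library alone. In this sketch, fp16 uses the struct module's half-precision format, and bf16 is emulated by keeping the top 16 bits of the float32 encoding. That emulation truncates instead of rounding, which is a simplifying assumption; real hardware typically rounds. The helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    This truncates rather than rounds, an assumption made for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as a finite half-precision value."""
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False


print(fits_fp16(70000.0))              # fp16 tops out at 65504
print(to_bf16(70000.0))                # in range for bf16, but stored coarsely
print(to_fp16(1.001), to_bf16(1.001))  # fp16 keeps more mantissa bits
```

Large values such as unscaled gradients or loss terms are where fp16 overflows first, which is why fp16 training usually relies on loss scaling while bf16 training generally does not.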

EXERCISE

Configure and initialize a complete training setup using Hugging Face Trainer with PEFT LoRA. Set up all required components and verify the model has the correct trainable parameter count.

# trainer_setup.py
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import torch

def setup_lora_training(
    model_name: str,
    output_dir: str = "./lora_output",
    rank: int = 8,
    learning_rate: float = 3e-4,
    num_epochs: int = 3,
    batch_size: int = 4,
    gradient_accumulation_steps: int = 4
):
    """Set up complete LoRA fine-tuning with Hugging Face Trainer."""
    
    # Load tokenizer; reuse EOS as the pad token only if none is defined
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Load the base model in bf16 (plain LoRA; no quantization is applied here)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    
    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=rank,
        lora_alpha=2 * rank,  # Adapter output is scaled by lora_alpha / r (here 2)
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # Names used by LLaMA-style models; other architectures differ
        bias="none"
    )
    
    # Apply LoRA to model
    model = get_peft_model(model, lora_config)
    
    # Print trainable vs total parameters
    model.print_trainable_parameters()
    
    # Load and tokenize dataset
    dataset = load_dataset("yahma/alpaca-cleaned", split="train")
    dataset = dataset.select(range(min(1000, len(dataset))))  # Subset for demo
    
    def tokenize_function(examples):
        # Format as instruction tuning
        formatted = []
        for instruction, input_text, output in zip(
            examples["instruction"],
            examples["input"],
            examples["output"]
        ):
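            # Some Alpaca examples have an empty input; omit that section for them
            input_block = f"### Input:\n{input_text}\n\n" if input_text else ""
            text = f"### Instruction:\n{instruction}\n\n{input_block}### Response:\n{output}"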
            formatted.append(text)
        
        return tokenizer(
            formatted,
            truncation=True,
            max_length=512,
            padding="max_length"
        )
    
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names
    )
    
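    # Data collator: with mlm=False it copies input_ids into labels and sets
    # pad positions to -100, so loss is computed on prompt and response alike
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False  # Causal LM, not masked
    )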
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        fp16=False,
        bf16=True,  # Use bf16 for stability
        logging_steps=10,
        save_strategy="epoch",
        save_total_limit=2,
        report_to="none",
        warmup_steps=10,
        lr_scheduler_type="cosine"
    )
    
    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=data_collator
    )
    
    return trainer, model, tokenizer

# Verify setup
def verify_trainable_params(model):
    """Verify only LoRA parameters are trainable."""
    trainable_params = 0
    all_params = 0
    
    for param in model.parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"All parameters: {all_params:,}")
    print(f"Trainable percentage: {100 * trainable_params / all_params:.2f}%")
    
    return {
        "trainable": trainable_params,
        "total": all_params,
        "percentage": 100 * trainable_params / all_params
    }