22. Alignment on Consumer GPU

Chapter 22 of 24 · 20 min

Running alignment training requires significant computational resources. This chapter covers practical approaches for alignment on consumer-grade hardware.

Memory Requirements

Full alignment training of a 7B model needs several times the 24GB available on a consumer GPU; only the minimal setup, with a frozen base model, fits:

| Component | Full RLHF (PPO) | DPO/ORPO | Minimal |
|---|---|---|---|
| Model (7B) | 28GB | 14GB | 14GB |
| Optimizer states | 56GB | 28GB | 0 (frozen) |
| Gradients | 14GB | 7GB | 0 |
| Activations | 8GB | 4GB | 4GB |
| Total | ~106GB | ~53GB | ~18GB |
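The larger entries in the table follow from bytes-per-parameter arithmetic. The sketch below is a back-of-the-envelope estimate, not a profiler: it assumes exactly 7 billion parameters and decimal gigabytes, and it ignores quantization constants and framework overhead. The function name `param_memory_gb` is illustrative.

```python
def param_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in decimal gigabytes for n_params values of the given width."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9  # assumption: a "7B" model has exactly 7 billion parameters

fp16_weights = param_memory_gb(N_PARAMS, 2)      # 16-bit weights
adam_fp32_states = param_memory_gb(N_PARAMS, 8)  # two fp32 moments per parameter
four_bit_weights = param_memory_gb(N_PARAMS, 0.5)  # 4-bit weights, constants ignored

print(f"fp16 weights:     {fp16_weights:.1f} GB")
print(f"Adam fp32 states: {adam_fp32_states:.1f} GB")
print(f"4-bit weights:    {four_bit_weights:.1f} GB")
```

Under these assumptions the 16-bit weights come to 14 GB and the Adam states to 56 GB, matching the corresponding table entries. At 4 bits the frozen weights shrink to roughly 3.5 GB, which is what leaves headroom for activations and adapters on a 24GB card.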

QLoRA: Quantized Low-Rank Adaptation

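import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for matrix multiplications
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Prepare the quantized model for training and attach the adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()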
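To see how small the trainable part is under this LoRA configuration, the following sketch counts adapter parameters. It assumes the Llama-2-7B dimensions (hidden size 4096, 32 layers, square `q_proj` and `v_proj` matrices), which this chapter does not state; the helper `lora_param_count` is illustrative.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed Llama-2-7B dimensions (not stated in this chapter)
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES = ["q_proj", "v_proj"]  # both hidden_size x hidden_size here

R = 64
LORA_ALPHA = 16

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"adapter size at fp32: {total * 4 / 1e6:.0f} MB")
print(f"LoRA scaling (alpha / r): {scaling}")
```

With these assumptions the adapters hold about 33.5 million parameters, well under 1% of the base model, and the update is scaled by 0.25.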

Consumer GPU Training Script

#!/bin/bash
# align_on_consumer_gpu.sh

export CUDA_VISIBLE_DEVICES=0
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

python train_alignment.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset preference_data.json \
    --method dpo \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --epochs 3 \
    --lora_r 64 \
    --use_4bit_quantization \
    --max_seq_length 2048
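The script above calls `train_alignment.py`, which this chapter does not show. A minimal sketch of an argument parser that would accept those flags might look like the following; the defaults and the helper `effective_batch_size` are assumptions for illustration, not the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in align_on_consumer_gpu.sh."""
    parser = argparse.ArgumentParser(description="Alignment training (sketch)")
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--method", default="dpo")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lora_r", type=int, default=64)
    parser.add_argument("--use_4bit_quantization", action="store_true")
    parser.add_argument("--max_seq_length", type=int, default=2048)
    return parser


def effective_batch_size(args: argparse.Namespace) -> int:
    """Examples contributing to each optimizer step."""
    return args.batch_size * args.gradient_accumulation_steps


# Parse the same flags the shell script passes
args = build_parser().parse_args([
    "--model_name", "meta-llama/Llama-2-7b-hf",
    "--dataset", "preference_data.json",
    "--method", "dpo",
    "--batch_size", "4",
    "--gradient_accumulation_steps", "4",
    "--learning_rate", "1e-4",
    "--epochs", "3",
    "--lora_r", "64",
    "--use_4bit_quantization",
    "--max_seq_length", "2048",
])
print(effective_batch_size(args))  # 4 * 4 = 16
```

Gradient accumulation is what lets a per-step batch of 4 behave like a batch of 16 for the optimizer without holding 16 sequences' activations in memory at once.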

LoRA Adapter Training

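import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def sequence_logps(model, tokenizer, prompts, responses):
    """Summed log-probability of each response given its prompt."""
    # Assumes a right-padding tokenizer with pad_token set; the prompt/response
    # boundary is approximated by the token length of the prompt alone.
    enc = tokenizer([p + r for p, r in zip(prompts, responses)],
                    return_tensors="pt", padding=True).to(model.device)
    prompt_lens = [len(tokenizer(p).input_ids) for p in prompts]
    logits = model(**enc).logits[:, :-1]
    targets = enc.input_ids[:, 1:]
    token_logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = token_logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = enc.attention_mask[:, 1:].clone()
    for i, n in enumerate(prompt_lens):
        mask[i, : n - 1] = 0  # score response tokens only
    return (token_logps * mask).sum(-1)

def train_lora_alignment(model, tokenizer, data, beta=0.1):
    """Train only LoRA adapters with DPO, keep base model frozen.

    `data` is a list of dicts with "prompt", "chosen", and "rejected" strings.
    """
    # Only LoRA parameters require gradients
    trainable_params = [p for n, p in model.named_parameters() if "lora_" in n]

    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

    for epoch in range(3):
        for batch in DataLoader(data, batch_size=4, shuffle=True):
            prompt, chosen, rejected = batch["prompt"], batch["chosen"], batch["rejected"]

            # Forward pass with adapters active (policy)
            logps_chosen = sequence_logps(model, tokenizer, prompt, chosen)
            logps_rejected = sequence_logps(model, tokenizer, prompt, rejected)

            # Reference log-probs: the same base model with adapters disabled
            with torch.no_grad(), model.disable_adapter():
                ref_chosen = sequence_logps(model, tokenizer, prompt, chosen)
                ref_rejected = sequence_logps(model, tokenizer, prompt, rejected)

            # DPO loss
            margin = (logps_chosen - ref_chosen) - (logps_rejected - ref_rejected)
            loss = -F.logsigmoid(beta * margin).mean()
            loss.backward()

            optimizer.step()
            optimizer.zero_grad()

    return model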

# Save only LoRA adapters
def save_adapter(model, output_dir):
    model.save_pretrained(output_dir)
    # ~100MB instead of 14GB for full model
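The DPO loss in the training loop compares how far the policy has moved from a frozen reference model on the chosen response versus the rejected one. The following is a scalar sketch for a single preference pair using only the standard library; the name `dpo_pair_loss`, the four-argument signature, and `beta=0.1` are assumptions for illustration, not this chapter's API.

```python
import math


def dpo_pair_loss(policy_chosen: float, policy_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0, so the loss is log(2)
start = dpo_pair_loss(-12.0, -13.0, -12.0, -13.0)

# Policy has raised the chosen response and lowered the rejected one
improved = dpo_pair_loss(-10.0, -15.0, -12.0, -13.0)

print(f"loss at initialization: {start:.4f}")
print(f"loss after improvement: {improved:.4f}")
```

Because LoRA adapters typically start as a zero update, the first training steps should report a loss near log(2), about 0.693; a value far from that at step zero usually points to a bug in the log-probability computation.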

Merging Adapters for Inference

def merge_adapter_for_inference(base_model, adapter_path):
    """Merge LoRA adapter into base model for standard inference."""
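    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the unquantized base model in half precision so the adapter can be merged into its weights
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)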
    model = PeftModel.from_pretrained(model, adapter_path)
    
    # Merge and unload
    merged_model = model.merge_and_unload()
    
    return merged_model  # Standard model for inference
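Conceptually, merging folds the low-rank update into each adapted weight matrix, W' = W + (alpha / r) * B A, so inference no longer needs a separate adapter path. The toy example below illustrates that identity with plain Python lists; it is a sketch of the arithmetic under that assumption, not the library's implementation, and all values are arbitrary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy 2x2 layer with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in
B = [[0.5], [-0.5]]   # d_out x r
x = [[3.0], [4.0]]    # input as a column vector

merged = merge_lora(W, A, B, alpha=2, r=1)

# Output through the merged weight
y_merged = matmul(merged, x)

# Output through the base weight plus the separate adapter path
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + 2 * a[0]] for b, a in zip(base_out, adapter_out)]

print(y_merged, y_adapter)
```

Both paths produce the same output, which is why a merged model can be served as a standard model with no adapter overhead.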

Training Time Estimates

| Configuration | GPU | Time per Epoch |
|---|---|---|
| 7B model, full training | A100 80GB | 2 hours |
| 7B model, QLoRA + LoRA | RTX 4090 24GB | 8 hours |
| 7B model, QLoRA + LoRA | RTX 3090 24GB | 10 hours |
| 13B model, QLoRA + LoRA | RTX 4090 24GB | 16 hours |

EXERCISE

Set up QLoRA alignment training on a consumer GPU with a 7B model. Measure throughput, memory usage, and alignment improvement after one epoch.
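For the throughput and memory measurements, one possible starting point is a small meter that counts processed tokens against wall-clock time, plus a guarded read of PyTorch's peak-allocation counter. `ThroughputMeter` and `peak_gpu_memory_gb` are hypothetical helper names; the memory helper returns `None` when PyTorch or CUDA is unavailable.

```python
import time


class ThroughputMeter:
    """Tracks tokens processed per second since construction."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self.tokens = 0

    def update(self, n_tokens: int) -> None:
        """Record n_tokens processed in the latest training step."""
        self.tokens += n_tokens

    def tokens_per_second(self) -> float:
        elapsed = self._clock() - self._start
        return self.tokens / elapsed if elapsed > 0 else 0.0


def peak_gpu_memory_gb():
    """Peak allocated CUDA memory in GB, or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.max_memory_allocated() / 1e9


# Usage inside a training loop (token counts here are placeholders)
meter = ThroughputMeter()
for step_tokens in (4 * 2048, 4 * 2048):
    meter.update(step_tokens)
print(f"{meter.tokens} tokens processed, peak memory: {peak_gpu_memory_gb()}")
```

Calling `meter.update` once per step with the number of non-padding tokens in the batch gives a figure that is comparable across batch sizes and sequence lengths.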