Alignment on Consumer GPU — RLHF, DPO, and PPO (Chapter 22)

Alignment training is memory-intensive: in addition to the model weights, the GPU must hold gradients, optimizer states, and activations. This chapter covers practical approaches for running alignment on consumer-grade hardware.

Memory Requirements

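Full-parameter alignment of a 7B model far exceeds the 24GB available on a consumer GPU; of the configurations below, only the frozen-model "Minimal" setup fits: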

Component	Full RLHF (PPO)	DPO/ORPO	Minimal
Model (7B)	28GB	14GB	14GB
Optimizer states	56GB	28GB	0 (frozen)
Gradients	14GB	7GB	0
Activations	8GB	4GB	4GB
Total	~106GB	~53GB	~18GB
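
Several entries in the table follow from simple per-parameter arithmetic. The sketch below is one possible accounting, not necessarily the one used to build the table. It assumes a round 7e9 parameters, decimal gigabytes, 2-byte (fp16/bf16) weights and gradients, and two 4-byte Adam moment buffers per weight; the function name is illustrative. Under those assumptions it reproduces the 14GB model row and the PPO column's gradient and optimizer rows.

```python
def tensor_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory in decimal gigabytes for num_params values of the given width."""
    return num_params * bytes_per_param / 1e9


PARAMS_7B = 7e9  # assumed round parameter count for a "7B" model

half_precision_weights = tensor_memory_gb(PARAMS_7B, 2)  # one fp16/bf16 copy
half_precision_grads = tensor_memory_gb(PARAMS_7B, 2)    # one gradient per weight
adam_states = tensor_memory_gb(PARAMS_7B, 2 * 4)         # two fp32 moments per weight

print(f"weights:   {half_precision_weights:.0f} GB")
print(f"gradients: {half_precision_grads:.0f} GB")
print(f"optimizer: {adam_states:.0f} GB")
```

Activation memory is not covered by this arithmetic, since it depends on batch size, sequence length, and whether gradient checkpointing is enabled.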

QLoRA: Quantized Low-Rank Adaptation

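import torch
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # matmuls run in bf16
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Freeze the base weights and prepare the quantized model for adapter training
model = prepare_model_for_kbit_training(model)

# LoRA adapters
lora_config = LoraConfig(
    r=64,                                   # adapter rank
    lora_alpha=16,                          # update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],    # attention projections that receive adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)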
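
To see why this configuration fits on a 24GB card, it helps to count what is actually trained. The following is a rough sketch that assumes the Llama-2-7B shapes (hidden size 4096, 32 decoder layers, with q_proj and v_proj each mapping 4096 to 4096) and ignores the small overhead of quantization constants. The helper name is illustrative.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed Llama-2-7B shapes; adjust for other architectures.
HIDDEN = 4096
LAYERS = 32
TARGETS_PER_LAYER = 2  # q_proj and v_proj
RANK = 64

trainable = LAYERS * TARGETS_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)

# 4 bits = 0.5 bytes per weight, ignoring quantization constants
base_4bit_gb = 7e9 * 0.5 / 1e9

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of a 7e9-parameter base: {trainable / 7e9:.2%}")
print(f"4-bit base weights: ~{base_4bit_gb:.1f} GB")
```

Under these assumptions the adapters hold about 33.6 million parameters, well under 1% of the base model, so their gradients and optimizer states are small next to the roughly 3.5GB of quantized weights.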

Consumer GPU Training Script

#!/bin/bash
# align_on_consumer_gpu.sh

export CUDA_VISIBLE_DEVICES=0
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

python train_alignment.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset preference_data.json \
    --method dpo \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --epochs 3 \
    --lora_r 64 \
    --use_4bit_quantization \
    --max_seq_length 2048
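
The source of train_alignment.py is not shown in this chapter. As a hedged sketch, its command-line interface could be declared as below so that the flags in the shell script parse. The defaults and the list of accepted methods are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI matching the flags used in align_on_consumer_gpu.sh."""
    p = argparse.ArgumentParser(description="Preference alignment with QLoRA")
    p.add_argument("--model_name", required=True)
    p.add_argument("--dataset", required=True)
    p.add_argument("--method", choices=["dpo", "orpo", "ppo"], default="dpo")
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--gradient_accumulation_steps", type=int, default=1)
    p.add_argument("--learning_rate", type=float, default=1e-4)
    p.add_argument("--epochs", type=int, default=3)
    p.add_argument("--lora_r", type=int, default=64)
    p.add_argument("--use_4bit_quantization", action="store_true")
    p.add_argument("--max_seq_length", type=int, default=2048)
    return p


# Parse the same values the shell script passes.
args = build_parser().parse_args([
    "--model_name", "meta-llama/Llama-2-7b-hf",
    "--dataset", "preference_data.json",
    "--method", "dpo",
    "--batch_size", "4",
    "--gradient_accumulation_steps", "4",
    "--use_4bit_quantization",
])

# Sequences contributing to each optimizer step
effective_batch = args.batch_size * args.gradient_accumulation_steps
print(effective_batch)
```

With a per-device batch of 4 and 4 accumulation steps, each optimizer update sees 16 preference pairs while only 4 are held in memory at a time.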

LoRA Adapter Training

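import torch
import torch.nn.functional as F
from peft import get_peft_model
from torch.utils.data import DataLoader

def sequence_logps(model, tokenizer, prompts, responses):
    """Summed log-prob of each response; assumes right padding and a set pad token."""
    enc = tokenizer([p + r for p, r in zip(prompts, responses)],
                    return_tensors="pt", padding=True).to(model.device)
    logits = model(**enc).logits[:, :-1]
    targets = enc["input_ids"][:, 1:]
    token_logps = logits.log_softmax(-1).gather(2, targets.unsqueeze(-1)).squeeze(-1)
    mask = enc["attention_mask"][:, 1:].clone()
    for i, p in enumerate(prompts):
        mask[i, : len(tokenizer(p)["input_ids"]) - 1] = 0  # score response tokens only
    return (token_logps * mask).sum(-1)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss; beta=0.1 is an assumed default."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def train_lora_alignment(model, adapter_config, data, tokenizer):
    """Train only LoRA adapters, keep base model frozen."""
    model = get_peft_model(model, adapter_config)
    # Only LoRA parameters require gradients
    trainable_params = [p for n, p in model.named_parameters() if "lora_" in n]
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

    for epoch in range(3):
        for batch in DataLoader(data, batch_size=4, shuffle=True):
            prompt, chosen, rejected = batch["prompt"], batch["chosen"], batch["rejected"]

            # Forward pass through the policy (adapters enabled)
            logps_chosen = sequence_logps(model, tokenizer, prompt, chosen)
            logps_rejected = sequence_logps(model, tokenizer, prompt, rejected)

            # Reference log-probs: same base weights with adapters disabled, no gradients
            with torch.no_grad(), model.disable_adapter():
                ref_chosen = sequence_logps(model, tokenizer, prompt, chosen)
                ref_rejected = sequence_logps(model, tokenizer, prompt, rejected)

            # DPO loss
            loss = dpo_loss(logps_chosen, logps_rejected, ref_chosen, ref_rejected)
            loss.backward()

            optimizer.step()
            optimizer.zero_grad()

    return model

# Save only LoRA adapters
def save_adapter(model, output_dir):
    model.save_pretrained(output_dir)
    # ~100MB instead of 14GB for full model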
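
For a single preference pair, the standard DPO objective is -log sigmoid(beta * [(log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected))]), where pi is the policy being trained and pi_ref is the frozen reference. The scalar sketch below works through two cases to show how the loss behaves. The log-probability values are made up for illustration, and beta=0.1 is an assumed setting rather than a value taken from this chapter.

```python
import math


def dpo_loss_scalar(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# At initialization the policy equals the reference, so the margin is zero
# and the loss is log(2).
start = dpo_loss_scalar(-42.0, -40.0, -42.0, -40.0)

# Later, the policy has raised the chosen response by 4 nats and lowered the
# rejected one by 5 nats relative to the reference, giving a margin of 9.
later = dpo_loss_scalar(-38.0, -45.0, -42.0, -40.0)

print(f"loss at initialization: {start:.4f}")
print(f"loss after improvement: {later:.4f}")
```

Because the loss depends only on how far the policy has moved relative to the reference, the reference log-probabilities cannot be dropped from the computation.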

Merging Adapters for Inference

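import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

def merge_adapter_for_inference(base_model, adapter_path):
    """Merge LoRA adapter into base model for standard inference."""
    # Load the base weights in half precision (not 4-bit) so the adapter
    # can be folded into ordinary dense weights
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(model, adapter_path)

    # Merge and unload
    merged_model = model.merge_and_unload()

    return merged_model  # Standard model for inference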
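
Conceptually, merging folds the low-rank update into each adapted weight matrix: W_merged = W + (lora_alpha / r) * B A. The toy sketch below uses pure Python and tiny made-up matrices, not the library's implementation, to check that the merged weight gives the same output as running the base and adapter paths separately. It uses rank 2 and an alpha of 0.5 so that the scale matches the 16/64 = 0.25 ratio of the configuration above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged dense weight."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: W is 2x3, A is r x 3, B is 2 x r, with r = 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0], [0.0, 4.0, 0.0]]
B = [[2.0, 0.0], [4.0, 4.0]]
x = [[1.0], [1.0], [1.0]]  # column-vector input
ALPHA, R = 0.5, 2          # scale = 0.25

merged = merge_lora(W, A, B, ALPHA, R)

# Unmerged adapter path: W x + (alpha / r) * B (A x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
two_path = [[b[0] + (ALPHA / R) * a[0]] for b, a in zip(base_out, adapter_out)]

same = matmul(merged, x) == two_path
print(same)
```

After merging, inference needs a single matrix multiply per layer and no adapter-specific code, which is why the merged model behaves like a standard checkpoint.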

Training Time Estimates

Configuration	GPU	Time per Epoch
7B model, full training	A100 80GB	2 hours
7B model, QLoRA (4-bit base + LoRA)	RTX 4090 24GB	8 hours
7B model, QLoRA (4-bit base + LoRA)	RTX 3090 24GB	10 hours
13B model, QLoRA (4-bit base + LoRA)	RTX 4090 24GB	16 hours