22. Alignment on Consumer GPU
Chapter 22 of 24 · 20 min
Alignment training requires significant computational resources: full RLHF on a 7B model needs roughly 106GB of GPU memory. This chapter covers practical approaches for running alignment on consumer-grade hardware with 24GB of memory.
Memory Requirements
Full alignment training of a 7B model exceeds consumer GPU capacity. Only the minimal setup, with the base model frozen, comes in under 24GB:
| Component | Full RLHF (PPO) | DPO/ORPO | Minimal |
|---|---|---|---|
| Model (7B) | 28GB | 14GB | 14GB |
| Optimizer states | 56GB | 28GB | 0 (frozen) |
| Gradients | 14GB | 7GB | 0 |
| Activations | 8GB | 4GB | 4GB |
| Total | ~106GB | ~53GB | ~18GB |
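The model-weight row follows from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate only, assuming decimal gigabytes (1GB = 10^9 bytes) and ignoring quantization constants and framework overhead. The function name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


PARAMS_7B = 7e9  # nominal parameter count for a "7B" model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(PARAMS_7B, bits):.1f} GB")
```

At 4 bits per parameter the weights shrink to about 3.5GB, which is what leaves room on a 24GB card for adapters, activations, and optimizer state.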
QLoRA: Quantized Low-Rank Adaptation
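import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for the matrix multiplications
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Prepare the quantized model for training and attach the adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)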
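To get a feel for how small the trainable part is, the adapter parameter count can be worked out by hand. The sketch below assumes the usual Llama-2-7B shapes (hidden size 4096, 32 decoder layers, `q_proj` and `v_proj` each mapping 4096 to 4096); those shapes and the helper name `lora_param_count` are assumptions for illustration, not values read from the model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama-2-7B shapes (not read from the checkpoint).
HIDDEN, LAYERS = 4096, 32
RANK, ALPHA = 64, 16
TARGETS_PER_LAYER = 2  # q_proj and v_proj

per_module = lora_param_count(HIDDEN, HIDDEN, RANK)
total = per_module * TARGETS_PER_LAYER * LAYERS
scaling = ALPHA / RANK  # factor applied to the adapter's output

print(f"trainable LoRA parameters: {total:,}")
print(f"fraction of 7B parameters: {total / 7e9:.2%}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

Under these assumptions the adapters hold about 33.5 million parameters, under half a percent of the base model, so their gradients and optimizer states are a small fraction of what full fine-tuning needs.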
Consumer GPU Training Script
#!/bin/bash
# align_on_consumer_gpu.sh
export CUDA_VISIBLE_DEVICES=0
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

python train_alignment.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset preference_data.json \
    --method dpo \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --epochs 3 \
    --lora_r 64 \
    --use_4bit_quantization \
    --max_seq_length 2048
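The `train_alignment.py` script itself is not shown in this chapter. As a hypothetical sketch of how its command-line interface might be declared, the parser below accepts the same flags; the defaults, the `choices` list, and the helper names are assumptions, not the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in the shell script."""
    p = argparse.ArgumentParser(description="Alignment training (sketch)")
    p.add_argument("--model_name", required=True)
    p.add_argument("--dataset", required=True)
    p.add_argument("--method", choices=["dpo", "orpo"], default="dpo")
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--gradient_accumulation_steps", type=int, default=4)
    p.add_argument("--learning_rate", type=float, default=1e-4)
    p.add_argument("--epochs", type=int, default=3)
    p.add_argument("--lora_r", type=int, default=64)
    p.add_argument("--use_4bit_quantization", action="store_true")
    p.add_argument("--max_seq_length", type=int, default=2048)
    return p


def effective_batch_size(args: argparse.Namespace) -> int:
    """Samples contributing to each optimizer step."""
    return args.batch_size * args.gradient_accumulation_steps


# Parse the same arguments the shell script passes.
args = build_parser().parse_args([
    "--model_name", "meta-llama/Llama-2-7b-hf",
    "--dataset", "preference_data.json",
    "--method", "dpo",
    "--batch_size", "4",
    "--gradient_accumulation_steps", "4",
    "--use_4bit_quantization",
])
print("effective batch size:", effective_batch_size(args))
```

Gradient accumulation is what lets a small per-step batch behave like a larger one: with the values in the script, each optimizer update sees 4 × 4 = 16 preference pairs while only 4 are in memory at once.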
LoRA Adapter Training
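import torch
from torch.utils.data import DataLoader

def train_lora_alignment(model, adapter_config, data, epochs=3, batch_size=4, lr=1e-4):
    """Train only LoRA adapters, keep base model frozen.

    Assumes `model` is already a PEFT-wrapped model (so `adapter_config` is
    kept only for reference), `data` is a dataset of dicts with "prompt",
    "chosen", and "rejected" entries, and `sequence_logps` and `dpo_loss`
    are helper functions defined elsewhere.
    """
    # Only LoRA parameters require gradients
    trainable_params = [p for n, p in model.named_parameters() if "lora_" in n]
    optimizer = torch.optim.AdamW(trainable_params, lr=lr)
    model.train()
    for epoch in range(epochs):
        for batch in DataLoader(data, batch_size=batch_size, shuffle=True):
            prompt = batch["prompt"]
            chosen = batch["chosen"]
            rejected = batch["rejected"]
            # Forward pass through the policy (base model + adapters)
            logps_chosen = sequence_logps(model, prompt, chosen)
            logps_rejected = sequence_logps(model, prompt, rejected)
            # Reference log-probs: the same model with the adapters disabled
            with torch.no_grad(), model.disable_adapter():
                ref_chosen = sequence_logps(model, prompt, chosen)
                ref_rejected = sequence_logps(model, prompt, rejected)
            # DPO loss compares policy and reference log-probs
            loss = dpo_loss(logps_chosen, logps_rejected, ref_chosen, ref_rejected)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# Save only LoRA adapters
def save_adapter(model, output_dir):
    model.save_pretrained(output_dir)
    # ~100MB instead of 14GB for full model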
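The training loop calls a `dpo_loss` helper that is not defined in this chapter. As a minimal sketch of the standard DPO objective, the version below works on plain floats for a single preference pair. It takes reference-model log-probabilities in addition to the policy ones, since DPO scores the policy relative to a frozen reference. The default `beta=0.1` is an assumed value, and a real training loop would use tensor operations so gradients can flow.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference the margin is zero and the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one.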
Merging Adapters for Inference
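import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

def merge_adapter_for_inference(base_model, adapter_path):
    """Merge LoRA adapter into base model for standard inference."""
    # Load the unquantized base weights in half precision (~14GB for a 7B model)
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(model, adapter_path)
    # Merge and unload: fold the adapter weights into the base weights
    merged_model = model.merge_and_unload()
    return merged_model  # Standard model for inference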
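Conceptually, merging folds each adapter into its base weight matrix as W' = W + (alpha / r) · B · A, after which the adapter is no longer needed at inference time. The toy example below illustrates that identity in pure Python with tiny matrices; it is a sketch of the standard LoRA formulation, not PEFT's implementation, and the helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


# Toy shapes: d_out = 2, d_in = 3, rank r = 1
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]       # r x d_in
B = [[0.5], [-1.0]]         # d_out x r
alpha, r = 2, 1
x = [[1.0], [1.0], [1.0]]   # input as a column vector

# Unmerged forward pass: base output plus scaled adapter output
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / r) * d[0]] for b, d in zip(base_out, adapter_out)]

# Merged forward pass: a single matrix multiply
merged = matmul(merge_lora(W, A, B, alpha, r), x)

print("unmerged:", unmerged)
print("merged:  ", merged)
```

Both paths produce the same output, which is why a merged model can be served like any standard checkpoint with no adapter overhead.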
Training Time Estimates
| Configuration | GPU | Time per Epoch |
|---|---|---|
| 7B model, full training | A100 80GB | 2 hours |
| 7B model, QLoRA (4-bit base + LoRA) | RTX 4090 24GB | 8 hours |
| 7B model, QLoRA (4-bit base + LoRA) | RTX 3090 24GB | 10 hours |
| 13B model, QLoRA (4-bit base + LoRA) | RTX 4090 24GB | 16 hours |
EXERCISE
Set up QLoRA alignment training on a consumer GPU with a 7B model. Measure throughput, memory usage, and alignment improvement after one epoch.
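One possible starting point for the throughput measurement is a small timer that counts samples and tokens per second. The class below is a hypothetical helper, not part of any library, and the `time.sleep` call stands in for a real training step.

```python
import time


class ThroughputMeter:
    """Accumulates sample and token counts and reports rates per second."""

    def __init__(self):
        self.samples = 0
        self.tokens = 0
        self.start = time.perf_counter()

    def update(self, num_samples: int, num_tokens: int) -> None:
        self.samples += num_samples
        self.tokens += num_tokens

    def report(self) -> dict:
        # Guard against a zero elapsed time on very fast calls
        elapsed = max(time.perf_counter() - self.start, 1e-9)
        return {
            "elapsed_s": elapsed,
            "samples_per_s": self.samples / elapsed,
            "tokens_per_s": self.tokens / elapsed,
        }


meter = ThroughputMeter()
for _ in range(5):
    time.sleep(0.01)  # placeholder for one training step
    meter.update(num_samples=4, num_tokens=4 * 2048)

stats = meter.report()
print(f"{stats['samples_per_s']:.1f} samples/s, {stats['tokens_per_s']:.0f} tokens/s")
```

For the memory part of the exercise, PyTorch's `torch.cuda.max_memory_allocated()` reports the peak memory allocated by tensors on the current device, and can be read after a few training steps.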