RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated

COURSE · BLD · I003

Fine-Tuning with LoRA and QLoRA

Learn fine-tuning with LoRA and QLoRA through RunLocalAI's practical lens: fine-tuning, LoRA, QLoRA and training, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

24 chapters·16h·Builder track·By Fredoline Eruo
PREREQUISITES
  • B002
  • B004

Why this course matters

Fine-Tuning with LoRA and QLoRA is for builders turning local models into working tools, agents and retrieval systems. It connects fine-tuning, LoRA, QLoRA, training and adapters to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?

What you will be able to do

By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.

How to use this course

Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as "Why Fine-Tune?", "Fine-Tuning vs RAG", "LoRA Theory" and "Rank Selection", and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.

CHAPTERS
  1. 01 · Why Fine-Tune? (15 min): Parameter-efficient fine-tuning updates a small set of parameters to instill task-specific behavior while retaining the bulk of pre-trained knowledge.
  2. 02 · Fine-Tuning vs RAG (15 min): RAG adds external knowledge at inference time; fine-tuning instills behavioral patterns in model weights.
  3. 03 · LoRA Theory (15 min): LoRA constrains weight updates to low-rank decompositions, reducing trainable parameters by orders of magnitude while preserving fine-tuning effectiveness.
  4. 04 · Rank Selection (15 min): Rank selection balances expressiveness against parameter efficiency; moderate ranks (8-16) handle most tasks while higher ranks serve complex behavioral modifications.
  5. 05 · Target Modules (15 min): Targeting the attention Q and V projections captures most behavioral modifications; including FFN layers adds capacity for knowledge-heavy adaptations at increased parameter cost.
  6. 06 · QLoRA: Quantized LoRA (20 min): QLoRA combines 4-bit base model quantization with higher-precision LoRA adapters, enabling large model fine-tuning on consumer GPUs by reducing base model memory by approximately 75% relative to 16-bit weights.
  7. 07 · 4-bit NormalFloat (20 min): NF4 quantization uses non-uniform levels optimized for normally distributed weights, concentrating precision where weights cluster near zero.
  8. 08 · Dataset Preparation (15 min): Dataset quality sets the ceiling on fine-tuning performance; even large datasets produce poor models when label noise or formatting inconsistencies are present.
  9. 09 · Data Formatting (20 min): Training format must match inference format; consistent use of special tokens and role markers teaches the model to interpret and generate structured outputs.
  10. 10 · Chat Template (15 min): Chat templates automate consistent formatting across models and datasets using declarative specifications that eliminate manual format string construction.
  11. 11 · Hugging Face Trainer (20 min): Hugging Face Trainer provides production-ready training infrastructure with LoRA support through PEFT integration, handling distributed training, checkpointing, and evaluation automatically.
  12. 12 · Training Arguments (20 min): Training arguments require careful tuning; learning rate, batch size, and scheduler configuration have a larger impact on fine-tuning outcomes than most other hyperparameters.
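
The arithmetic behind chapters 03, 04 and 06 can be sketched in a few lines. This is a minimal illustration, not the course's own tooling: the 4096 x 4096 projection size and the 7e9 parameter count are assumptions chosen to resemble a 7B model, and both function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed projection size for illustration only
full = 4096 * 4096
for rank in (8, 16, 64):
    lora = lora_trainable_params(4096, 4096, rank)
    print(f"rank {rank}: {lora:,} trainable vs {full:,} frozen ({100 * lora / full:.2f}%)")

# Base-model weights at 16-bit versus 4-bit, for an assumed 7e9 parameters
fp16_gb = weight_memory_gb(7e9, 16)
four_bit_gb = weight_memory_gb(7e9, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {four_bit_gb:.1f} GB, "
      f"reduction: {100 * (1 - four_bit_gb / fp16_gb):.0f}%")
```

The 4-bit figure ignores quantization constants, activations and optimizer state, so real memory use is higher than this estimate.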
  13. 13 · Gradient Checkpointing (20 min)

Fine-tuning large models on consumer hardware often hits a memory wall during backpropagation. Storing every intermediate activation for every layer consumes gigabytes: a 7B parameter model can need 20GB+ for activations alone at batch size 1 in fp16, depending on sequence length. Gradient checkpointing relieves this bottleneck by discarding most activations and recomputing them during the backward pass.

The technique divides the model into segments. Instead of caching every activation, only the input to each segment is stored. During backpropagation, the forward pass for each segment is recomputed to derive the gradients. This trades approximately 20-30% extra compute for a roughly 50-70% reduction in activation memory.

```python
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Verify it's active (is_gradient_checkpointing is a property, not a method)
print(f"Gradient checkpointing: {model.is_gradient_checkpointing}")
```

The `gradient_checkpointing_enable()` method switches the model's transformer blocks to checkpointed forward passes. For custom models, use `torch.utils.checkpoint.checkpoint()` explicitly around layer groups.

**Failure mode**: Activations are recomputed on every backward pass, so training becomes slower. If the model already fits in memory at the batch size you want, checkpointing costs time and buys nothing. Profile before committing.

**When it matters most**: QLoRA training with 4-bit quantized base models. The frozen base weights receive no gradients and need no optimizer state, so activation memory dominates the budget. Gradient checkpointing becomes essential when the effective batch size approaches the memory ceiling.

```python
# For custom training loops, wrap forward passes
from torch.utils.checkpoint import checkpoint

def forward_with_checkpoint(module, *args, **kwargs):
    return checkpoint(module, *args, use_reentrant=False, **kwargs)
```

The non-reentrant variant (`use_reentrant=False`) is the one PyTorch recommends. It accepts keyword arguments and does not require that at least one input has `requires_grad=True`, a trap that can silently drop gradients when the embedding layer is frozen.
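
The segment trade-off described in this chapter can be sketched as simple counting. This is a toy model under stated assumptions: each layer is assumed to hold `per_layer_units` equally sized activation tensors (a hypothetical figure), and the function names are illustrative. It ignores weights, optimizer state and attention-specific memory.

```python
import math


def baseline_units(num_layers: int, per_layer_units: int) -> int:
    """Without checkpointing, every layer keeps all of its activations."""
    return num_layers * per_layer_units


def checkpointed_peak_units(num_layers: int, segment_size: int, per_layer_units: int) -> int:
    """One saved input per segment, plus the full activations of the
    single segment being recomputed during the backward pass."""
    num_segments = math.ceil(num_layers / segment_size)
    return num_segments + segment_size * per_layer_units


layers, units = 32, 10  # assumed values for illustration
base = baseline_units(layers, units)
for seg in (1, 2, 4, 8):
    peak = checkpointed_peak_units(layers, seg, units)
    print(f"segment size {seg}: {peak} units vs {base} baseline "
          f"({100 * (1 - peak / base):.0f}% fewer)")
```

Smaller segments store more checkpoints but recompute less at once; the best split depends on how much memory a single layer's activations take.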
  14. 14 · Training Loop (20 min)

The training loop is the execution engine that translates hyperparameter choices into learned weights. A reliable implementation handles data loading, gradient computation, optimizer updates, and logging with clean separation of concerns.

The core sequence: load a batch → compute loss → backpropagate → update weights → repeat. Each iteration must preserve numerical stability and avoid gradient accumulation errors that produce silent failures.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def train_epoch(model, dataloader, optimizer, scheduler, device, gradient_accumulation_steps=4):
    model.train()
    total_loss = 0
    optimizer.zero_grad()

    for step, batch in enumerate(dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )

        # Scale loss for gradient accumulation
        loss = outputs.loss / gradient_accumulation_steps
        loss.backward()

        # Update weights once every gradient_accumulation_steps micro-batches
        if (step + 1) % gradient_accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        # Track the unscaled loss for reporting
        total_loss += outputs.loss.item()

    return total_loss / len(dataloader)
```

Gradient clipping guards against exploding gradients, a common failure mode with LoRA adapters on certain data distributions. The threshold of 1.0 works for most transformer architectures, but verify it against your own runs.

**Optimizer selection**: AdamW applies weight decay decoupled from the gradient update, which is the correct form of regularization for Adam-style optimizers. For LoRA, bias and layer norm parameters typically get zero weight decay. QLoRA does not change the loss computation: the LoRA scaling factor (`lora_alpha / r`) is applied inside the adapter, and the quantized base weights stay frozen.

```python
# Optimizer setup for LoRA: trainable parameters only, no decay on biases and norms
trainable = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
no_decay = [p for n, p in trainable if n.endswith("bias") or "norm" in n.lower()]
decay = [p for n, p in trainable if not (n.endswith("bias") or "norm" in n.lower())]
optimizer = AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=2e-4,
)
```
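
Why the loss is divided by `gradient_accumulation_steps` can be checked with a worked example. The sketch below uses a hypothetical one-parameter model `y = w * x` with mean squared error, written in plain Python so the arithmetic is visible; it assumes equally sized micro-batches.

```python
def grad_mse(w, xs, ys):
    """Gradient with respect to w of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

# Gradient computed on the full batch of 8 samples
full_batch_grad = grad_mse(w, xs, ys)

# Same data as 4 micro-batches of 2, each contribution scaled by 1/4
accum_steps = 4
micro = len(xs) // accum_steps
accumulated_grad = 0.0
for i in range(accum_steps):
    chunk_x = xs[i * micro:(i + 1) * micro]
    chunk_y = ys[i * micro:(i + 1) * micro]
    accumulated_grad += grad_mse(w, chunk_x, chunk_y) / accum_steps

print(full_batch_grad, accumulated_grad)
```

Without the division, the accumulated gradient would be `accum_steps` times larger, which behaves like an unintended learning-rate increase.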
  15. 15 · Monitoring Training (20 min)

Fine-tuning without monitoring is flying blind. Loss curves alone don't reveal whether the model is learning the right patterns or degrading unexpectedly. Detailed monitoring catches divergence before hours of training time are wasted.

Key metrics to track: training loss, validation loss, learning rate schedule, gradient norms, and token-level accuracy. Loss alone is insufficient, because a plateauing loss can mask a distribution shift in generated outputs.

```python
from collections import defaultdict
from torch.utils.tensorboard import SummaryWriter

class TrainingMonitor:
    def __init__(self, log_dir="./logs"):
        self.writer = SummaryWriter(log_dir)
        # defaultdict so any metric name can be logged without pre-registration
        self.history = defaultdict(list)

    def log_step(self, step, metrics):
        for key, value in metrics.items():
            self.writer.add_scalar(key, value, step)
            self.history[key].append(value)

    def check_health(self, step):
        """Detect common training pathologies."""
        issues = []
        grad_norms = self.history["grad_norm"]
        val_losses = self.history["val_loss"]

        # Gradient explosion
        if grad_norms and grad_norms[-1] > 100:
            issues.append("Exploding gradients detected")

        # Loss divergence: validation loss rose at each of the last five checks
        if len(val_losses) >= 5:
            recent = val_losses[-5:]
            if all(recent[i] < recent[i + 1] for i in range(len(recent) - 1)):
                issues.append("Validation loss diverging")

        # Vanishing gradients
        if grad_norms and grad_norms[-1] < 0.1:
            issues.append("Gradients vanishing - LR may be too low")

        return issues
```

**Gradient norm tracking** reveals training stability. In a healthy transformer run the global gradient norm often sits roughly between 0.5 and 5.0. Values far outside that range usually point to a configuration problem, but treat the thresholds as starting points and calibrate them against your own runs.

```python
# Log gradient norms during training
def log_gradients(model, step, writer):
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.detach().norm(2)
            total_norm += param_norm.item() ** 2
    # Global L2 norm across all parameters
    total_norm = total_norm ** 0.5
    writer.add_scalar("grad_norm", total_norm, step)
```

**Failure mode**: NaN losses often emerge from numerical overflow in mixed-precision training. The fix is typically reducing the learning rate or enabling loss scaling:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

# In training loop
with torch.cuda.amp.autocast():
    outputs = model(**batch)
    loss = outputs.loss

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # unscale before clipping so the threshold applies to true gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)      # skipped automatically if gradients contain inf or NaN
scaler.update()
optimizer.zero_grad()
```
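
The reason loss scaling helps can be shown with fp16 rounding alone. This is a minimal sketch using Python's `struct` half-precision format, not a reproduction of what `GradScaler` does internally; the gradient value and the scale of 2**16 are assumptions for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8     # assumed gradient magnitude, below fp16's smallest subnormal
scale = 2.0 ** 16    # assumed loss scale

# Stored directly in fp16, the gradient underflows to zero
unscaled = to_fp16(tiny_grad)

# Scaled first, it lands in fp16's representable range
scaled = to_fp16(tiny_grad * scale)

# Dividing by the scale in higher precision recovers the value
recovered = scaled / scale

# Values above the fp16 maximum of 65504 cannot be represented at all
try:
    too_large = to_fp16(70000.0)
except OverflowError:
    too_large = float("inf")

print(unscaled, scaled, recovered, too_large)
```

Scaling pushes small gradients up into range, while a scale that is too large pushes big gradients past 65504; that is why dynamic scalers shrink the scale when they see inf or NaN.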
  16. 16 · Evaluation (20 min)

Evaluation determines whether fine-tuning achieved its intended goal. Generic perplexity scores don't capture task-specific performance. A model with excellent perplexity might fail at the actual use case while another with worse perplexity performs exactly as needed.

Design evaluation to match downstream behavior. If the model will answer questions, evaluate question answering. If it will generate code, evaluate code generation with unit tests. Metrics must proxy real-world success.

```python
import math
import torch
import torch.nn.functional as F

def evaluate_model(model, dataloader, tokenizer, device):
    model.eval()
    metrics = {"loss": [], "accuracy": []}
    pad_id = tokenizer.pad_token_id

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            logits = model(input_ids, attention_mask=attention_mask).logits

            # Causal LM: position t predicts token t+1, so shift before comparing
            shift_logits = logits[:, :-1, :]
            shift_labels = labels[:, 1:]

            # Ignore padding and positions already marked with -100
            mask = shift_labels != -100
            if pad_id is not None:
                mask &= shift_labels != pad_id
            targets = shift_labels.masked_fill(~mask, -100)

            # Cross-entropy loss (perplexity is exp of its mean)
            loss = F.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                targets.reshape(-1),
                ignore_index=-100,
            )
            metrics["loss"].append(loss.item())

            # Token accuracy
            predictions = shift_logits.argmax(dim=-1)
            correct = (predictions == shift_labels) & mask
            metrics["accuracy"].append(correct.sum().item() / mask.sum().item())

    summary = {k: sum(v) / len(v) for k, v in metrics.items()}
    summary["perplexity"] = math.exp(summary["loss"])
    return summary
```

**Task-specific evaluation** requires different metrics:

| Task | Metrics |
|------|---------|
| Classification | F1, Precision, Recall, AUROC |
| Generation | BLEU, ROUGE, BERTScore |
| Code Generation | Pass@k, Compilation Rate |
| Summarization | ROUGE-L, Factuality |

```python
# Example: Code generation evaluation
# generate_code and capture_stdout are helpers assumed to be defined elsewhere.
# Run untrusted generated code only inside a sandbox.
def evaluate_code_generation(model, test_cases, tokenizer, device):
    results = {"compiled": 0, "passed": 0}

    for prompt, expected_output in test_cases:
        generated = generate_code(model, prompt, tokenizer, device)
        try:
            # Attempt to compile
            compile(generated, "<string>", "exec")
            results["compiled"] += 1

            # Check output
            exec_output = capture_stdout(generated)
            if exec_output.strip() == expected_output.strip():
                results["passed"] += 1
        except Exception:
            # Syntax errors and runtime failures both count as not passed
            pass

    return {k: v / len(test_cases) for k, v in results.items()}
```

**Failure mode**: Data contamination. If evaluation samples appear in training data, metrics will be artificially inflated. Use strict data separation and report both in-distribution and out-of-distribution performance.
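
The Pass@k metric listed in the table is usually computed with the unbiased estimator from the code-generation literature instead of literally drawing k samples. A minimal sketch, where the function name and the example counts are illustrative:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    n: samples generated per problem
    c: samples that passed the unit tests
    k: budget of attempts being scored
    """
    # If fewer than k samples failed, any draw of k must include a pass
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples per problem, 3 passed
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

Averaging this value over all problems gives the benchmark score; generating more samples than k (n > k) reduces the variance of the estimate.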
  17. 17 · Adapter Management (20 min)

LoRA adapters are modular add-ons to a base model. Managing several of them requires systematic organization: version control, naming conventions, and loading mechanisms that prevent conflicts.

Each adapter consists of the learned A and B matrices plus metadata about training configuration, dataset, and purpose. Storing these together enables reproducibility and deployment flexibility.

```python
import json
from pathlib import Path

class AdapterRegistry:
    def __init__(self, registry_dir="./adapters"):
        self.registry_dir = Path(registry_dir)
        self.registry_file = self.registry_dir / "registry.json"
        self.adapters = self.load_registry()

    def load_registry(self):
        if self.registry_file.exists():
            with open(self.registry_file) as f:
                return json.load(f)
        return {"adapters": []}

    def register(self, adapter_id, adapter_path, metadata):
        entry = {
            "id": adapter_id,
            "path": str(adapter_path),
            "metadata": metadata,
        }
        self.adapters["adapters"].append(entry)
        self.save()

    def save(self):
        self.registry_dir.mkdir(parents=True, exist_ok=True)
        with open(self.registry_file, "w") as f:
            json.dump(self.adapters, f, indent=2)

    def list_adapters(self, filter_fn=None):
        adapters = self.adapters["adapters"]
        if filter_fn:
            return [a for a in adapters if filter_fn(a)]
        return adapters
```

**Loading multiple adapters** requires careful weight management:

```python
from peft import PeftModel

def load_adapter(base_model, adapter_path, adapter_name="default"):
    """Load a single adapter onto the base model."""
    model = PeftModel.from_pretrained(
        base_model,
        adapter_path,
        adapter_name=adapter_name,
    )
    return model

def load_multiple_adapters(base_model, adapter_configs):
    """Load multiple adapters, each with unique names."""
    first, *rest = adapter_configs
    # The first adapter wraps the base model in a PeftModel
    model = PeftModel.from_pretrained(
        base_model, first["path"], adapter_name=first["name"]
    )
    # Later adapters attach to the same PeftModel instead of re-wrapping it
    for config in rest:
        model.load_adapter(config["path"], adapter_name=config["name"])
    return model
```

**Adapter merging vs. switching**: Serving several behaviors from one base model means either merging adapter weights into the base (no adapter overhead at inference, but the combination is fixed) or switching between adapters at runtime (flexible, but it needs routing logic and keeps the small per-layer adapter overhead).

```python
# Switch between adapters without reloading base model
def switch_adapter(model, adapter_name):
    model.set_adapter(adapter_name)
    return model
```
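
One way to make the stored metadata useful for reproducibility is to derive a stable fingerprint from the training configuration and use it in the adapter ID. This is a hypothetical convention, not something PEFT or the registry in this chapter provides; the function name and the config fields are assumptions.

```python
import hashlib
import json


def config_fingerprint(config: dict, length: int = 12) -> str:
    """Short, order-independent hash of a training configuration."""
    # sort_keys makes the JSON text identical regardless of dict insertion order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Hypothetical training configs for illustration
config_a = {"base_model": "meta-llama/Llama-2-7b-hf", "r": 16, "lora_alpha": 32}
config_b = {"lora_alpha": 32, "r": 16, "base_model": "meta-llama/Llama-2-7b-hf"}
config_c = {"base_model": "meta-llama/Llama-2-7b-hf", "r": 8, "lora_alpha": 32}

print(config_fingerprint(config_a))
print(config_fingerprint(config_a) == config_fingerprint(config_b))
print(config_fingerprint(config_a) == config_fingerprint(config_c))
```

Two adapters trained with the same settings then share a fingerprint, and any change to rank, alpha or base model yields a different one.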
  18. 18 · Merging Adapters (20 min)

Adapter merging consolidates learned weights into the base model, eliminating runtime overhead from the adapter architecture. The merged model behaves identically to running with the adapter active but requires no special loading logic.

Merging matters most in deployment scenarios where inference latency counts. An unmerged adapter adds two small matrix multiplications (through A and B) to every adapted projection, and that overhead accumulates at scale.

```python
from peft import PeftModel

def merge_adapter_to_base(model, adapter_path, output_path):
    """
    Merge a LoRA adapter into its base model.
    After merging, the model contains the full weights.
    """
    # Load adapter on top of the base model
    model = PeftModel.from_pretrained(model, adapter_path)

    # Merge weights and drop the adapter wrappers
    merged_model = model.merge_and_unload()

    # Save merged model
    merged_model.save_pretrained(output_path)
    return merged_model
```

**Merging order matters** with multiple adapters. The mathematical composition depends on whether adapters should blend or replace each other:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Sequential merge: adapter_a + adapter_b = combined
def merge_multiple(base_model_path, adapter_paths, output_path):
    model = AutoModelForCausalLM.from_pretrained(base_model_path)
    for adapter_path in adapter_paths:
        model = PeftModel.from_pretrained(model, adapter_path)
        # Keep the merged result; merge_and_unload returns the plain model
        model = model.merge_and_unload()
    model.save_pretrained(output_path)
    return model
```

**Weighted merging** combines adapters by contribution:

```python
def weighted_merge(model, adapter_weights, combined_name="combined"):
    """
    Merge multiple adapters with weighted averaging.
    model: PeftModel with every adapter in adapter_weights already loaded
    adapter_weights: dict of {adapter_name: weight}
    """
    # Weighted combination in LoRA space via PEFT
    model.add_weighted_adapter(
        adapters=list(adapter_weights.keys()),
        weights=list(adapter_weights.values()),
        adapter_name=combined_name,
        combination_type="linear",  # requires all adapters to share the same rank
    )
    model.set_adapter(combined_name)
    return model.merge_and_unload()
```

**Failure mode**: Merging can cause numerical instability if adapter weights have different scales. Check that merged model outputs are numerically close to the adapter-active outputs before deployment.
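
The claim that a merged model matches the adapter-active model follows from the LoRA update rule, W' = W + (alpha / r) · B · A. The sketch below checks it on a tiny hand-sized example in plain Python; the matrices, `alpha` and `r` are made-up values for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (d_out x d_in)
A = [[0.5, -1.0]]              # LoRA A (rank x d_in), rank 1
B = [[2.0], [1.0]]             # LoRA B (d_out x rank)
alpha, r = 2, 1
scaling = alpha / r
x = [[1.0], [1.0]]             # input as a column vector

# Adapter active: base path plus the scaled low-rank path
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the low-rank update into the weight once, then one matmul
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

In floating point on real models the two paths agree only up to rounding, which is why the failure-mode check above compares outputs with a tolerance.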
  19. 19 · GGUF Conversion (20 min)

GGUF (GPT-Generated Unified Format) is a single-file model format designed for efficient inference that supports quantized weights. Converting fine-tuned adapters to GGUF allows deployment on CPU and low-memory systems while keeping the specialized behavior learned during fine-tuning.

The conversion pipeline merges adapters into the base model, then quantizes to the target precision. The resulting file can be memory-mapped for zero-copy loading.

```bash
# Install llama.cpp tools
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
make -j$(nproc)
```

```python
import shutil
import subprocess
from pathlib import Path

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

def convert_to_gguf(base_model_path, adapter_path, output_path, quantization="Q4_K_M"):
    """
    Convert a LoRA fine-tuned model to GGUF format.
    Steps:
    1. Merge adapter into base model
    2. Export to GGUF format
    3. Quantize to target precision
    """
    merged_path = Path("./tmp_merged")

    # Step 1: Merge
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
    adapter_model = PeftModel.from_pretrained(base_model, adapter_path)
    merged_model = adapter_model.merge_and_unload()
    merged_model.save_pretrained(merged_path)
    # The converter also needs the tokenizer files next to the weights
    AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)

    # Step 2: Export to GGUF using llama.cpp
    # Script and binary names vary by llama.cpp version; newer checkouts use
    # convert_hf_to_gguf.py and llama-quantize instead of convert.py and quantize.
    cmd = [
        "python", "./llama.cpp/convert.py", str(merged_path),
        "--outfile", f"{output_path}.fp16.gguf",
        "--outtype", "f16",
    ]
    subprocess.run(cmd, check=True)

    # Step 3: Quantize
    quantize_cmd = [
        "./llama.cpp/build/bin/quantize",
        f"{output_path}.fp16.gguf",
        f"{output_path}.{quantization}.gguf",
        quantization,
    ]
    subprocess.run(quantize_cmd, check=True)

    # Cleanup
    shutil.rmtree(merged_path)
    Path(f"{output_path}.fp16.gguf").unlink()
```

**Quantization levels** trade accuracy for size. The figures below are approximate and vary by model and llama.cpp version:

| Format | Size (7B) | Accuracy vs FP16 |
|--------|-----------|------------------|
| FP16 | ~14GB | Baseline |
| Q5_K_M | ~4.5GB | ~98% |
| Q4_K_M | ~3.8GB | ~97% |
| Q3_K_M | ~3.0GB | ~95% |
| Q2_K | ~2.7GB | ~93% |

```bash
# Manual conversion workflow
python llama.cpp/convert.py ./merged_model --outfile model_fp16.gguf --outtype f16
./llama.cpp/build/bin/quantize model_fp16.gguf model_q4_k_m.gguf Q4_K_M
```
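
A quick sanity check on any quantized file is to back-compute its effective bits per weight from the file size. The sketch below applies that to the approximate sizes in the table above, assuming 7e9 parameters and 1 GB = 1e9 bytes; the function name is illustrative.

```python
def effective_bits_per_weight(file_size_gb: float, n_params: float) -> float:
    """Average storage cost per parameter implied by a file size."""
    return file_size_gb * 1e9 * 8 / n_params


# Approximate 7B sizes copied from the table above
table_sizes_gb = {
    "FP16": 14.0,
    "Q5_K_M": 4.5,
    "Q4_K_M": 3.8,
    "Q3_K_M": 3.0,
    "Q2_K": 2.7,
}

for name, size_gb in table_sizes_gb.items():
    bpw = effective_bits_per_weight(size_gb, 7e9)
    print(f"{name}: {bpw:.2f} bits/weight")
```

The result is above the nominal bit width for the K-quant formats because they store per-block scales and keep some tensors at higher precision.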
  20. 20 · Inference with Fine-Tuned Models (20 min)

Deploying fine-tuned models requires different optimization priorities than training. Memory footprint, latency, and throughput dominate. The base model architecture and quantization level largely determine inference characteristics.

Loading patterns matter. Models can be loaded fully into memory, memory-mapped for zero-copy access, or streamed in chunks. Each approach trades startup time against operational memory.

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

class FineTunedInferenceEngine:
    def __init__(self, model_path, tokenizer_path=None, n_ctx=2048, n_gpu_layers=0):
        self.model_path = model_path
        self.n_ctx = n_ctx

        # Load model
        self.model = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,  # 0 = CPU only
            use_mmap=True,              # Memory mapping for large models
            use_mlock=False,            # Don't lock in RAM
        )

        # Load tokenizer (optional; llama.cpp tokenizes internally)
        self.tokenizer = (
            AutoTokenizer.from_pretrained(tokenizer_path) if tokenizer_path else None
        )

    def generate(self, prompt, max_tokens=256, temperature=0.7, top_p=0.9):
        """Generate text with standard sampling parameters."""
        return self.model(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            repeat_penalty=1.1,
            stop=["</s>", "User:", "\n\n\n"],
        )
```

**Batching strategies** strongly affect throughput. The high-level `Llama` call handles one prompt at a time, so the helper below simply loops; for high request volumes, a server with built-in batching is a better fit.

```python
def batch_generate(engine, prompts, max_tokens=128):
    """Generate for multiple prompts, one after another."""
    results = []
    for prompt in prompts:
        results.append(engine.generate(prompt, max_tokens=max_tokens))
    return results
```

**Failure mode**: Prompt injection in production. Fine-tuned models may be more susceptible to instruction following that bypasses safety measures. Always implement input validation and output filtering, and treat a string blocklist like the one below as a first line of defense only.

```python
class SafeInferenceEngine(FineTunedInferenceEngine):
    def generate(self, prompt, **kwargs):
        # Validate input (rough heuristic: about 4 characters per token)
        if len(prompt) > self.n_ctx * 4:
            raise ValueError("Input too long")
        if self.contains_dangerous_patterns(prompt):
            return {"error": "Unsafe prompt detected"}
        return super().generate(prompt, **kwargs)

    @staticmethod
    def contains_dangerous_patterns(prompt):
        # Compare in lowercase on both sides so "[INST]" is actually matched
        dangerous = ["<|system|>", "[inst]", "{{system"]
        return any(p in prompt.lower() for p in dangerous)
```
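
The length check above only looks at the prompt, but the context window also has to hold the generated tokens. A minimal sketch of a combined budget check, reusing the same rough 4-characters-per-token heuristic; the function name is hypothetical, and an exact count would need the model's tokenizer.

```python
def fits_context(prompt: str, max_tokens: int, n_ctx: int,
                 chars_per_token: float = 4.0) -> bool:
    """Estimate whether prompt plus requested output fits in the context window."""
    # Heuristic estimate; real token counts depend on the tokenizer and language
    estimated_prompt_tokens = len(prompt) / chars_per_token
    return estimated_prompt_tokens + max_tokens <= n_ctx


short_prompt = "a" * 4000   # about 1000 estimated tokens
long_prompt = "a" * 8000    # about 2000 estimated tokens

print(fits_context(short_prompt, max_tokens=256, n_ctx=2048))
print(fits_context(long_prompt, max_tokens=256, n_ctx=2048))
```

Rejecting or truncating requests that fail this check avoids silent truncation of the prompt at generation time.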
  21. 21 · Catastrophic Forgetting (20 min)

Catastrophic forgetting occurs when fine-tuning overwrites pre-trained capabilities. A model that excelled at general reasoning may lose those skills after training on a narrow domain. This is one of the fundamental challenges in transfer learning.

The mechanism: gradient updates during fine-tuning modify weights that were crucial for pre-trained behaviors. Without regularization toward the original function, the model drifts toward the fine-tuning distribution.

**Diagnosis**: Evaluate on both pre-training tasks and fine-tuning tasks before and after training.

```python
def diagnose_forgetting(before_model, after_model, benchmark_datasets, tokenizer, device):
    """
    Compare model performance before and after fine-tuning.
    benchmark_datasets maps a name to a DataLoader.
    Returns relative change per metric; positive means the value dropped,
    which is a regression for accuracy but an improvement for loss.
    """
    results = {}
    for dataset_name, dataset in benchmark_datasets.items():
        # Test original model
        orig_metrics = evaluate_model(before_model, dataset, tokenizer, device)
        # Test fine-tuned model
        new_metrics = evaluate_model(after_model, dataset, tokenizer, device)

        degradation = {
            metric: (orig_metrics[metric] - new_metrics[metric]) / orig_metrics[metric]
            for metric in orig_metrics
        }
        results[dataset_name] = {
            "before": orig_metrics,
            "after": new_metrics,
            "degradation_pct": degradation,
        }
    return results
```

**Mitigation strategies**:

**1. Regularization toward pre-trained weights:**

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, task_loss, alpha=0.5):
    """
    Combine task loss with knowledge distillation from the original model.
    teacher_logits come from the frozen pre-trained model on the same batch.
    """
    # KL divergence between teacher and student token distributions
    kl_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return task_loss + alpha * kl_loss
```

**2. Mixed fine-tuning with pre-training data:**

```python
import datasets

def create_mixed_dataset(task_data, pretrain_data, ratio=0.2):
    """Mix task-specific data with general pre-training data."""
    # Never request more samples than the pre-training set contains
    n_samples = min(int(len(task_data) * ratio), len(pretrain_data))
    sampled_pretrain = pretrain_data.shuffle().select(range(n_samples))
    return datasets.concatenate_datasets([task_data, sampled_pretrain])
```

**3. Smaller learning rate with more epochs:**

```python
# Conservative fine-tuning hyperparameters
hyperparameters = {
    "learning_rate": 5e-5,  # Lower than the 2e-4 often used for LoRA
    "warmup_ratio": 0.1,    # Gradual warmup
    "weight_decay": 0.1,    # Stronger regularization
    "num_epochs": 3,        # More epochs with lower LR
}
```
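
The KL term used for distillation can be worked through by hand on a single token position. The sketch below is plain Python with made-up logits; it shows that the penalty is zero when the fine-tuned model matches the original and grows as the distributions drift apart.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits over a 3-token vocabulary
teacher = softmax([2.0, 1.0, 0.1])
student_unchanged = softmax([2.0, 1.0, 0.1])
student_drifted = softmax([0.1, 1.0, 2.0])

kl_same = kl_divergence(teacher, student_unchanged)
kl_drift = kl_divergence(teacher, student_drifted)
print(f"unchanged: {kl_same:.4f}, drifted: {kl_drift:.4f}")
```

Tracking this value on a held-out general-purpose prompt set during training gives an early signal of drift before benchmark scores are available.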
  22. 22 · Multi-Task Fine-Tuning (20 min)

Multi-task fine-tuning trains a single model on multiple tasks simultaneously, potentially improving generalization and reducing deployment complexity. The challenge lies in balancing task contributions and preventing negative transfer, where one task degrades another.

The dataset format must accommodate task identifiers, or the model must learn to route based on input patterns. Each approach has tradeoffs in flexibility and complexity.

```python
from datasets import load_dataset
from torch.utils.data import Dataset

class MultiTaskDataset(Dataset):
    def __init__(self, task_configs):
        """
        task_configs: list of dicts with 'task_name', 'dataset', 'prompt_template'
        """
        self.datasets = []
        for config in task_configs:
            dataset = load_dataset(**config["dataset"])
            self.datasets.append({
                "name": config["task_name"],
                "dataset": dataset,
                "template": config["prompt_template"],
            })

    def __len__(self):
        return sum(len(ds["dataset"]) for ds in self.datasets)

    def __getitem__(self, idx):
        # Find which dataset contains this index
        for ds in self.datasets:
            if idx < len(ds["dataset"]):
                item = ds["dataset"][idx]
                return {
                    "task": ds["name"],
                    "input": ds["template"].format(**item),
                    "label": item.get("label", item.get("output")),
                }
            idx -= len(ds["dataset"])
        raise IndexError("index out of range")
```

**Task weighting strategies** address imbalance between large and small task datasets:

```python
import random

def create_balanced_batch(dataset, batch_size, temperature=2.0):
    """
    Sample tasks with temperature-based weighting.
    temperature=1 samples in proportion to dataset size; higher values
    flatten the distribution so small datasets are drawn more often.
    sample_from_task and collate_fn are assumed to be defined elsewhere.
    """
    task_sizes = {ds["name"]: len(ds["dataset"]) for ds in dataset.datasets}
    weights = {k: v ** (1 / temperature) for k, v in task_sizes.items()}
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}

    # Sample a task for each slot, then sample from that task's dataset
    tasks = list(weights.keys())
    probs = list(weights.values())
    chosen_tasks = random.choices(tasks, weights=probs, k=batch_size)

    # Build batch from the chosen tasks
    batch = []
    for task_name in chosen_tasks:
        sample = dataset.sample_from_task(task_name)
        batch.append(sample)
    return collate_fn(batch)
```

**Loss balancing** keeps dominant tasks from overwhelming small ones. The sketch below works at the loss level: it weights each task's loss and penalizes the spread between them.

```python
import torch

class GradientBalancer:
    def __init__(self, task_weights, alpha=0.5):
        self.task_weights = task_weights  # Dynamic weights, {task_name: weight}
        self.alpha = alpha                # Balancing strength

    def compute_balanced_loss(self, losses, task_names):
        """
        Weighted mean of per-task losses plus a penalty on their spread.
        losses: list of scalar tensors, one per task in task_names
        """
        stacked = torch.stack(losses)
        weights = torch.tensor(
            [self.task_weights[name] for name in task_names],
            dtype=stacked.dtype, device=stacked.device,
        )
        base_loss = (weights * stacked).sum() / weights.sum()
        # Standard deviation across tasks; zero when only one task is present
        spread = stacked.std() if len(losses) > 1 else stacked.new_zeros(())
        return base_loss + self.alpha * spread
```
  23. 23 · Domain Adaptation (20 min)

Domain adaptation tailors a pre-trained model to a specific domain's conventions, vocabulary, and reasoning patterns. Unlike task-specific fine-tuning (which adds new capabilities), domain adaptation refines existing capabilities for new contexts.

The approach depends on data availability and domain distance:

**Low-resource domain adaptation**:

```python
def adapt_with_retrieval_augmentation(model, domain_knowledge_base, query):
    """
    Augment generation with retrieved domain knowledge.
    No fine-tuning required.
    model and domain_knowledge_base are assumed wrappers exposing
    generate(prompt) and search(query, top_k) respectively.
    """
    retrieved_chunks = domain_knowledge_base.search(query, top_k=5)
    context = "\n".join(retrieved_chunks)
    prompt = f"""Based on the following domain knowledge:
{context}

Answer this query: {query}"""
    return model.generate(prompt)
```

**Moderate-resource adaptation** with LoRA:

```python
import datasets

def prepare_domain_dataset(domain_texts, tokenizer, block_size=512):
    """Format domain corpus for causal language modeling."""
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=block_size,
            padding="max_length",
            return_tensors=None,
        )

    dataset = datasets.Dataset.from_dict({"text": domain_texts})
    # Labels are not created here; a causal-LM data collator can add them
    dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
    dataset = dataset.train_test_split(test_size=0.1)
    return dataset
```

**High-resource adaptation** with full fine-tuning of last layers:

```python
import re

def extract_layer_number(name):
    """Return the block index from Llama-style names such as
    'model.layers.12.mlp.up_proj.weight', or None if there is none."""
    match = re.search(r"\.layers\.(\d+)\.", name)
    return int(match.group(1)) if match else None

def freeze_domain_adaptation(model, freeze_layers=24):
    """
    Freeze bottom layers, fine-tune top layers for domain.
    """
    for name, param in model.named_parameters():
        layer_num = extract_layer_number(name)
        if "embed" in name or (layer_num is not None and layer_num < freeze_layers):
            # Freeze embedding and early transformer layers
            param.requires_grad = False
        else:
            # Keep upper layers, final norm and lm_head trainable
            param.requires_grad = True

    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.1f}%)")
```
  24. Domain Fine-Tune Project

This chapter synthesizes the course content through a complete domain adaptation project: fine-tuning a model for legal document analysis. The project covers data preparation, LoRA configuration, training with gradient checkpointing, evaluation, and deployment preparation.

**Project goal**: Fine-tune a 7B parameter model to summarize and extract key provisions from contracts.

### Phase 1: Data Preparation

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def prepare_contract_dataset(output_path="./data/contracts"):
    """
    Load and format contract dataset.
    Expected format: {"text": contract_text, "summary": summary}
    """
    # Load raw contracts
    contracts = load_dataset("json", data_files="contracts_raw.jsonl", split="train")

    def format_contract(example):
        return {
            "text": f"### Contract\n{example['text']}\n\n### Summary\n",
            "summary": example["summary"],
        }

    # Columns returned by format_contract replace the originals of the same name
    contracts = contracts.map(format_contract, remove_columns=contracts.column_names)
    contracts.save_to_disk(output_path)
    return contracts


def tokenize_contracts(examples, tokenizer, max_length=2048):
    """Tokenize for causal language modeling (expects a batch of examples)."""
    # Join each prompt with its own summary. Using `+` on the two columns
    # would concatenate the lists rather than the strings.
    full_texts = [
        prompt + summary + tokenizer.eos_token
        for prompt, summary in zip(examples["text"], examples["summary"])
    ]
    result = tokenizer(
        full_texts,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    # Labels mirror input_ids for causal LM; padding is set to -100 so it is
    # ignored by the loss
    result["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(result["input_ids"], result["attention_mask"])
    ]
    return result
```

### Phase 2: LoRA Configuration

```python
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig


def get_trainable_stats(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


def create_contract_model(model_name="meta-llama/Llama-2-7b-hf"):
    """Initialize model with LoRA for contract analysis."""
    # Load base model in 4-bit (requires the bitsandbytes package)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        quantization_config=bnb_config,
    )

    # Prepare the quantized model for training and enable gradient
    # checkpointing for memory efficiency
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    # LoRA configuration optimized for 7B models
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "v_proj", "k_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    trainable_params, total_params = get_trainable_stats(model)
    print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.2f}%)")
    return model
```

### Phase 3: Training Loop

```python
import torch
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup


def train_contract_model(
    model,
    train_dataset,
    eval_dataset,
    output_dir="./contract_model",
    epochs=3,
    batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
):
    """Complete training pipeline for contract model."""
    accelerator = Accelerator(mixed_precision="fp16")

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        pin_memory=True,
    )
    eval_loader = DataLoader(eval_dataset, batch_size=batch_size)

    # Optimizer and scheduler (only the LoRA parameters require gradients)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable, lr=learning_rate, weight_decay=0.01)
    total_steps = len(train_loader) * epochs // gradient_accumulation_steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )

    # Prepare with accelerator
    model, optimizer, train_loader, eval_loader, scheduler = accelerator.prepare(
        model, optimizer, train_loader, eval_loader, scheduler
    )

    # Training loop
    for epoch in range(epochs):
        model.train()
        for step, batch in enumerate(train_loader):
            with accelerator.autocast():
                outputs = model(**batch)
                loss = outputs.loss / gradient_accumulation_steps
            accelerator.backward(loss)

            if (step + 1) % gradient_accumulation_steps == 0:
                accelerator.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            if step % 100 == 0:
                # Undo the accumulation scaling so the logged loss is per batch
                batch_loss = loss.item() * gradient_accumulation_steps
                print(f"Epoch {epoch}, Step {step}, Loss: {batch_loss:.4f}")

        # Evaluation
        model.eval()
        eval_loss = 0.0
        for batch in eval_loader:
            with torch.no_grad():
                outputs = model(**batch)
            eval_loss += outputs.loss.item()
        print(f"Epoch {epoch}: eval_loss={eval_loss/len(eval_loader):.4f}")

        # Save checkpoint
        accelerator.wait_for_everyone()
        unwrapped = accelerator.unwrap_model(model)
        unwrapped.save_pretrained(output_dir)


# Execute training (uses the Phase 1 and Phase 2 functions defined above)
if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

    contracts = prepare_contract_dataset()
    tokenized = contracts.map(
        lambda x: tokenize_contracts(x, tokenizer),
        batched=True,
        remove_columns=contracts.column_names,  # keep only tensor-ready columns
    )
    # The raw file has a single split, so hold out an evaluation set here.
    # The 10% fraction is an assumption; adjust it to the dataset size.
    splits = tokenized.train_test_split(test_size=0.1, seed=42).with_format("torch")

    model = create_contract_model()
    train_contract_model(model, splits["train"], splits["test"])
```

### Phase 4: Evaluation

```python
import torch
from rouge import Rouge  # pip install rouge


def evaluate_contract_model(model, tokenizer, test_cases):
    """
    Evaluate on held-out contract summaries.

    Metric: ROUGE-L F1. Factual consistency and completeness need a separate
    review step and are not computed here.
    """
    rouge = Rouge()
    rouge_l, lengths = [], []

    model.eval()
    for contract, reference in test_cases:
        # Generate summary
        prompt = f"### Contract\n{contract}\n\n### Summary\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=200)
        # Decode only the newly generated tokens, not the prompt
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        generated = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # ROUGE score
        scores = rouge.get_scores(generated, reference)
        rouge_l.append(scores[0]["rouge-l"]["f"])
        lengths.append(len(generated.split()))

    return {
        "rouge-l": sum(rouge_l) / len(rouge_l),
        "mean_length": sum(lengths) / len(lengths),
    }
```

### Phase 5: Export for Deployment

```python
import subprocess

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


def export_contract_model(adapter_path, output_path, base_model_name="meta-llama/Llama-2-7b-hf"):
    """Convert fine-tuned model to deployment-ready format."""
    # Merge adapter into a half-precision copy of the base model
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base_model, adapter_path)
    merged = model.merge_and_unload()

    # Save in standard format, with the tokenizer alongside the weights
    merged.save_pretrained(output_path)
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_path)

    # Export to GGUF for CPU inference. The script name depends on the
    # llama.cpp checkout; recent versions ship convert_hf_to_gguf.py instead.
    subprocess.run(
        [
            "python", "llama.cpp/convert.py",
            output_path,
            "--outfile", f"{output_path}.gguf",
            "--outtype", "f16",
        ],
        check=True,
    )
```
25 min
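The Phase 2 function prints a trainable-parameter count after wrapping the model with LoRA. As a sanity check on that number, the expected adapter size can be worked out by hand: each adapted linear layer gains two low-rank matrices, adding `r * (in_features + out_features)` parameters. The sketch below applies that to the seven target modules, assuming the published Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers). The constant and function names are illustrative only.

```python
# Assumed Llama-2-7B dimensions, taken from the published model config
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 11008
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "v_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_params_per_module(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


def lora_trainable_params(r=16):
    """Total adapter parameters across all decoder layers (bias='none')."""
    per_layer = sum(
        lora_params_per_module(i, o, r) for i, o in TARGET_SHAPES.values()
    )
    return per_layer * NUM_LAYERS


print(f"{lora_trainable_params(r=16):,}")  # 39,976,960
```

At `r=16` this gives roughly 40M adapter parameters, well under 1% of a 7B model. Doubling the rank doubles the count, which makes it easy to estimate adapter and optimizer memory before launching a run.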