23. Domain Adaptation

Chapter 23 of 24 · 20 min

KEY INSIGHT

Domain adaptation tailors a pre-trained model to a specific domain's conventions, vocabulary, and reasoning patterns. Unlike task-specific fine-tuning (which adds new capabilities), domain adaptation refines existing capabilities for new contexts. The approach depends on data availability and domain distance: **Low-resource domain adaptation**: ```python def adapt_with_retrieval_augmentation(model, domain_knowledge_base, query): """ Augment generation with retrieved domain knowledge. No fine-tuning required. """ retrieved_chunks = domain_knowledge_base.search(query, top_k=5) context = "\n".join(retrieved_chunks) prompt = f"""Based on the following domain knowledge: {context} Answer this query: {query}""" return model.generate(prompt) ``` **Moderate-resource adaptation** with LoRA: ```python def prepare_domain_dataset(domain_texts, tokenizer, block_size=512): """Format domain corpus for causal language modeling.""" def tokenize_function(examples): return tokenizer( examples["text"], truncation=True, max_length=block_size, padding="max_length", return_tensors=None ) dataset = datasets.Dataset.from_dict({"text": domain_texts}) dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"]) dataset = dataset.train_test_split(test_size=0.1) return dataset ``` **High-resource adaptation** with full fine-tuning of last layers: ```python def freeze_domain_adaptation(model, freeze_layers=24): """ Freeze bottom layers, fine-tune top layers for domain. """ # Freeze embedding and early transformer layers for name, param in model.named_parameters(): layer_num = extract_layer_number(name) if layer_num is not None and layer_num < freeze_layers: param.requires_grad = False elif "lm_head" in name or "layer_norm" in name: # Keep output layers trainable param.requires_grad = True else: param.requires_grad = True trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.1f}%)") ```

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

: Domain Vocabulary Analysis

def analyze_domain_vocabulary(domain_texts, base_vocab):
    """Identify domain-specific terms missing from base vocabulary."""
    from collections import Counter
    import re
    
    # Tokenize and count
    all_tokens = []
    for text in domain_texts:
        all_tokens.extend(re.findall(r'\b\w+\b', text.lower()))
    
    freq = Counter(all_tokens)
    
    # Find terms absent from base vocab or rare
    oov_terms = []
    for term, count in freq.most_common(5000):
        if term not in base_vocab and count > 50:
            oov_terms.append((term, count))
    
    return oov_terms