23. Domain Adaptation
Chapter 23 of 24 · 20 min
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
EXERCISE
: Domain Vocabulary Analysis
def analyze_domain_vocabulary(domain_texts, base_vocab):
"""Identify domain-specific terms missing from base vocabulary."""
from collections import Counter
import re
# Tokenize and count
all_tokens = []
for text in domain_texts:
all_tokens.extend(re.findall(r'\b\w+\b', text.lower()))
freq = Counter(all_tokens)
# Find terms absent from base vocab or rare
oov_terms = []
for term, count in freq.most_common(5000):
if term not in base_vocab and count > 50:
oov_terms.append((term, count))
return oov_terms