09. DSPy Optimizers
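DSPy optimizers (historically called teleprompters, hence the dspy.teleprompt import path) tune a program's prompts automatically, chiefly its instructions and few-shot demonstrations; some optimizers can also adjust model weights. An optimizer takes a program (one or more composed modules), labeled training examples, and an evaluation metric, and searches for a configuration that scores well on that metric.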
The basic pattern:
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes a language model has already been configured, e.g. dspy.configure(lm=...)

# Assume we have a labeled dataset
trainset = [
    dspy.Example(
        text="CEO John Morrison announced Q3 earnings exceeded expectations by 12%",
        name="John Morrison",
        role="CEO",
        organization="Unnamed Company",
    ).with_inputs("text"),
    # ... more examples
]

# The signature we want to optimize
class ExtractPersonInfo(dspy.Signature):
    text = dspy.InputField()
    name = dspy.OutputField()
    role = dspy.OutputField()
    organization = dspy.OutputField()

# Optimizer setup
# The metric takes (example, pred, trace); trace defaults to None so the same
# callable also works when invoked with only two arguments during evaluation.
# It scores name and role only; organization is not checked.
optimizer = BootstrapFewShot(
    metric=lambda example, pred, trace=None: (
        pred.name.strip().lower() == example.name.strip().lower()
        and pred.role.strip().lower() == example.role.strip().lower()
    ),
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)

# Create the module
extract = dspy.Predict(ExtractPersonInfo)

# Compile: optimize prompts for this specific task
compiled_extract = optimizer.compile(
    extract,
    trainset=trainset,
)
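The inline lambda is compact but hard to test and reuse. As a minimal sketch, the same check can be written as a named function; it is called eval_metric here to match the name the later random-search snippet passes as its metric. The normalize helper and the SimpleNamespace stand-ins are illustrative assumptions, not part of DSPy; real calls would receive dspy.Example and prediction objects exposing the same attributes.

```python
from types import SimpleNamespace


def normalize(value):
    """Lowercase and strip a field so formatting differences do not count as errors."""
    return str(value).strip().lower()


def eval_metric(example, pred, trace=None):
    """Return True when the predicted name and role both match the labels.

    `trace` defaults to None so the function can be called with two
    arguments (plain evaluation) or three (during optimization).
    """
    return (
        normalize(pred.name) == normalize(example.name)
        and normalize(pred.role) == normalize(example.role)
    )


# Stand-in objects with the same attribute names as the chapter's examples.
gold = SimpleNamespace(name="John Morrison", role="CEO")
good_pred = SimpleNamespace(name=" john morrison ", role="ceo")
bad_pred = SimpleNamespace(name="John Morrison", role="CFO")

match_result = eval_metric(gold, good_pred)
mismatch_result = eval_metric(gold, bad_pred, trace=None)
```

Because the function is ordinary Python, it can be unit-tested on hand-written cases before any language model calls are spent on optimization.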
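BootstrapFewShot runs the program over the training set and keeps the examples it handles correctly according to the metric, up to max_bootstrapped_demos. These bootstrapped demonstrations, which may be supplemented with raw labeled examples as governed by max_labeled_demos, are then included in the prompt at inference time. This is a form of few-shot prompt optimization: the optimizer builds the in-context prompt from examples the program is known to solve. Note that the filter checks only that a demonstration passed the metric, not that including it improves accuracy on other inputs.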
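The selection step can be illustrated without a language model. The following is a conceptual sketch only, not DSPy's implementation: bootstrap_demos, toy_program, and name_metric are hypothetical names, and the toy program simply guesses that the second and third words of the text form the name.

```python
from types import SimpleNamespace


def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=4):
    """Collect up to `max_bootstrapped_demos` examples the program already gets right."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        pred = program(example.text)
        # Only predictions that pass the metric become demonstrations.
        if metric(example, pred):
            demos.append({"text": example.text, "name": pred.name})
    return demos


def toy_program(text):
    """Hypothetical stand-in for an LM call: guess that words 2-3 are the name."""
    words = text.split()
    return SimpleNamespace(name=" ".join(words[1:3]))


def name_metric(example, pred, trace=None):
    """Case-insensitive exact match on the name field."""
    return pred.name.strip().lower() == example.name.strip().lower()


toy_trainset = [
    SimpleNamespace(text="CEO John Morrison announced Q3 earnings", name="John Morrison"),
    SimpleNamespace(text="Yesterday the board met CFO Ana Ruiz", name="Ana Ruiz"),
    SimpleNamespace(text="CTO Priya Nair joined the panel", name="Priya Nair"),
]

demos = bootstrap_demos(toy_program, toy_trainset, name_metric, max_bootstrapped_demos=4)
```

Here the second training example is dropped because the toy program gets it wrong, so only two demonstrations survive. The same filtering is why noisy labels matter: an example whose label is wrong can only become a demonstration if the program reproduces the wrong answer.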
For complex programs, BootstrapFewShotWithRandomSearch explores multiple configurations:
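from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Assumes `eval_metric` is a metric function with the signature
# (example, pred, trace=None) and `program` is the dspy.Module to optimize.

# More thorough optimizer
optimizer = BootstrapFewShotWithRandomSearch(
    metric=eval_metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=8,
    num_candidate_programs=7,  # number of candidate programs to build and score
)

compiled = optimizer.compile(program, trainset=trainset)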
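The optimizer builds and scores the requested number of candidate programs (7 here), each with a different set of demonstrations, and keeps the one with the highest score on the validation set. If compile is called without a separate valset, candidates are scored on the training set itself, so pass valset when selection should use held-out examples. More candidates mean broader exploration at higher compute cost.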
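The search loop itself is simple to sketch. The code below is an illustration under stated assumptions, not DSPy's implementation: random_search, toy_score, and the parameters num_candidates, demos_per_candidate, and seed belong to this sketch only, and the scorer is a toy that rewards demonstration sets covering the roles seen in the validation data.

```python
import random


def random_search(demo_pool, score_fn, valset, num_candidates=7, demos_per_candidate=2, seed=0):
    """Sample candidate demo subsets and return the best (score, demos) pair."""
    rng = random.Random(seed)  # a fixed seed makes the sampled candidates repeatable
    best_score, best_demos = float("-inf"), None
    for _ in range(num_candidates):
        candidate = rng.sample(demo_pool, demos_per_candidate)
        score = score_fn(candidate, valset)
        if score > best_score:
            best_score, best_demos = score, candidate
    return best_score, best_demos


def toy_score(candidate, valset):
    """Hypothetical scorer: fraction of validation roles covered by the demos."""
    covered = {demo["role"] for demo in candidate}
    return sum(item["role"] in covered for item in valset) / len(valset)


demo_pool = [
    {"id": 1, "role": "CEO"},
    {"id": 2, "role": "CFO"},
    {"id": 3, "role": "CTO"},
    {"id": 4, "role": "CEO"},
]
valset = [{"role": "CEO"}, {"role": "CFO"}, {"role": "CFO"}, {"role": "CTO"}]

best_score, best_demos = random_search(demo_pool, toy_score, valset, num_candidates=7, seed=0)
repeat_score, repeat_demos = random_search(demo_pool, toy_score, valset, num_candidates=7, seed=0)
```

In a real run each call to the scorer means executing the candidate program on every validation example, which is where the compute cost of additional candidates comes from.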
Understanding optimizer limits:
Data dependency: Optimizers require labeled training data. The quality of optimization depends on label quality. Noisy labels produce prompts optimized to fit noise.
Generalization: The optimizer tunes prompts for the distribution present in training data. Distribution shift (different inputs than training) may degrade performance. Overfitting to training distribution is possible.
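Reproducibility: Searching across candidate configurations introduces randomness through shuffling and sampling of demonstrations, and the language model's own sampling adds more. Results are repeatable only when the data, its order, any random seeds, and the model outputs (for example through caching or deterministic decoding) are all held fixed. Without that control, run-to-run variation is expected.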
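Model coupling: Optimized prompts may be coupled to the model used during optimization, because the demonstrations were selected based on that model's behavior. After switching models, re-evaluate on held-out data at minimum and expect to re-optimize. This dependency is easy to overlook.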
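One part of the reproducibility concern that is fully under your control is the data split. A minimal sketch, assuming a hypothetical helper named split_dataset (not a DSPy function), shuffles a copy of the dataset with a fixed seed so that the same examples land in the training and held-out sets on every run:

```python
import random


def split_dataset(dataset, n_eval, seed=0):
    """Shuffle a copy of `dataset` with a fixed seed and hold out `n_eval` items.

    Returns (train_split, eval_split). The input list is not modified.
    """
    if not 0 < n_eval < len(dataset):
        raise ValueError("n_eval must be between 1 and len(dataset) - 1")
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]


records = [f"example-{i}" for i in range(100)]
train_a, eval_a = split_dataset(records, n_eval=20, seed=42)
train_b, eval_b = split_dataset(records, n_eval=20, seed=42)
```

Shuffling before splitting also avoids a held-out set that differs systematically from the training set when the dataset was collected in some meaningful order.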
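# Practical workflow: optimize, then evaluate on held-out data
# Assumes `dataset` is a list of more than 80 labeled dspy.Example objects,
# and that `optimizer` and `program` are defined as in the earlier snippets.

# Split data
train_split = dataset[:80]
eval_split = dataset[80:]

# Optimize on training split
compiled = optimizer.compile(program, trainset=train_split)

# Evaluate on held-out split (this check scores the name field only)
correct = 0
for example in eval_split:
    pred = compiled(text=example.text)
    if pred.name.strip().lower() == example.name.strip().lower():
        correct += 1

total = len(eval_split)
accuracy = correct / total if total else 0.0  # guard against an empty split
print(f"Held-out accuracy: {accuracy:.1%} ({correct}/{total})")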
Evaluating after optimization is essential. The scores an optimizer sees during compilation are computed on the data it was given and may not reflect generalization. A program with 100% training accuracy and 60% held-out accuracy is overfitting: its prompts have been tuned to artifacts of the training data.
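The train versus held-out comparison can be made explicit with a small report. This is a sketch with hypothetical helpers (accuracy, generalization_gap) and a deliberately contrived program that has memorized its training inputs; in practice the program argument would be the compiled DSPy module and the metric the one used for optimization.

```python
def accuracy(program, examples, metric):
    """Fraction of examples for which metric(example, prediction) is truthy."""
    if not examples:
        raise ValueError("cannot score an empty example list")
    hits = sum(bool(metric(ex, program(ex["text"]))) for ex in examples)
    return hits / len(examples)


def generalization_gap(program, train_examples, heldout_examples, metric):
    """Report training accuracy, held-out accuracy, and their difference."""
    train_acc = accuracy(program, train_examples, metric)
    heldout_acc = accuracy(program, heldout_examples, metric)
    return {"train": train_acc, "heldout": heldout_acc, "gap": train_acc - heldout_acc}


# Contrived program: perfect on memorized inputs, a crude guess elsewhere.
memorized = {
    "CEO John Morrison spoke": "John Morrison",
    "The board named Ana Ruiz": "Ana Ruiz",
}


def memorizing_program(text):
    fallback = " ".join(text.split()[:2])  # guess: the first two words are the name
    return {"name": memorized.get(text, fallback)}


def name_match(example, pred):
    return pred["name"].strip().lower() == example["name"].strip().lower()


train_examples = [{"text": t, "name": n} for t, n in memorized.items()]
heldout_examples = [
    {"text": "Priya Nair joined as CTO", "name": "Priya Nair"},
    {"text": "The panel welcomed Lena Ortiz", "name": "Lena Ortiz"},
]

report = generalization_gap(memorizing_program, train_examples, heldout_examples, name_match)
```

Running the same report for the uncompiled program gives the baseline: optimization helped only if held-out accuracy rose, not merely training accuracy.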
Take a signature from the previous chapter and create a minimal labeled dataset (10-20 examples), setting a few examples aside as held-out data. Run BootstrapFewShot with max_bootstrapped_demos=3 and evaluate on the held-out examples. Compare accuracy to the uncompiled baseline. Determine whether optimization helped and whether the improvement is likely to generalize.