DSPy Optimizers — Advanced Prompt Engineering (Chapter 9)

DSPy optimizers automatically tune the prompts of a DSPy program (its instructions and few-shot demonstrations), and some can also tune model weights. An optimizer takes a program (one or more modules), labeled training data, and an evaluation metric, then searches for the configuration that maximizes the metric on that data.

The basic pattern:

import dspy
from dspy.teleprompt import BootstrapFewShot

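# A language model must be configured before compiling or running, for example:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model name is a placeholder

# Assume we have a labeled dataset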
trainset = [
    dspy.Example(
        text="CEO John Morrison announced Q3 earnings exceeded expectations by 12%",
        name="John Morrison",
        role="CEO",
        organization="Unnamed Company"
    ).with_inputs('text'),
    # ... more examples
]

# The signature we want to optimize
class ExtractPersonInfo(dspy.Signature):
    text = dspy.InputField()
    name = dspy.OutputField()
    role = dspy.OutputField()
    organization = dspy.OutputField()

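# Optimizer setup. The metric checks name and role only; organization is ignored.
# trace defaults to None so the metric also works when called as metric(example, pred).
optimizer = BootstrapFewShot(
    metric=lambda example, pred, trace=None: (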
        pred.name.strip().lower() == example.name.strip().lower() and
        pred.role.strip().lower() == example.role.strip().lower()
    ),
    max_bootstrapped_demos=4,
    max_labeled_demos=4
)

# Create the module
extract = dspy.Predict(ExtractPersonInfo)

# Compile: optimize prompts for this specific task
compiled_extract = optimizer.compile(
    extract,
    trainset=trainset
)

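BootstrapFewShot runs the program over the training examples and keeps the runs whose outputs pass the metric as bootstrapped demonstrations, up to max_bootstrapped_demos. Remaining slots can be filled with raw labeled examples, bounded by max_labeled_demos. The selected demonstrations are then included in the prompt at inference time. This is a form of few-shot prompt optimization: the optimizer finds examples the program handles correctly and builds the in-context prompt from them.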
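
The selection step described above can be sketched in plain Python. This is a simplified illustration of the idea, not DSPy's implementation: bootstrap_demos, toy_program, and exact_match are hypothetical names, and toy_program stands in for a real language model call.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def bootstrap_demos(
    trainset: List[Example],
    program: Callable[[str], Dict[str, str]],
    metric: Callable[[Example, Dict[str, str]], bool],
    max_demos: int = 4,
) -> List[Example]:
    """Keep training examples whose predictions pass the metric, up to max_demos."""
    demos: List[Example] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        try:
            pred = program(example["text"])
        except Exception:
            continue  # a failed call simply yields no demonstration
        if metric(example, pred):
            # Store the input together with the program's own verified output.
            demos.append({"text": example["text"], **pred})
    return demos


def toy_program(text: str) -> Dict[str, str]:
    """Hypothetical stand-in for an LM call: first word is the role, next two the name."""
    words = text.split()
    return {"name": " ".join(words[1:3]), "role": words[0]}


def exact_match(example: Example, pred: Dict[str, str]) -> bool:
    return (
        pred["name"].strip().lower() == example["name"].strip().lower()
        and pred["role"].strip().lower() == example["role"].strip().lower()
    )


trainset = [
    {"text": "CEO John Morrison announced Q3 earnings", "name": "John Morrison", "role": "CEO"},
    {"text": "The board thanked CFO Ana Ruiz", "name": "Ana Ruiz", "role": "CFO"},
    {"text": "CTO Priya Nair presented the roadmap", "name": "Priya Nair", "role": "CTO"},
]

demos = bootstrap_demos(trainset, toy_program, exact_match, max_demos=4)
print([d["name"] for d in demos])  # ['John Morrison', 'Priya Nair']
```

The second example is dropped because the toy program gets it wrong, which mirrors the filtering role of the metric: only runs that pass become demonstrations.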

For complex programs, BootstrapFewShotWithRandomSearch explores multiple configurations:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

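# More thorough optimizer. eval_metric and program are assumed to be defined
# elsewhere (a metric function and the DSPy module to optimize).
# Recent DSPy releases name the candidate count num_candidate_programs and take
# no random_key argument; confirm against your installed version.
optimizer = BootstrapFewShotWithRandomSearch(
    metric=eval_metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=8,
    num_candidate_programs=7
)

# valset is optional; without it, candidates are scored on the training set
compiled = optimizer.compile(program, trainset=trainset)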

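The optimizer builds and scores about 7 candidate programs (some implementations add a few baseline candidates, such as the unoptimized program) and keeps the one with the highest metric score on the validation set. If no validation set is passed to compile, the training set is used for scoring, so the selection score is not a held-out estimate. More candidates mean broader exploration at higher compute cost.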
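
The candidate search can be sketched in plain Python as well. This is a minimal illustration under stated assumptions, not the library's code: random_search and toy_score are hypothetical, and toy_score stands in for running the program with a given set of demonstrations and averaging the metric over a validation set.

```python
import random
from typing import Callable, List, Sequence, Tuple


def random_search(
    demo_pool: Sequence[str],
    score: Callable[[List[str]], float],
    num_candidates: int = 7,
    demos_per_candidate: int = 2,
    seed: int = 42,
) -> Tuple[List[str], float]:
    """Sample candidate demo subsets with a seeded RNG and return the best-scoring one."""
    rng = random.Random(seed)  # local RNG so the search is repeatable
    k = min(demos_per_candidate, len(demo_pool))
    best_demos: List[str] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        candidate = rng.sample(list(demo_pool), k)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_demos, best_score = candidate, candidate_score
    return best_demos, best_score


HELPFUL = {"demo_a", "demo_c"}  # hypothetical: the demos that happen to help


def toy_score(demos: List[str]) -> float:
    """Hypothetical stand-in for evaluating a program built from these demos."""
    if not demos:
        return 0.0
    return sum(d in HELPFUL for d in demos) / len(demos)


pool = ["demo_a", "demo_b", "demo_c", "demo_d"]
best_demos, best_score = random_search(pool, toy_score, num_candidates=7, seed=42)
print(best_demos, best_score)
```

Each extra candidate costs one more full evaluation pass, which is where the compute cost of a larger search comes from.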

Understanding optimizer limits:

Data dependency: Optimizers require labeled training data, and the quality of the result depends on the quality of the labels. Noisy labels produce prompts optimized to fit the noise.
Generalization: The optimizer tunes prompts for the distribution present in the training data. Under distribution shift (inputs that differ from the training inputs), performance may degrade, and overfitting to the training distribution is possible.
Reproducibility: Searching across candidate configurations involves random sampling, and language model outputs can themselves vary between calls. With identical data, seed, and model, and with cached or deterministic model responses, results should be reproducible. Without those controls, expect run-to-run variation.
Model coupling: Optimized prompts may be coupled to the model used during optimization. Demonstrations that help one model may not help another, so switching models usually warrants re-optimizing and re-evaluating. This coupling is easy to overlook.
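
One part of the reproducibility point is easy to control directly: the data split. The sketch below uses a hypothetical helper, seeded_split, to show that a fixed seed gives the same train and evaluation sets on every run. It only covers the data side and does nothing about variation in language model outputs.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def seeded_split(
    items: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of items with a fixed seed, then cut it into train and eval parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG; global random state is untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


dataset = list(range(10))  # placeholder for a list of labeled examples
train_a, eval_a = seeded_split(dataset, seed=7)
train_b, eval_b = seeded_split(dataset, seed=7)
print(train_a == train_b, eval_a == eval_b)  # True True
```

Recording the seed alongside the compiled program makes it possible to rebuild the exact split later.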

# Practical workflow: optimize, then evaluate on held-out data

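# Split data 80/20. dataset is assumed to be a list of dspy.Example objects;
# shuffle it first if its order is not already random.
split = int(0.8 * len(dataset))
train_split = dataset[:split]
eval_split = dataset[split:]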

# Optimize on training split
compiled = optimizer.compile(program, trainset=train_split)

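# Evaluate on held-out split, using the same criterion as the basic-pattern metric
# (name and role must both match) so the scores are comparable
correct = 0
for example in eval_split:
    pred = compiled(text=example.text)
    if (pred.name.strip().lower() == example.name.strip().lower()
            and pred.role.strip().lower() == example.role.strip().lower()):
        correct += 1

# Guard against an empty evaluation split
accuracy = correct / len(eval_split) if eval_split else 0.0
print(f"Held-out accuracy: {accuracy:.1%}")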

Evaluating after optimization is essential. The scores seen during optimization are computed on the training set and may not reflect generalization. A program with 100% training accuracy and 60% held-out accuracy is overfitting: the prompts have been tuned to artifacts of the training data.
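
The comparison between training and held-out accuracy can be made explicit with a small report. The sketch below is framework-independent and uses hypothetical names throughout (accuracy, generalization_report, memorizing_program). The memorizing program is a deliberately extreme stand-in for an overfit prompt.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Program = Callable[[str], Dict[str, str]]
Metric = Callable[[Example, Dict[str, str]], bool]


def accuracy(program: Program, examples: List[Example], metric: Metric) -> float:
    """Fraction of examples whose prediction passes the metric (0.0 for an empty list)."""
    if not examples:
        return 0.0
    hits = sum(1 for ex in examples if metric(ex, program(ex["text"])))
    return hits / len(examples)


def generalization_report(
    program: Program, train: List[Example], heldout: List[Example], metric: Metric
) -> Dict[str, float]:
    """Report training accuracy, held-out accuracy, and the gap between them."""
    train_acc = accuracy(program, train, metric)
    heldout_acc = accuracy(program, heldout, metric)
    return {"train": train_acc, "heldout": heldout_acc, "gap": train_acc - heldout_acc}


def name_match(example: Example, pred: Dict[str, str]) -> bool:
    return pred["name"].strip().lower() == example["name"].strip().lower()


train = [
    {"text": "CEO John Morrison spoke", "name": "John Morrison"},
    {"text": "CFO Ana Ruiz resigned", "name": "Ana Ruiz"},
]
heldout = [{"text": "CTO Priya Nair presented", "name": "Priya Nair"}]

# Hypothetical worst case: a "program" that has only memorized the training inputs.
memory = {ex["text"]: ex["name"] for ex in train}


def memorizing_program(text: str) -> Dict[str, str]:
    return {"name": memory.get(text, "")}


report = generalization_report(memorizing_program, train, heldout, name_match)
print(report)  # {'train': 1.0, 'heldout': 0.0, 'gap': 1.0}
```

A large positive gap is the signal to look for: it suggests the selected demonstrations fit the training examples rather than the task.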