RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced Prompt Engineering

09. DSPy Optimizers

Chapter 9 of 18 · 20 min
KEY INSIGHT

DSPy optimizers search for prompt configurations that maximize an evaluation metric on training data. The critical caveat: an optimized prompt may overfit to the training distribution.

DSPy optimizers tune prompts and language model configurations automatically. The optimizer receives a program (a sequence of modules), training data with labels, and an evaluation metric. It searches for the configuration that maximizes the metric.

The basic pattern:

import dspy
from dspy.teleprompt import BootstrapFewShot

# Assume we have a labeled dataset
trainset = [
    dspy.Example(
        text="CEO John Morrison announced Q3 earnings exceeded expectations by 12%",
        name="John Morrison",
        role="CEO",
        organization="Unnamed Company"
    ).with_inputs('text'),
    # ... more examples
]

# The signature we want to optimize
class ExtractPersonInfo(dspy.Signature):
    text = dspy.InputField()
    name = dspy.OutputField()
    role = dspy.OutputField()
    organization = dspy.OutputField()

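# Optimizer setup. The metric takes (example, pred, trace); giving trace a
# default of None lets the same function also be called with two arguments.
optimizer = BootstrapFewShot(
    metric=lambda example, pred, trace=None: (
        pred.name.strip().lower() == example.name.strip().lower() and
        pred.role.strip().lower() == example.role.strip().lower()
    ),
    max_bootstrapped_demos=4,  # demos generated by running the program
    max_labeled_demos=4        # demos taken directly from labeled examples
)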

# Create the module
extract = dspy.Predict(ExtractPersonInfo)

# Compile: optimize prompts for this specific task
compiled_extract = optimizer.compile(
    extract,
    trainset=trainset
)

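BootstrapFewShot generates demonstrations from the training set: it runs the program on training examples, keeps the runs whose outputs pass the metric, and inserts those as in-context examples at inference time. This is a form of few-shot prompt optimization. The optimizer does not rewrite instructions; it selects metric-passing demonstrations (up to max_bootstrapped_demos, plus up to max_labeled_demos raw labeled examples) and builds the in-context prompt from them.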
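The bootstrapping idea can be pictured with a short, library-free sketch. This is not DSPy's implementation: `bootstrap_demos`, `exact_match`, and `toy_program` are hypothetical names, and the toy program is a string-splitting stand-in for a language model call.

```python
from types import SimpleNamespace


def same_text(a, b):
    """Compare two strings, ignoring case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()


def exact_match(example, pred, trace=None):
    """Metric: True when both name and role match the label."""
    return same_text(pred.name, example.name) and same_text(pred.role, example.role)


def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Run the program on training examples; keep the runs that pass the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        pred = program(example.text)
        if metric(example, pred):
            # A passing run becomes an in-context demonstration.
            demos.append(SimpleNamespace(text=example.text, name=pred.name, role=pred.role))
    return demos


# Hypothetical stand-in for an LM-backed module: assumes "<role> <first> <last> ..." input.
def toy_program(text):
    words = text.split()
    return SimpleNamespace(role=words[0], name=" ".join(words[1:3]))


trainset = [
    SimpleNamespace(text="CEO John Morrison announced Q3 earnings", name="John Morrison", role="CEO"),
    SimpleNamespace(text="Yesterday CFO Ana Lima resigned", name="Ana Lima", role="CFO"),
    SimpleNamespace(text="CTO Wei Chen presented the roadmap", name="Wei Chen", role="CTO"),
]

demos = bootstrap_demos(toy_program, trainset, exact_match)
print([d.name for d in demos])
```

The second example is dropped because the toy program gets it wrong, which mirrors the point above: only runs that pass the metric become demonstrations, so a weak metric lets weak demonstrations through.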

For complex programs, BootstrapFewShotWithRandomSearch explores multiple configurations:

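from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# More thorough optimizer. `eval_metric`, `program`, and `valset` are assumed
# to be defined elsewhere (for example, the metric and module from the basic
# pattern, plus a held-out list of dspy.Example objects).
optimizer = BootstrapFewShotWithRandomSearch(
    metric=eval_metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=8,
    # To our knowledge the argument is num_candidate_programs, and the
    # constructor takes no seed argument; check your installed version.
    num_candidate_programs=7,
)

# Pass valset so candidates are scored on examples not used for bootstrapping.
compiled = optimizer.compile(program, trainset=trainset, valset=valset)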

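The optimizer explores the requested number of candidate configurations (7 here) and selects the one that scores best on the validation set. Pass held-out examples as valset; in the DSPy versions we are aware of, omitting it makes the optimizer score candidates on the training set itself, which hides overfitting. More candidates mean better exploration at higher compute cost, since every candidate is evaluated in full.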
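The candidate search can also be sketched without DSPy. This is an illustration under stated assumptions, not the library's algorithm: `random_search`, `build_candidate`, and `score` are hypothetical names, and the toy "program" only answers correctly for labels it has seen among its demonstrations.

```python
import random


def random_search(trainset, valset, build_candidate, score,
                  num_candidates=7, demos_per_candidate=4):
    """Try several seeded demo subsets; return (best score, seed, demos) on valset."""
    best = None
    for seed in range(num_candidates):
        rng = random.Random(seed)  # one fixed seed per candidate keeps the search repeatable
        demos = rng.sample(trainset, k=min(demos_per_candidate, len(trainset)))
        candidate = build_candidate(demos)
        value = sum(score(candidate, ex) for ex in valset) / len(valset)
        if best is None or value > best[0]:
            best = (value, seed, demos)
    return best


def build_candidate(demos):
    """Toy program: knows a label only if some demonstration carried it."""
    known = {d["y"] for d in demos}

    def predict(x):
        label = "even" if x % 2 == 0 else "odd"
        return label if label in known else "unknown"

    return predict


def score(candidate, ex):
    return 1.0 if candidate(ex["x"]) == ex["y"] else 0.0


data = [{"x": i, "y": "even" if i % 2 == 0 else "odd"} for i in range(14)]
trainset, valset = data[:10], data[10:]

best_score, best_seed, best_demos = random_search(trainset, valset, build_candidate, score)
print(best_score, best_seed)
```

Cost grows linearly with `num_candidates`, because each candidate is scored on the whole validation set. With a real language model that means one model call per candidate per validation example.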

Understanding optimizer limits:

  1. Data dependency: Optimizers require labeled training data. The quality of optimization depends on label quality. Noisy labels produce prompts optimized to fit noise.

  2. Generalization: The optimizer tunes prompts for the distribution present in the training data. Under distribution shift (inputs that differ from the training data), performance may degrade. Overfitting to the training distribution is possible.

  3. Reproducibility: Searching across candidate configurations introduces randomness, and so does sampling from the language model. With identical data, a fixed random seed, and deterministic decoding or cached responses, results should be reproducible. Without those controls, expect run-to-run variation.

  4. Model coupling: Demonstrations and prompts are selected using one model's behavior, so they may not transfer to another model. Plan to re-run optimization and held-out evaluation whenever you switch models; this dependency is easy to miss because it is not always documented clearly.
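For the reproducibility point, one piece you control directly is how the data is shuffled and split. A minimal sketch, assuming a plain Python list as the dataset; `seeded_split` is a hypothetical helper, and it does nothing about randomness inside the language model itself.

```python
import random


def seeded_split(dataset, train_fraction=0.8, seed=0):
    """Shuffle a copy with a fixed seed, then cut it into train and held-out parts."""
    items = list(dataset)               # copy so the caller's list is left untouched
    random.Random(seed).shuffle(items)  # private RNG: no dependence on global random state
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


dataset = list(range(100))
train_a, eval_a = seeded_split(dataset, seed=42)
train_b, eval_b = seeded_split(dataset, seed=42)

print(train_a == train_b and eval_a == eval_b)  # same seed, same split
print(len(train_a), len(eval_a))
```

Recording the seed alongside the compiled program makes it possible to rebuild the exact train and held-out sets later.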

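# Practical workflow: optimize, then evaluate on held-out data.
# `dataset`, `optimizer`, and `program` are assumed to exist already.

# Split data (assumes more than 80 examples; shuffle first if the data is ordered)
train_split = dataset[:80]
eval_split = dataset[80:]

# Optimize on training split
compiled = optimizer.compile(program, trainset=train_split)

# Evaluate on held-out split, scoring the same fields as the optimization metric
def matches(a, b):
    return a.strip().lower() == b.strip().lower()

correct = 0
for example in eval_split:
    pred = compiled(text=example.text)
    if matches(pred.name, example.name) and matches(pred.role, example.role):
        correct += 1

# Guard against an empty held-out split
accuracy = correct / len(eval_split) if eval_split else 0.0
print(f"Held-out accuracy: {accuracy:.1%} ({correct}/{len(eval_split)})")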

Evaluating after optimization is essential. The score the optimizer maximizes is computed on data it saw during optimization and may not reflect generalization. A program with 100% training accuracy and 60% held-out accuracy is overfitting: the prompts have been tuned to artifacts of the training data.
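The train versus held-out comparison can be made explicit with a small helper. This is a sketch under assumptions: `generalization_report` and the `max_gap=0.15` threshold are hypothetical choices for illustration, not a standard, and the memorizing program is a deliberately extreme stand-in for an overfit prompt.

```python
def accuracy(program, examples, metric):
    """Fraction of examples whose prediction passes the metric."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(1 for ex in examples if metric(ex, program(ex["text"])))
    return passed / len(examples)


def generalization_report(program, train, heldout, metric, max_gap=0.15):
    """Compare training and held-out accuracy; flag a gap above max_gap."""
    train_acc = accuracy(program, train, metric)
    heldout_acc = accuracy(program, heldout, metric)
    gap = train_acc - heldout_acc
    return {"train": train_acc, "heldout": heldout_acc,
            "gap": gap, "overfit_suspected": gap > max_gap}


train = [
    {"text": "CEO John Morrison announced earnings", "name": "John Morrison"},
    {"text": "CFO Ana Lima resigned", "name": "Ana Lima"},
]
heldout = [{"text": "CTO Wei Chen presented the roadmap", "name": "Wei Chen"}]

# Extreme overfitting: the program has memorized its training inputs.
memorized = {ex["text"]: ex["name"] for ex in train}


def memorizing_program(text):
    return memorized.get(text, "")


def name_match(example, pred):
    return pred.strip().lower() == example["name"].strip().lower()


report = generalization_report(memorizing_program, train, heldout, name_match)
print(report)
```

With held-out sets as small as this one, a single example moves accuracy by a large step, so treat the gap as a warning sign to investigate and not as a precise measurement.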

EXERCISE

Take a signature from the previous chapter and create a minimal labeled dataset (10-20 examples). Set a few examples aside as held-out data before optimizing. Run BootstrapFewShot with max_bootstrapped_demos=3 on the rest and evaluate on the held-out examples. Compare accuracy to the uncompiled baseline. Determine whether optimization helped and whether the improvement is likely to generalize.
