HOW-TO · SUP

How to use DSPy for prompt optimization

Advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites

DSPy installed, LLM endpoint, optimization target

What this does

DSPy automates the search for an effective prompt for a given task. Instead of manually tweaking prompt text, DSPy treats prompting as a machine learning optimization problem. A program is written from declarative modules built on signatures, a training dataset of input-output examples is provided, and a DSPy optimizer tunes the prompt instructions, the few-shot examples, or both to maximize a specified metric. The result is a programmatically optimized prompt that often scores higher on that metric than a hand-crafted one, although the gain depends on the task, the model, and the quality of the training data.

Steps

  • Configure the LLM connection. import dspy; lm = dspy.LM("ollama_chat/llama3", api_base="http://localhost:11434", api_key="", max_tokens=500); dspy.configure(lm=lm). On DSPy releases before 2.5, use the legacy client instead: lm = dspy.OllamaLocal(model="llama3", max_tokens=500); dspy.settings.configure(lm=lm).

  • Define a signature that declares the input-output structure. class QASignature(dspy.Signature): question = dspy.InputField(); answer = dspy.OutputField().

  • Build a module using this signature. class QABot(dspy.Module): def __init__(self): super().__init__(); self.generate = dspy.ChainOfThought(QASignature); def forward(self, question): return self.generate(question=question).

  • Prepare the training data as a list of dspy.Example objects. trainset = [dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"), ...].

  • Define a metric. def accuracy_metric(example, pred, trace=None): return int(example.answer.lower() == pred.answer.lower()).

  • Run optimization. Choose an optimizer based on needs. BootstrapFewShot generates few-shot examples: optimizer = dspy.BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=4) and optimized_bot = optimizer.compile(QABot(), trainset=trainset). With more data, BootstrapFewShotWithRandomSearch searches over several candidate demo sets, and MIPROv2 also tunes the instructions.

  • Save the optimized program. optimized_bot.save("optimized_qa_bot.json").

  • Evaluate on a held-out test set. evaluator = dspy.Evaluate(devset=devset, metric=accuracy_metric); score = evaluator(optimized_bot).

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

Run the optimized program on 5 test examples and confirm that its accuracy exceeds that of the baseline (unoptimized) program. Inspect the saved JSON: it should contain the few-shot demonstrations the optimizer selected and, for instruction optimizers such as MIPROv2, the rewritten prompt instructions. Compare output quality by hand: the optimized program should give more consistent and accurate answers than the raw signature. Run the evaluation on the full test set and verify that the reported score matches manual spot-checks. Finally, ask a different but related question that is not in the training set to check that the optimization generalized rather than overfitting to the exact training examples.
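
The baseline comparison and the JSON inspection can be scripted. The sketch below is plain Python and makes two assumptions: programs are callables that accept question=... and return an object with an answer attribute (which is how the QABot module is used in this guide), and the saved state stores demonstrations under keys named "demos", a layout that may differ between DSPy versions. The names exact_match, mean_score, count_demos, count_demos_in_file, and both stub programs are hypothetical.

```python
import json
from types import SimpleNamespace


def exact_match(example, pred, trace=None):
    # Case-insensitive exact match that also ignores surrounding whitespace.
    return int(example.answer.strip().lower() == pred.answer.strip().lower())


def mean_score(program, examples, metric):
    # Average the metric over the examples; an empty list scores 0.0.
    scores = [metric(ex, program(question=ex.question)) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0


def count_demos(node):
    # Recursively count list entries stored under any "demos" key.
    if isinstance(node, dict):
        total = 0
        for key, value in node.items():
            if key == "demos" and isinstance(value, list):
                total += len(value)
            else:
                total += count_demos(value)
        return total
    if isinstance(node, list):
        return sum(count_demos(item) for item in node)
    return 0


def count_demos_in_file(path):
    # Load a saved program state and count its demonstrations.
    with open(path, encoding="utf-8") as handle:
        return count_demos(json.load(handle))


# Hypothetical stand-ins for the unoptimized and optimized programs.
def baseline_stub(question):
    return SimpleNamespace(answer="I am not sure")


def optimized_stub(question):
    answers = {"What is the capital of France?": "paris"}
    return SimpleNamespace(answer=answers.get(question, "unknown"))


test_examples = [
    SimpleNamespace(question="What is the capital of France?", answer="Paris"),
    SimpleNamespace(question="What is the capital of Japan?", answer="Tokyo"),
]

baseline_score = mean_score(baseline_stub, test_examples, exact_match)
optimized_score = mean_score(optimized_stub, test_examples, exact_match)
print(f"baseline={baseline_score:.2f} optimized={optimized_score:.2f}")
```

In a real run, replace the stubs with QABot() and the compiled program, and call count_demos_in_file("optimized_qa_bot.json") to confirm that the saved file holds at least one demonstration.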

Common failures

  • Optimizer runs out of memory - Reduce max_bootstrapped_demos or limit max_labeled_demos when using large training sets.
  • Metric always returns zero - Verify that the metric function correctly extracts fields from the prediction object; add debug print statements to inspect the format of pred.answer.
  • Local model too slow for optimization - Optimize on a smaller subset of the training set, lower max_bootstrapped_demos, or use a smaller model for bootstrapping and a larger one for evaluation.
  • Overfitting to training examples - Reserve at least 20% of the data as a dev set and monitor the metric gap between train and dev; if the gap exceeds 0.15, increase the training set size or reduce the demo count.
  • Signature fields mismatch - Ensure the OutputField name in the signature matches what the forward method returns and what the metric function expects.
  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
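
Two of the failures above, a metric that always returns zero and overfitting to the training examples, can be checked with a few lines of plain Python. This is a sketch under stated assumptions: split_train_dev, is_overfitting, and debug_metric are hypothetical helper names, the 20% dev fraction and 0.15 gap are the thresholds quoted above, and predictions are assumed to expose the answer as an attribute.

```python
import random
from types import SimpleNamespace


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    # Shuffle a copy with a fixed seed, then hold out dev_fraction as a dev set.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def is_overfitting(train_score, dev_score, max_gap=0.15):
    # Flag a train/dev metric gap larger than the allowed threshold.
    return (train_score - dev_score) > max_gap


def debug_metric(example, pred, trace=None):
    # Report a missing output field instead of silently scoring zero.
    predicted = getattr(pred, "answer", None)
    if predicted is None:
        print(f"prediction has no 'answer' field: {pred!r}")
        return 0
    expected = example.answer.strip().lower()
    return int(expected == str(predicted).strip().lower())


# Hypothetical data: ten placeholder items split 80/20.
train_items, dev_items = split_train_dev(list(range(10)))
print(len(train_items), len(dev_items))
```

Pass debug_metric to the optimizer in place of the original metric while diagnosing, then switch back once the printed output shows that the prediction fields are populated as expected.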

Related guides

  • setup-prompt-layer-prompt-management
  • build-rag-evaluation-pipeline
  • implement-ab-testing-model-responses