RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced Prompt Engineering

10. Automated Prompt Tuning

Chapter 10 of 18 · 15 min
KEY INSIGHT

Automated tuning amplifies whatever bias exists in the evaluation criteria: a prompt that scores 95% on a flawed rubric can still fail in production.
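To make the insight concrete, here is a minimal sketch of a deliberately flawed rubric. The keyword list, both sample answers, and the function name are hypothetical; the only point is that a tuner maximizing this score would prefer the worse answer.

```python
# Hypothetical rubric: rewards answers that mention expected keywords.
KEYWORDS = {"refund", "policy", "sorry"}

def keyword_rubric(answer: str) -> float:
    """Fraction of rubric keywords that appear in the answer."""
    words = set(answer.lower().replace(".", " ").replace(",", " ").split())
    return len(KEYWORDS & words) / len(KEYWORDS)

# Made-up answers for illustration only.
helpful = "You can get a refund within 30 days. I have started it for you."
gamed = "Sorry, sorry. Refund policy, refund policy."

print(keyword_rubric(helpful))  # lower score, although it solves the problem
print(keyword_rubric(gamed))    # higher score, although it says nothing useful
```

An automated loop only sees the number, so checking the rubric against a few hand-graded answers before tuning is a cheap safeguard.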

Prompt tuning involves systematically modifying prompts to improve output quality without changing the underlying model. Automated tuning goes further by using feedback loops to optimize prompts programmatically.

Gradient-Based vs. Discrete Tuning

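There are two broad approaches to automated prompt tuning:

Discrete tuning edits the prompt text directly and keeps the edits that improve an evaluation metric. This works well for structured prompts, where each change is visible and its effect is measurable. It needs only the ability to call the model, so it also works with hosted APIs.

Continuous (soft) prompting is the gradient-based approach: it trains embedding vectors that take the place of discrete prompt tokens while the model weights stay frozen. These learned prompts live in the model's embedding space rather than in human-readable text, and training them requires access to the model's gradients.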
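The idea behind soft prompting can be sketched with a toy example. Everything below is a hypothetical stand-in rather than a real language model: the "model" is a frozen weight vector applied to mean-pooled embeddings, and the names `FROZEN_W`, `TOKEN_EMB`, `forward`, and `train_soft_prompt` are invented for illustration. What it shows is the defining property of the technique: gradient descent updates only the prepended prompt vectors, never the model.

```python
# Toy soft prompt tuning: the "model" is frozen, only prompt vectors are trained.
FROZEN_W = [0.5, -1.0, 2.0]  # frozen model weights (never updated)
TOKEN_EMB = {"good": [1.0, 0.0, 0.0], "movie": [0.0, 1.0, 0.0]}  # frozen embeddings

def forward(soft_prompt, tokens):
    # Prepend the learned vectors to the token embeddings, mean-pool, project.
    seq = soft_prompt + [TOKEN_EMB[t] for t in tokens]
    dim = len(FROZEN_W)
    pooled = [sum(vec[i] for vec in seq) / len(seq) for i in range(dim)]
    return sum(w * p for w, p in zip(FROZEN_W, pooled))

def train_soft_prompt(tokens, target, n_vectors=2, lr=0.5, steps=200):
    # Start from zero vectors; these are the only trainable parameters.
    soft_prompt = [[0.0] * len(FROZEN_W) for _ in range(n_vectors)]
    seq_len = n_vectors + len(tokens)
    for _ in range(steps):
        error = forward(soft_prompt, tokens) - target
        # Gradient of error**2 with respect to each prompt coordinate
        # under mean pooling is 2 * error * w_i / seq_len.
        for vec in soft_prompt:
            for i, w in enumerate(FROZEN_W):
                vec[i] -= lr * 2 * error * w / seq_len
    return soft_prompt

tokens = ["good", "movie"]
learned = train_soft_prompt(tokens, target=1.0)
print(forward(learned, tokens))  # close to the target after training
print(learned)                   # vectors with no human-readable meaning
```

With a real model the same pattern is usually implemented in a deep learning framework, where the prompt vectors are the only parameters handed to the optimizer.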

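# Discrete tuning example: first-improvement hill climbing over prompt edits.
# `complete` is a caller-supplied function (prompt string -> response string),
# e.g. a thin wrapper around an OpenAI Evals CompletionFn or a local model.
# The scorer and mutation helpers below are simple illustrative stand-ins.

def evaluate_response(response, expected):
    # Exact match after trimming whitespace; swap in a task-specific scorer.
    return 1.0 if response.strip() == expected.strip() else 0.0

def evaluate_prompt(prompt_template, test_cases, complete):
    # Mean score of the template across all test cases.
    scores = []
    for case in test_cases:
        response = complete(prompt_template.format(**case["input"]))
        scores.append(evaluate_response(response, case["expected"]))
    return sum(scores) / len(scores)

def add_few_shot_example(prompt):
    return prompt + "\nExample: Q: What is 2 + 2? A: 4"

def add_step_by_step_instruction(prompt):
    return prompt + "\nThink through the problem step by step."

def add_constraint_clause(prompt):
    return prompt + "\nAnswer with the final result only."

def simplify_grammar(prompt):
    # Collapse repeated spaces within each line as a stand-in for simplification.
    return "\n".join(" ".join(line.split()) for line in prompt.splitlines())

def tune_discrete_prompt(prompt, test_cases, complete, iterations=20):
    current_prompt = prompt
    best_score = evaluate_prompt(current_prompt, test_cases, complete)

    for _ in range(iterations):
        # Generate candidate modifications of the current best prompt
        candidates = [
            add_few_shot_example(current_prompt),
            add_step_by_step_instruction(current_prompt),
            add_constraint_clause(current_prompt),
            simplify_grammar(current_prompt),
        ]

        for candidate in candidates:
            score = evaluate_prompt(candidate, test_cases, complete)
            if score > best_score:
                # Accept the first improving candidate and start a new round
                current_prompt = candidate
                best_score = score
                break
        else:
            # No candidate improved the score: the search has converged
            break

    return current_prompt, best_score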

Tuning with Local Models

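Running the evaluation model locally with Ollama removes per-token API costs and rate limits, so a tuning loop can make many evaluation calls cheaply. Iteration is faster only if your hardware runs the model at a usable speed: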

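# Grade one tutoring response with a local judge model during tuning.
# RESPONSE is an example value; replace it with the response being graded.
# The heredoc delimiter is unquoted so the shell expands $RESPONSE.
RESPONSE="Let's find the first step together. What do you already know about fractions?"
ollama run llama3:70b << EOF
Analyze this tutoring response for:
1. Scaffolding level (1-5)
2. Conceptual clarity (1-5)
3. Engagement quality (1-5)

Response: $RESPONSE

Output JSON: {"scaffolding": X, "clarity": Y, "engagement": Z}
EOF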
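A tuning loop needs the judge's scores as numbers, not as free text. The sketch below assumes a local Ollama server on its default port (11434) and uses its `/api/generate` endpoint. The helper names `ask_ollama`, `parse_judge_scores`, and `mean_score` are hypothetical, and the parser assumes the judge returns one flat JSON object, possibly surrounded by extra prose.

```python
import json
import re
import urllib.request

RUBRIC_KEYS = ("scaffolding", "clarity", "engagement")

def ask_ollama(prompt: str, model: str = "llama3:70b") -> str:
    """Send one prompt to a local Ollama server and return the text reply.

    Assumes the server is running at the default http://localhost:11434.
    """
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def parse_judge_scores(raw: str) -> dict:
    """Extract the first flat JSON object from the reply and validate 1-5 scores."""
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))
    scores = {}
    for key in RUBRIC_KEYS:
        value = data.get(key)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
        scores[key] = value
    return scores

def mean_score(scores: dict) -> float:
    """Collapse the rubric into one number for the tuner to maximize."""
    return sum(scores.values()) / len(scores)
```

Rejecting malformed or out-of-range output, rather than silently scoring it as zero, keeps judge failures from being mistaken for bad prompts.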

Common Failure Modes

Automated tuning can produce prompts that score well on the test cases but fail on edge cases. This happens when the test suite lacks diversity, because the tuner optimizes only for the inputs it sees. Always include adversarial cases in the evaluation set.
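One way to keep edge cases from being averaged away is to tag each test case with a category and report scores per category. The sketch below uses made-up scores and hypothetical category names (`typical`, `adversarial`); it only illustrates how a healthy overall average can hide a weak category.

```python
from collections import defaultdict

def score_by_category(results):
    """results: iterable of (category, score) pairs with scores in [0, 1]."""
    buckets = defaultdict(list)
    for category, score in results:
        buckets[category].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}

# Made-up scores for illustration only.
results = [
    ("typical", 1.0), ("typical", 1.0), ("typical", 1.0), ("typical", 1.0),
    ("adversarial", 0.0), ("adversarial", 1.0),
]
overall = sum(score for _, score in results) / len(results)
per_category = score_by_category(results)
weakest = min(per_category, key=per_category.get)
print(f"overall={overall:.2f}", per_category, f"weakest={weakest}")
```

Tuning against the weakest category's score, or requiring a minimum score in every category, are both reasonable variations on this idea.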

A related failure mode is prompt overfitting: the tuned prompt performs very well on the evaluation data but noticeably worse on new inputs. To detect it, keep a holdout set that the tuner never scores against and compare results on both sets.
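A holdout check can be sketched in a few lines. The function names, the 25% holdout fraction, and the `score_fn(prompt, cases)` signature are assumptions made for this example; any scoring function that returns a mean score for a prompt over a list of cases would fit.

```python
import random

def split_cases(cases, holdout_fraction=0.25, seed=0):
    """Shuffle deterministically and split into (tuning, holdout) lists."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def overfit_gap(score_fn, prompt, tuning, holdout):
    """Tuning score minus holdout score; a large positive gap suggests overfitting."""
    return score_fn(prompt, tuning) - score_fn(prompt, holdout)
```

Run the tuner on the tuning split only, then call `overfit_gap` once with the final prompt. Checking the holdout repeatedly during tuning slowly turns it into a second tuning set.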

EXERCISE

Implement a simple discrete tuner that modifies a customer service prompt by adding or removing instructions, then evaluate it against a 20-case test set. Measure how many iterations it takes to converge (score improvement < 0.01 between iterations).
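As a starting point for the measurement step only, the convergence criterion can be expressed as a small helper. This is a sketch under the assumption that you record the best score after every iteration; the function name and the list layout are hypothetical.

```python
def iterations_until_convergence(scores, tol=0.01):
    """Return the 1-based iteration at which improvement first drops below tol.

    scores[0] is the baseline; scores[i] is the best score after iteration i.
    Returns None if every recorded iteration improved by at least tol.
    """
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < tol:
            return i
    return None
```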
