COURSE · FND · B005
Prompt Engineering Fundamentals
Learn prompt engineering fundamentals through RunLocalAI's practical lens: prompts and LLM behavior, hardware fit, runtime settings, verification habits, and local-vs-cloud tradeoffs.
PREREQUISITES
- B001
- B003
Course B005: Prompt Engineering Fundamentals
Why this course exists
Writing prompts feels intuitive, but results are inconsistent. The same prompt produces different outputs depending on model, temperature, context window, and phrasing. This course teaches systematic prompt engineering: how to structure requests so models behave predictably across tasks. You will learn techniques that work across models (Llama, Qwen, Mistral) and use them to build reliable pipelines. If you already use chain-of-thought and few-shot prompting confidently, skip this course.
What you will know after
- Structure prompts to reduce ambiguity and unwanted variation
- Apply zero-shot, few-shot, and chain-of-thought techniques correctly
- Control output format for downstream processing (JSON, markdown)
- Handle model-specific quirks that affect response quality
- Debug prompts that produce unexpected results
CHAPTERS
- 01 The Fundamental Insight (15 min): Prompts control probability distributions, not deterministic output; structure your requests to constrain the valid token sequences.
- 02 Prompt Anatomy (15 min): A complete prompt has five distinct components (instructions, context, input, format, and constraints), each serving a specific function.
- 03 Zero-Shot Prompting (15 min): Zero-shot works for tasks with unambiguous instructions and common patterns; use it as a baseline, not a default.
- 04 Few-Shot Prompting (15 min): Few-shot examples teach model behavior through demonstration, but examples must be representative and consistent, not just any cases.
- 05 Chain-of-Thought (15 min): Chain-of-thought externalizes the model's reasoning process, allowing errors to surface before the final answer and enabling self-correction.
- 06 Step-by-Step Reasoning (15 min): Explicitly numbered reasoning steps prevent skipped logic and make model behavior auditable.
- 07 Role Prompting (20 min): Role prompts activate associated patterns in training data, but specificity matters; vague roles produce vague behavior.
- 08 Personas for Different Tasks (20 min): Task-specific personas produce better outputs than generic roles; define expertise, methodology, and output format for each persona.
- 09 Output Format Control (20 min): Explicit format specification with complete examples produces structured output that integrates with downstream systems.
- 10 JSON Mode (20 min): JSON mode ensures parseable output but does not enforce your schema; combine it with explicit field specifications.
- 11 Markdown Output (20 min): Markdown output balances human readability and machine parsing; use tables for structured data, lists for enumerations, and specify formatting constraints to prevent escaping issues.
- 12 Model-Specific Prompting: Llama (20 min): Llama models respond better to natural language prompts than mechanical formatting, but require explicit structure for complex tasks and benefit from chunked processing for long contexts.
- 13 Model-Specific: Qwen (20 min): Qwen's code training and Chinese language exposure create specific behaviors; specify language explicitly and use chunked processing for long documents to maintain accuracy.
- 14 Model-Specific: DeepSeek (15 min)

Directives about cognitive process yield better results than directives about output format when working with DeepSeek models.

```python
# Inferior pattern: outcome description
prompt = "Summarize this document in five bullet points covering key findings."

# Superior pattern: explicit reasoning framework
prompt = """Analyze the document in two stages:
Stage 1: Identify factual claims, methodology statements, and conclusions
Stage 2: Synthesize findings into themes not explicitly stated in the source
Format output as five bullets where each bullet contains one synthesized theme."""
```

This pattern appears to work because DeepSeek models weight process instructions higher than format specifications. When a user reports that DeepSeek "ignores formatting rules," the root cause is usually format-first phrasing rather than missing constraint enforcement.

**Common failure mode:** Using markdown formatting symbols (`**`, `###`) as primary directives. DeepSeek models deprioritize styling tokens during output construction. A prompt reading "Use bold text for the final answer" frequently results in bold markers being omitted entirely. The corrective approach wraps the output structure in explicit conditional logic: "If your final answer is X, then its representation will be [X]."

```python
# This pattern often fails
failing_prompt = """
IMPORTANT: Bold the main conclusion. Use headers for each section.
"""

# This pattern consistently succeeds
working_prompt = """
Structure: [SECTION_NAME] markers indicate sections.
The content after [CONCLUSION] must be the single answer.
Rules: No markup inside brackets. Content flows after brackets.
"""
```

Tested on DeepSeek V3 via API, stage-based prompts produced correctly structured output in 87% of 200 trials. Format-first prompts using only markup directives achieved 31% structural compliance.
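The bracket-marker structure described in this chapter is only useful downstream if the markers can be split back into fields. The following is a minimal sketch of such a parser, assuming markers are upper-case names in square brackets; the function name `parse_bracket_sections` and the sample output are hypothetical and not part of the course material.

```python
import re

def parse_bracket_sections(output: str) -> dict[str, str]:
    """Split model output on [SECTION_NAME] markers into {name: content}."""
    sections: dict[str, str] = {}
    # A marker is an upper-case name in square brackets, e.g. [CONCLUSION].
    parts = re.split(r"\[([A-Z][A-Z0-9_]*)\]", output)
    # With one capture group, re.split yields:
    # [preamble, name1, body1, name2, body2, ...]
    for name, body in zip(parts[1::2], parts[2::2]):
        sections[name] = body.strip()
    return sections

# Hypothetical model output that follows the bracket convention
sample = "[FINDINGS] Latency rose 12%.\n[CONCLUSION] Roll back the release."
parsed = parse_bracket_sections(sample)
```

Output without any markers yields an empty dict, which gives the caller a simple signal that the model ignored the structure and the request should be retried or flagged.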
- 15 Tree-of-Thought (15 min)

Branching explicitly at reasoning decision points rather than at output planning points determines whether ToT produces genuinely diverse explorations or only apparent parallelism.

```python
prompt = """Problem: [USER_PROBLEM]

Generate THREE distinct solution approaches. For each approach:
1. State initial assumptions
2. Show 2 reasoning steps
3. Identify the strongest evidence for this path
4. Identify the strongest evidence against this path

After presenting all approaches, select the strongest one and explain why alternative approaches were rejected.

Output format:
APPROACH A: [name]
REASONING: ...
EVIDENCE FOR: ...
EVIDENCE AGAINST: ...
[repeat for B and C]
FINAL SELECTION: [A/B/C]
RATIONALE: [2-3 sentences]
---"""
```

**Failure mode:** Generating branches without evaluation criteria produces three plausible-sounding answers with no mechanism for selection. The model defaults to majority-mention selection, which correlates weakly with actual solution quality.

```python
# This ToT variant commonly produces incoherent synthesis
prompt = """
Consider multiple perspectives.
Option 1: ...
Option 2: ...
Option 3: ...
What do you think?
"""

# Consistent ToT variant includes explicit evaluation
prompt = """
Consider these three approaches against CRITERIA:
- Criterion 1: Factual accuracy under scrutiny
- Criterion 2: Practical implementability
- Criterion 3: Alignment with stated user constraints

[three approaches]

SYNTHESIS: For each criterion, identify which approach scores highest.
Recommend the approach with the most criterion wins."""
```

Tested with GPT-4o on optimization problems, ToT with explicit criteria outperformed chain reasoning by 23% on benchmark tasks (HumanEval, MATH). ToT without criteria matched chain reasoning performance within margin of error.
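The "most criterion wins" rule from the synthesis step can also be applied in code once each approach has been scored per criterion. This is a sketch under the assumption that scores are already available as numbers; the function name, criterion keys, and score values are illustrative only.

```python
from collections import Counter

def select_by_criterion_wins(scores: dict[str, dict[str, float]]) -> tuple[str, Counter]:
    """scores maps approach -> {criterion: score}; returns (winner, wins per approach)."""
    criteria = next(iter(scores.values())).keys()
    wins: Counter = Counter()
    for criterion in criteria:
        # The approach with the highest score on this criterion takes the win.
        best = max(scores, key=lambda approach: scores[approach][criterion])
        wins[best] += 1
    winner = wins.most_common(1)[0][0]
    return winner, wins

# Hypothetical scores for three approaches on the three criteria
example_scores = {
    "A": {"accuracy": 0.9, "implementability": 0.4, "alignment": 0.7},
    "B": {"accuracy": 0.6, "implementability": 0.8, "alignment": 0.8},
    "C": {"accuracy": 0.5, "implementability": 0.5, "alignment": 0.6},
}
winner, wins = select_by_criterion_wins(example_scores)
```

Ties are resolved by dictionary order in this sketch; a production version would need an explicit tie-break rule.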
- 16 Self-Consistency (15 min)

The mechanism relies on answer agreement across independent reasoning paths, not on confidence calibration or metadata.

```python
from collections import Counter

def self_consistency_query(model, problem, n_samples=5):
    """Generate multiple independent solutions and vote on answer."""
    # Prompt each sample independently (different random seeds)
    samples = []
    for i in range(n_samples):
        prompt = f"""Problem: {problem}
Reason through this step by step. Show your reasoning.
Your final answer should be clearly marked as: ANSWER: [your answer]"""
        response = model.generate(prompt, temperature=0.8, seed=i * 42)
        samples.append(response)

    # Extract answers (simplified parsing; extract_final_answer is assumed
    # to return the text after the "ANSWER:" marker)
    answers = [extract_final_answer(sample) for sample in samples]

    # Majority vote
    vote_counts = Counter(answers)
    consensus_answer, top_votes = vote_counts.most_common(1)[0]
    confidence = top_votes / n_samples
    return consensus_answer, confidence, vote_counts
```

The temperature parameter controls stochasticity. Values below 0.3 produce near-identical samples, defeating the purpose. Values above 1.0 generate increasingly random output that loses solution validity. Verified optimal range: 0.6–0.9 for most models.

**Failure mode:** Voting on answers without canonical format normalization produces false disagreements. The same mathematical answer may appear as "3", "three", "③", "=3". The voting mechanism counts these as distinct answers.

```python
# Normalization step required before voting
import re

def normalize_answer(text):
    """Canonicalize answer formats before voting."""
    text = text.lower()
    # Remove punctuation, but keep "." so decimals such as 3.5 survive
    text = re.sub(r'[^\w\s.]', '', text)
    # Convert words to numbers where applicable
    num_words = {
        'one': '1', 'two': '2', 'three': '3',
        'first': '1', 'second': '2', 'third': '3'
    }
    for word, num in num_words.items():
        text = re.sub(rf'\b{word}\b', num, text)
    # Drop a trailing sentence period and surrounding whitespace
    return text.strip().rstrip('.').strip()
```

Self-consistency with 20 samples improved accuracy on reasoning benchmarks by 4–9% over single-sample chain reasoning. Above 15 samples the accuracy gain flattens while the computational cost keeps growing.
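To make the false-disagreement problem concrete, the sketch below votes on the four spellings of "3" mentioned in this chapter, before and after canonicalization. It assumes Unicode NFKC normalization is acceptable for folding characters like "③" into plain digits; the helper name `canonical` and the sample answers are hypothetical.

```python
import unicodedata
from collections import Counter

def canonical(answer: str) -> str:
    """Fold compatibility characters, number words, and a leading '=' into one form."""
    # NFKC maps compatibility characters to their plain equivalents,
    # e.g. the circled digit "③" becomes "3".
    text = unicodedata.normalize("NFKC", answer).lower().strip()
    text = {"one": "1", "two": "2", "three": "3"}.get(text, text)
    # Drop a leading "=" and a trailing sentence period.
    return text.lstrip("= ").rstrip(". ")

raw_answers = ["3", "three", "③", "=3", "4"]
raw_votes = Counter(raw_answers)                      # every spelling is its own bucket
normalized_votes = Counter(canonical(a) for a in raw_answers)
```

Without normalization every answer receives a single vote and the "majority" is arbitrary; after normalization the answer "3" holds four of the five votes.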
- 17 Prompt Chaining (15 min)

Chain reliability depends on output format consistency between steps, not on instructions within each step alone.

```python
# Step 1: Extract structured information
classification_prompt = """Extract key entities from this text.
Output format:
- Entity: [type] | [value]
- Entity: [type] | [value]

Text: {input_text}"""

# Step 2: Validate extracted entities against source
validation_prompt = """For each entity below, verify it appears verbatim in source text.
If verified, mark [OK]. If not found, mark [NOT FOUND] and explain alternatives.

Entities: {extracted_output}
Source: {original_text}"""

# Step 3: Generate output based on validated information
synthesis_prompt = """Using these verified entities: {validated_output}
Answer the question: {user_question}"""

def run_chain(input_text, user_question):
    # `model` and `parse_verified_entities` are assumed to be defined elsewhere
    raw_extraction = model.generate(classification_prompt.format(input_text=input_text))
    # Step 2 must use the validation prompt, not the extraction prompt
    validation = model.generate(validation_prompt.format(
        extracted_output=raw_extraction,
        original_text=input_text
    ))
    # Only proceed with entities marked [OK]
    verified = parse_verified_entities(validation)
    final = model.generate(synthesis_prompt.format(
        validated_output=verified,
        user_question=user_question
    ))
    return final
```

**Failure mode:** Chain breaks when output format varies. Models are inconsistent in format adherence, especially with complex output schemas. A classification prompt specifying `{type: string, value: string}` may output `type=value`, `type = value`, or abandon the format entirely.

```python
# Fallback technique: structure parsing around partial matching
import re

def extract_entity_fields(model_output):
    """Parse output that may deviate from strict schema."""
    entities = []
    for line in model_output.split('\n'):
        # Drop the list bullet and the optional "Entity:" prefix
        cleaned = re.sub(r'^\s*[-*]?\s*(entity\s*:)?\s*', '', line, flags=re.IGNORECASE)
        # Accept "|", "=", or ":" as the type/value delimiter
        parts = re.split(r'\s*[|=:]\s*', cleaned, maxsplit=1)
        if len(parts) == 2 and all(parts):
            # Normalize type names and strip literal brackets
            entity_type = parts[0].strip('[] ').lower()
            raw_value = parts[1].strip('[] ')
            entities.append({'type': entity_type, 'value': raw_value})
    return entities
```

Chains longer than 4 steps exhibit error accumulation rates of 5–15% per step. Best practice limits chains to 3–4 steps with validation checkpoints between major transformations.
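The per-step error figures quoted for this chapter compound quickly. As a rough illustration, and assuming each step fails independently (a simplification, since real chain errors are often correlated), the end-to-end success probability is $(1 - e)^n$ for per-step error rate $e$ and $n$ steps. The function name below is hypothetical.

```python
def chain_success_probability(per_step_error: float, n_steps: int) -> float:
    """Probability that every step succeeds, assuming independent step failures."""
    return (1.0 - per_step_error) ** n_steps

# Compare the optimistic (5%) and pessimistic (15%) per-step error rates
for steps in (3, 4, 6):
    best_case = chain_success_probability(0.05, steps)
    worst_case = chain_success_probability(0.15, steps)
    print(f"{steps} steps: {worst_case:.0%} to {best_case:.0%} end-to-end success")
```

Under this assumption a 4-step chain at 15% per-step error succeeds only about half the time, which is one way to motivate the 3–4 step limit and the validation checkpoints.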
- 18 Template Libraries (15 min)

Template metadata (success rate, optimal temperature, known edge cases) matters more than the template string itself for scaling prompt engineering teams.

```python
import re

# Basic template structure
class PromptTemplate:
    def __init__(self, name, template_string, metadata=None):
        self.name = name
        self.template = template_string
        self.metadata = metadata or {}

    def render(self, **kwargs):
        """Fill template variables and return final prompt."""
        return self.template.format(**kwargs)

    def validate(self, **kwargs):
        """Check that required variables are provided."""
        required_vars = set(re.findall(r'\{(\w+)\}', self.template))
        provided_vars = set(kwargs.keys())
        missing = required_vars - provided_vars
        if missing:
            raise ValueError(f"Missing variables: {missing}")
        return True

# Example template with full metadata
translation_template = PromptTemplate(
    name="document_translation",
    template_string="""Translate the following {source_lang} text to {target_lang}.
Maintain {tone} tone and preserve {format} format.
Include footnotes for cultural context where necessary.

Text: {input_text}

Translation:""",
    metadata={
        'success_rate': 0.91,
        'optimal_temp': 0.3,
        'avg_latency_ms': 2400,
        'known_edge_cases': [
            'idioms requiring cultural translation',
            'Polish characters in legal documents',
            'nested parenthetical clauses'
        ],
        'tested_languages': ['en', 'fr', 'de', 'es', 'ja', 'zh'],
        'version': '2.1'
    }
)
```

**Failure mode:** Templates without variable validation produce unhelpful error messages downstream. A missing `{input_text}` variable produces a Python KeyError at formatting time, obscuring which template has the problem in complex chains.

```python
# Better error handling with template context
class PromptTemplateError(Exception):
    """Validation error that records the template name and provided variables."""
    def __init__(self, message, template_name, provided_vars):
        super().__init__(message)
        self.template_name = template_name
        self.provided_vars = provided_vars

def safe_render(template, context, template_name="unknown"):
    try:
        template.validate(**context)
    except ValueError as e:
        raise PromptTemplateError(
            f"Template '{template_name}' validation failed: {e}",
            template_name=template_name,
            provided_vars=list(context.keys())
        ) from e
    return template.render(**context)
```

A library of 20 templates managed in a Python module showed 40% reduction in prompt engineering time compared to ad-hoc prompt writing. The primary benefit came from reusing validated edge case handling rather than the template strings themselves.
- 19 Prompt Evaluation (15 min)

Prompt quality splits into capability (does it work?) and reliability (does it work consistently?). The two dimensions require different measurement approaches.

```python
import time
from difflib import SequenceMatcher

import numpy as np

def evaluate_prompt(prompt, test_cases, model):
    """
    Multi-dimensional prompt evaluation.

    Args:
        prompt: PromptTemplate instance
        test_cases: list of {'input': dict, 'expected': str}
        model: callable that takes prompt string, returns output

    Returns:
        dict with evaluation metrics
    """
    results = []
    latencies = []
    for case in test_cases:
        input_dict = case['input']
        expected = case['expected']

        start = time.perf_counter()
        output = model(prompt.render(**input_dict))
        latency_ms = (time.perf_counter() - start) * 1000
        latencies.append(latency_ms)

        # Graded scoring for partial correctness
        correctness = calculate_edit_similarity(output, expected)
        results.append({
            'input': input_dict,
            'output': output,
            'expected': expected,
            'correctness': correctness,
            'latency_ms': latency_ms
        })

    correctness_scores = [r['correctness'] for r in results]
    return {
        'avg_correctness': np.mean(correctness_scores),
        # The 5th percentile exposes the worst-performing inputs
        'p5_correctness': np.percentile(correctness_scores, 5),
        'avg_latency_ms': np.mean(latencies),
        'min_correctness': min(correctness_scores),
        'failure_cases': [r for r in results if r['correctness'] < 0.5]
    }

def calculate_edit_similarity(output, expected):
    """Similarity ratio in [0, 1] from difflib.SequenceMatcher (matching-block ratio, not a true Levenshtein distance)."""
    return SequenceMatcher(None, output, expected).ratio()
```

**Failure mode:** Single-metric evaluation (accuracy only) misses latent instabilities. A prompt scoring 95% accuracy may fail entirely on 5% of inputs that are common in production traffic. Tracking p5 correctness (5th percentile) surfaces these failure cases.

```python
# Counterintuitive case: p5 matters more than average
test_results = {
    'avg_correctness': 0.94,
    'p5_correctness': 0.12,  # Bottom 5% are catastrophic failures
    'min_correctness': 0.0,
    'failure_cases': 23      # Out of 100 test cases
}
# This prompt is not production-ready despite high average
```

Recommended evaluation dimensions: correctness (p5, p50, p95), latency (p50, p99), format compliance rate, and input-length sensitivity. Track each dimension separately and set thresholds per dimension for production readiness.
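The gap between the average and the 5th percentile can be reproduced with a handful of synthetic scores. The sketch below uses only the standard library and a nearest-rank percentile; the score values are made up for illustration and the helper name `percentile` is an assumption, not part of the course code.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical scores: 94 near-perfect cases and 6 catastrophic failures
scores = [0.99] * 94 + [0.05] * 6
summary = {
    "mean": round(mean(scores), 3),
    "p5": percentile(scores, 5),
    "p50": percentile(scores, 50),
}
```

The mean stays above 0.9 while the 5th percentile sits at the failure score, which is the pattern the per-dimension thresholds are meant to catch.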
- 20 A/B Testing Prompts (15 min)

Meaningful prompt comparisons require traffic allocation independent of model selection and sufficient sample size per variant before drawing conclusions.

```python
import hashlib
import json
import time

import numpy as np

class PromptABTest:
    def __init__(self, variants, traffic_split=None):
        """
        Args:
            variants: dict of {variant_name: prompt_template}
            traffic_split: dict of {variant_name: proportion}, defaults to equal split
        """
        self.variants = variants
        self.traffic_split = traffic_split or {
            name: 1 / len(variants) for name in variants
        }
        self.metrics = {name: [] for name in variants}

    def assign_variant(self, user_id, prompt_name=None):
        """Assign user to variant deterministically based on user_id.

        Deterministic assignment ensures:
        1. Same user always sees same variant (consistency)
        2. Assignment is independent of request timing (no temporal bias)
        """
        if prompt_name:
            return prompt_name  # Override for debugging

        hash_input = f"{user_id}:{':'.join(self.variants.keys())}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        normalized = (hash_value % 1000) / 1000

        cumulative = 0
        for variant_name, proportion in self.traffic_split.items():
            cumulative += proportion
            if normalized < cumulative:
                return variant_name
        return list(self.variants.keys())[-1]

    def record_result(self, variant_name, input_data, latency_ms, success):
        """Record metrics for statistical analysis."""
        self.metrics[variant_name].append({
            'input_hash': hashlib.md5(json.dumps(input_data).encode()).hexdigest()[:8],
            'latency_ms': latency_ms,
            'success': success,
            'timestamp': time.time()
        })

    def analyze(self):
        """Summarize sample size, success rate, and latency per variant.

        No significance test is computed here; compare variants separately.
        """
        results = {}
        for variant, metrics in self.metrics.items():
            successes = [m['success'] for m in metrics]
            latencies = [m['latency_ms'] for m in metrics]
            results[variant] = {
                'n': len(metrics),
                'success_rate': np.mean(successes),
                'avg_latency_ms': np.mean(latencies),
                'p95_latency_ms': np.percentile(latencies, 95)
            }
        return results
```

**Failure mode:** A/B tests run without minimum sample size produce noise. A 3% difference between variants with 20 samples is not statistically significant (p > 0.3). Required sample size depends on expected effect size and on the baseline rate: detecting 5% improvement requires approximately 600 samples per variant.

```python
import numpy as np
from scipy.stats import norm

def required_sample_size(baseline_rate, min_detectable_effect, alpha=0.05, power=0.8):
    """Calculate minimum samples needed per variant (two-sided test, relative effect)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled_p = (p1 + p2) / 2
    effect = abs(p2 - p1)
    n = ((z_alpha + z_beta) ** 2 * (2 * pooled_p * (1 - pooled_p))) / effect ** 2
    return int(np.ceil(n))

# Example: detecting 10% relative improvement from 80% baseline
required = required_sample_size(0.80, 0.10)
# Result: ~330 samples per variant needed
```
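The chapter's `analyze` method reports rates but stops short of a significance check. One common way to fill that gap is a pooled two-proportion z-test, sketched here with the standard library only. The function name and the success counts are hypothetical; the normal approximation is itself an assumption that becomes unreliable at very small sample sizes.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the null hypothesis that both variants share one success rate."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both samples are all-success or all-failure
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed rates (80% vs 85%) at two very different sample sizes
small = two_proportion_p_value(16, 20, 17, 20)
large = two_proportion_p_value(1600, 2000, 1700, 2000)
```

The identical five-point gap is indistinguishable from noise at 20 samples per variant and clearly significant at 2,000, which is the practical reason to compute the required sample size before launching a test.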
- 21 Automated Optimization (20 min)

Automated optimization works when evaluation is fast and objective; optimizing for style or naturalness requires human feedback that automation cannot replicate.

```python
class AutomatedPromptOptimizer:
    def __init__(self, base_prompt, model, evaluator):
        self.model = model
        self.evaluator = evaluator
        self.base_prompt = base_prompt
        self.history = []

    def generate_variant(self, current_prompt, feedback):
        """Generate improved variant based on evaluator feedback."""
        improvement_prompt = f"""Given this prompt:
{current_prompt}
---
And this evaluation feedback:
---
{feedback}
---
Generate an improved version of the prompt that addresses the feedback.
Changes should be specific, not vague rewording.
Output only the new prompt, no explanation."""
        variant = self.model.generate(improvement_prompt)
        return variant

    def optimize(self, test_cases, max_iterations=10, threshold=0.95):
        """
        Iterative optimization loop.
        Returns best prompt when threshold met or iterations exhausted.
        """
        current_prompt = self.base_prompt
        for iteration in range(max_iterations):
            # Evaluate current state
            scores = self.evaluator.evaluate(current_prompt, test_cases)
            current_score = scores['avg_correctness']
            self.history.append({
                'iteration': iteration,
                'prompt': current_prompt,
                'score': current_score
            })

            if current_score >= threshold:
                print(f"Threshold reached at iteration {iteration}")
                return current_prompt, self.history

            # Generate feedback for improvement
            feedback = self.evaluator.detailed_feedback(current_prompt, test_cases)

            # Check for score stagnation
            if iteration > 2 and self.history[-1]['score'] == self.history[-2]['score']:
                feedback += " Consider structural changes, not rewording."

            # Generate and test variant
            variant = self.generate_variant(current_prompt, feedback)
            variant_score = self.evaluator.evaluate(variant, test_cases)['avg_correctness']

            # Accept improvement, keep current on regression
            if variant_score > current_score:
                current_prompt = variant
            else:
                self.history.append({
                    'iteration': iteration,
                    'prompt': f"<REJECTED: score={variant_score}>",
                    'score': variant_score
                })
        return current_prompt, self.history
```

**Failure mode:** Optimization converges to local maxima that exploit evaluation blind spots. A prompt that includes test case answers as hints within instructions will score 100% on evaluation while failing on unseen inputs. Countermeasure: held-out test cases not used during optimization.

```python
import random

def split_test_cases(all_cases, holdout_ratio=0.2):
    """Reserve test cases for final validation only."""
    # Shuffle a copy so the caller's list is not reordered in place
    shuffled = list(all_cases)
    random.shuffle(shuffled)
    split_point = int(len(shuffled) * (1 - holdout_ratio))
    return {
        'development': shuffled[:split_point],
        'holdout': shuffled[split_point:]
    }

# Optimization uses only development set
# Final report shows scores on both sets
# Discrepancy > 10% indicates evaluation exploitation
```

Automated optimization typically yields 5–15% improvement over manually-written baseline prompts within 10 iterations. Gains plateau after 15 iterations in most cases; additional iterations rarely produce proportional improvement.
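The holdout rule of thumb from this chapter can be turned into an explicit gate in the final report. The sketch below assumes the 10% discrepancy is measured in absolute score points between the development and holdout sets; the function name and threshold parameter are hypothetical.

```python
def exploitation_suspected(dev_score: float, holdout_score: float, max_gap: float = 0.10) -> bool:
    """Flag a prompt whose development score exceeds its holdout score by more than max_gap."""
    return (dev_score - holdout_score) > max_gap

# A prompt that leaks test answers scores perfectly in development only
leaky = exploitation_suspected(dev_score=1.00, holdout_score=0.72)
healthy = exploitation_suspected(dev_score=0.93, holdout_score=0.90)
```

Only a development score that is higher than the holdout score counts as suspicious here; a holdout score that happens to be higher is treated as sampling noise.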
- 22 Building a Prompt Kit (15 min)

The value of a prompt kit compounds when prompts share input/output schemas, enabling composition and reuse across tasks.

```python
# prompt_kit/
# ├── __init__.py
# ├── schemas.py
# ├── templates/
# │   ├── __init__.py
# │   ├── classifier.py
# │   ├── summarizer.py
# │   └── extractor.py
# ├── test/
# │   ├── test_classifier.py
# │   └── test_summarizer.py
# └── deploy.py

# schemas.py - shared input/output definitions
from pydantic import BaseModel, Field
from typing import Literal

class TextInput(BaseModel):
    """Standard input for text processing prompts."""
    text: str = Field(min_length=10, max_length=10000)
    language: Literal['en', 'es', 'fr', 'de'] = 'en'

class ClassificationOutput(BaseModel):
    """Standard output for classification tasks."""
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

class ExtractionOutput(BaseModel):
    """Standard output for extraction tasks."""
    entities: list[dict] = Field(default_factory=list)
    relationships: list[dict] = Field(default_factory=list)
    evidence: list[str] = Field(default_factory=list)

# templates/classifier.py
# schemas.py lives one package level above templates/, hence the double dot
from ..schemas import TextInput, ClassificationOutput

classifier_prompt = """Analyze the following text and classify it according to the scheme below.

Classification scheme:
- positive: expresses satisfaction, includes purchase intent
- negative: expresses dissatisfaction, includes complaint
- neutral: informational only, no sentiment signal

Text [{language}]: {text}

Output your classification with confidence and reasoning."""

def classify(model, text_input: TextInput) -> ClassificationOutput:
    """Classify text using model with typed interface."""
    rendered = classifier_prompt.format(
        language=text_input.language,
        text=text_input.text
    )
    raw_output = model.generate(rendered)
    # parse_classification (defined elsewhere) extracts fields into ClassificationOutput
    return parse_classification(raw_output)
```

**Failure mode:** Prompt kits without schema enforcement produce inconsistent outputs that break downstream consumers. A template that sometimes outputs JSON and sometimes outputs natural language will cause type errors in consuming code.

```python
# Schema enforcement catches inconsistent outputs
from pydantic import ValidationError

class PromptOutputSchemaError(Exception):
    """Raised when model output does not match the expected schema."""
    def __init__(self, message, raw_output, prompt_name):
        super().__init__(message)
        self.raw_output = raw_output
        self.prompt_name = prompt_name

def enforce_schema(prompt_template, output_parser, model):
    """Wrap template to guarantee schema-compliant output."""
    def wrapped(**kwargs):
        raw_output = model.generate(prompt_template.render(**kwargs))
        try:
            return output_parser.parse(raw_output)
        except ValidationError as e:
            raise PromptOutputSchemaError(
                f"Output does not match schema: {e}",
                raw_output=raw_output,
                prompt_name=prompt_template.name
            ) from e
    return wrapped
```

A prompt kit with 4 tasks (classifier, summarizer, extractor, generator) shared across 3 projects reduced per-project integration time from 3 days to 4 hours. The 87% time reduction came from schema reuse and shared testing infrastructure.
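The chapter calls `parse_classification` without showing it. The following is one possible sketch, assuming the prompt is extended to request a JSON object with `label`, `confidence`, and `reasoning` keys. It uses a standard-library dataclass in place of the pydantic model so that it runs without third-party packages; the class name `Classification` and the error messages are illustrative.

```python
import json
from dataclasses import dataclass

@dataclass
class Classification:
    label: str
    confidence: float
    reasoning: str

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def parse_classification(raw_output: str) -> Classification:
    """Parse a JSON object from the model and reject anything off-schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")

    label = data.get("label")
    confidence = data.get("confidence")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unknown label: {label!r}")
    # bool is a subclass of int, so exclude it explicitly
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"Confidence out of range: {confidence!r}")
    return Classification(label, float(confidence), str(data.get("reasoning", "")))

# Hypothetical well-formed model output
ok = parse_classification('{"label": "negative", "confidence": 0.82, "reasoning": "Complaint about delivery."}')
```

A natural-language reply such as "I think this is negative." fails at the JSON step with a `ValueError`, so the inconsistency surfaces at the prompt boundary instead of as a type error in consuming code.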
- 23 Cross-Model Testing (20 min)

Prompt portability is an assumption, not a property. Each model may require task-specific prompt tuning even when tasks are identical.

```python
import anthropic
import numpy as np
import openai

class CrossModelTester:
    # Clients are created when the class is defined, so API keys must
    # already be available in the environment at import time.
    MODELS = {
        'claude': {
            'client': anthropic.Anthropic(),
            'model': 'claude-sonnet-4-20250514',
            'max_tokens': 1024
        },
        'gpt4o': {
            'client': openai.OpenAI(),
            'model': 'gpt-4o',
            'max_tokens': 1024
        },
        'deepseek': {
            'client': openai.OpenAI(base_url="https://api.deepseek.com"),
            'model': 'deepseek-chat',
            'max_tokens': 1024
        }
    }

    def __init__(self, prompts: dict):
        """
        Args:
            prompts: dict of {prompt_name: prompt_template}
        """
        self.prompts = prompts

    def score_output(self, output: str, expected: str) -> float:
        """Exact-match placeholder scorer; replace with a task-specific metric."""
        return float(output.strip() == expected.strip())

    def test_all_models(self, test_cases: list[dict]) -> dict:
        """Run all prompt/template combinations across all models."""
        results = {}
        for model_name, model_config in self.MODELS.items():
            results[model_name] = {}
            client = model_config['client']
            for prompt_name, prompt_template in self.prompts.items():
                scores = []
                for case in test_cases:
                    rendered = prompt_template.format(**case['input'])
                    if model_name == 'claude':
                        response = client.messages.create(
                            model=model_config['model'],
                            max_tokens=model_config['max_tokens'],
                            messages=[{'role': 'user', 'content': rendered}]
                        )
                        output = response.content[0].text
                    else:
                        response = client.chat.completions.create(
                            model=model_config['model'],
                            max_tokens=model_config['max_tokens'],
                            messages=[{'role': 'user', 'content': rendered}]
                        )
                        output = response.choices[0].message.content
                    score = self.score_output(output, case['expected'])
                    scores.append(score)
                results[model_name][prompt_name] = {
                    'avg_score': np.mean(scores),
                    'scores': scores
                }
        return results

    def recommendation_report(self, results: dict) -> str:
        """Generate model-task recommendation based on results."""
        report_lines = ["## Cross-Model Recommendation Report\n"]
        for prompt_name in self.prompts.keys():
            scores_by_model = {
                model: results[model][prompt_name]['avg_score']
                for model in results
            }
            best_model = max(scores_by_model, key=scores_by_model.get)
            best_score = scores_by_model[best_model]
            report_lines.append(f"\n### Prompt: {prompt_name}")
            report_lines.append(f"- Best model: {best_model} ({best_score:.2f})")
            for model, score in scores_by_model.items():
                delta = score - best_score
                report_lines.append(f"  - {model}: {score:.2f} ({delta:+.2f})")
        return "\n".join(report_lines)
```

**Failure mode:** Cross-model testing assumes model parity on input handling. Formatting tokens like `###` and markdown headers have different semantic weight across models. A prompt using markdown syntax may function as intended for GPT-4o but degrade to noise for Claude.

```python
# Model-specific formatting to normalize output across models
FORMAT_VARIANTS = {
    'claude': {
        'section_marker': '\n\nObservation:',
        'list_marker': '•',
        'conclusion_marker': '\n\nFinal Answer:'
    },
    'gpt4o': {
        'section_marker': '\n\n---',
        'list_marker': '-',
        'conclusion_marker': '\n\n**[FINAL]**'
    },
    'deepseek': {
        'section_marker': '\n\n[[SECTION]]',
        'list_marker': '*',
        'conclusion_marker': '\n\n[[ANSWER]]'
    }
}

def render_for_model(prompt_template, model_name, **kwargs):
    """Apply model-specific formatting to generic template."""
    # Merge into a new dict so the shared FORMAT_VARIANTS entry is not mutated
    format_config = {**FORMAT_VARIANTS.get(model_name, FORMAT_VARIANTS['gpt4o']), **kwargs}
    return prompt_template.format(**format_config)
```

Cross-model testing across 5 tasks revealed that optimal model varied by task: GPT-4o won on structured output tasks (4/5), Claude won on creative tasks (2/2 tested), and DeepSeek won on reasoning-heavy code tasks (3/5). This finding contradicts the assumption that a single best model exists.
- 24 Prompt Version Control (20 min)

Prompt version control fails without change attribution. Each modification requires a rationale in the commit message; otherwise the version history becomes unreadable.

```python
# Version control structure for prompts
# prompts/
# ├── archive/
# │   ├── v1.2_classifier_stable.txt
# │   ├── v1.3_classifier_added_constraints.txt
# │   └── v1.5_classifier_reverted_format.txt
# ├── active/
# │   ├── classifier.txt
# │   └── summarizer.txt
# └── CHANGELOG.md

# CHANGELOG.md format for prompts
"""
# Prompt Changelog

## classifier (active: v2.1)

### v2.1 - 2025-05-28
- Removed "be concise" directive causing under-generation
- Added entity type enumeration before generation
- Test accuracy: 0.91 (up from 0.87 in v2.0)
- Edge case: legal IDs now extracted correctly (was missing hyphen handling)

### v2.0 - 2025-05-15
- Changed from JSON to natural language output
- Rationale: User preference survey showed natural language preferred 3:1
- Test accuracy: 0.87 (down from 0.94 in v1.5)
- Edge case: Introduced regression on technical document extraction

### v1.5 - 2025-04-30
- Added explicit date format handling (ISO 8601)
- Test accuracy: 0.94
- Note: Format constraint caused 12% latency increase

### v1.0 - 2025-03-01
- Initial production prompt
- Test accuracy: 0.88
"""

import os
import subprocess

def commit_prompt(prompt_path, changelog_entry, test_results):
    """
    Commit updated prompt with diff and metadata.
    """
    # get_next_version() and read_prompt() are assumed project helpers.
    # Resolve the version once so the archive name and commit message agree.
    version = get_next_version()

    # Generate diff against HEAD
    diff = subprocess.check_output(
        ['git', 'diff', 'HEAD', prompt_path],
        text=True
    )

    # Create archive copy with version marker (file name only, no directories)
    archive_path = f'prompts/archive/v{version}_{os.path.basename(prompt_path)}'
    with open(archive_path, 'w') as f:
        f.write(read_prompt(prompt_path))

    # Stage changes
    subprocess.run(['git', 'add', prompt_path, 'prompts/archive/'], check=True)

    # Commit with structured message
    commit_msg = f"""Update {prompt_path}

Version: {version}
Test accuracy: {test_results['avg_correctness']:.2f}
Latency (ms): {test_results['avg_latency_ms']:.0f}

{changelog_entry}

Diff:
{diff}"""
    subprocess.run(['git', 'commit', '-m', commit_msg], check=True)
```

**Failure mode:** Prompt changes without corresponding test updates produce false confidence. A prompt updated with "improved instructions" but run against outdated test cases may report artificially high accuracy that reflects test set memorization, not genuine improvement.

```python
import os
import warnings

def audit_test_currency(test_set_path, prompt_path):
    """Verify test set is newer than most recent prompt update."""
    test_mtime = os.path.getmtime(test_set_path)
    prompt_mtime = os.path.getmtime(prompt_path)
    if prompt_mtime > test_mtime:
        warnings.warn(
            f"Tests ({test_mtime}) are older than prompt ({prompt_mtime}). "
            f"Results may not reflect current prompt behavior.",
            UserWarning
        )
        return False
    return True
```

Teams moving from ad-hoc prompt management to version-controlled prompts reported 60% reduction in production incidents caused by undocumented prompt changes. The 40% of incidents not prevented involved multi-variable interactions that still escaped single-prompt diff tracking.
- 25 Final Project: Prompt Framework

A prompt framework succeeds when it makes the right choice obvious and the wrong choice impossible. Constraints codified in code outperform conventions documented in text.

### Framework Architecture

The framework consists of five interconnected modules:

```python
# framework/
# ├── __init__.py
# ├── core/
# │   ├── template.py    # Template management
# │   ├── schema.py      # Input/output validation
# │   └── render.py      # Multi-model rendering
# ├── testing/
# │   ├── harness.py     # Evaluation infrastructure
# │   ├── ab_test.py     # A/B testing integration
# │   └── optimizer.py   # Automated improvement
# ├── deployment/
# │   ├── router.py      # Model routing
# │   └── monitor.py     # Production monitoring
# └── cli.py             # Command-line interface

# core/template.py
# schema_validator, resolve_model, and render_for_model are assumed to be
# provided by core/schema.py and core/render.py.
class PromptTemplate:
    """Production-ready prompt template."""

    def __init__(self, name, template_str, input_schema, output_schema):
        self.name = name
        self.template = template_str
        self.input_schema = schema_validator(input_schema)
        self.output_schema = schema_validator(output_schema)
        self.models = []  # Model compatibility list
        self.metadata = {}

    def register_model(self, model_name, model_config):
        """Register model-specific rendering."""
        self.models.append({
            'name': model_name,
            'config': model_config,
            'format_variant': model_config.get('format_variant', 'default')
        })

    def render(self, model_name=None, **kwargs):
        """Render for specific model or default."""
        validated_input = self.input_schema.validate(kwargs)
        model = self.resolve_model(model_name)
        return render_for_model(self.template, model['format_variant'], **validated_input)
```

### Schema-Based Validation

The framework enforces input/output schemas to guarantee production compatibility:

```python
# core/schema.py
from pydantic import BaseModel, Field, ValidationError
from typing import Generic, TypeVar, Literal

T = TypeVar('T')

class PromptSchema(Generic[T]):
    """Schema wrapper that adds prompt-specific validation."""

    def __init__(self, model_cls):
        self.model_cls = model_cls

    def validate(self, data: dict) -> T:
        """Validate and return typed instance."""
        instance = self.model_cls(**data)
        self._validate_prompt_constraints(instance)
        return instance

    def _validate_prompt_constraints(self, instance):
        """Hook for prompt-specific validation rules."""
        pass

class DocumentInputModel(BaseModel):
    """Standard input for document processing tasks."""
    text: str = Field(min_length=10, max_length=50000)
    modality: Literal['legal', 'technical', 'casual'] = 'casual'
    language: str = Field(default='en', pattern=r'^[a-z]{2}$')
    priority: Literal['low', 'normal', 'high'] = 'normal'

# Wrap the pydantic model so validate() can be called on an instance
DocumentInput = PromptSchema(DocumentInputModel)

# Validation catches errors before model call
try:
    validated = DocumentInput.validate({
        'text': 'Short',  # Too short
        'modality': 'legal'
    })
except ValidationError as e:
    print(e)  # Error raised before API call
```

### Testing Infrastructure

The testing module evaluates templates across models with statistical rigor:

```python
# testing/harness.py
from pathlib import Path

import numpy as np
import yaml

class EvaluationHarness:
    # _collect_samples, _compute_consensus, and _score are not shown here

    def __init__(self, tests_dir='tests/fixtures'):
        self.tests_dir = Path(tests_dir)
        self.results_cache = {}

    def load_test_cases(self, prompt_name):
        """Load test cases from fixtures directory."""
        path = self.tests_dir / f'{prompt_name}.yaml'
        if path.exists():
            return yaml.safe_load(path.read_text())['cases']
        return []

    def evaluate(self, template, model_client, n_samples=5):
        """Statistical evaluation with confidence intervals."""
        cases = self.load_test_cases(template.name)
        results = {'cases': []}
        for case in cases:
            samples = self._collect_samples(template, model_client, case, n_samples)
            consensus = self._compute_consensus(samples)
            results['cases'].append({
                'input': case['input'],
                'expected': case['expected'],
                'samples': samples,
                'consensus': consensus,
                'consensus_correct': self._score(consensus, case['expected'])
            })
        results['summary'] = self._summarize(results['cases'])
        return results

    def _summarize(self, cases):
        """Compute aggregate metrics with confidence intervals."""
        scores = [c['consensus_correct'] for c in cases]
        return {
            'n': len(cases),
            'mean': np.mean(scores),
            'std': np.std(scores),
            'p5': np.percentile(scores, 5),
            'p95': np.percentile(scores, 95),
            'ci95_lower': np.mean(scores) - 1.96 * np.std(scores) / np.sqrt(len(scores)),
            'ci95_upper': np.mean(scores) + 1.96 * np.std(scores) / np.sqrt(len(scores))
        }
```

### Deployment Router

Routing selects the optimal model per request:

```python
# deployment/router.py
class PromptRouter:
    """Route requests to optimal model based on task and model capabilities."""

    def __init__(self, model_registry):
        self.registry = model_registry
        self.routing_rules = []

    def add_rule(self, condition_fn, model_name, priority=0):
        """Register routing rule with condition function."""
        self.routing_rules.append({
            'condition': condition_fn,
            'model': model_name,
            'priority': priority
        })
        self.routing_rules.sort(key=lambda r: r['priority'], reverse=True)

    def route(self, template, input_data):
        """Select optimal model for this template+input combination."""
        for rule in self.routing_rules:
            if rule['condition'](template, input_data):
                return rule['model']
        # Default: use template's first registered model
        if template.models:
            return template.models[0]['name']
        return self._fallback_model()

    def _fallback_model(self):
        """Return most reliable fallback model."""
        return 'gpt4o'  # Configured via environment

# Example routing rules
router = PromptRouter(model_registry)
router.add_rule(
    condition_fn=lambda t, i: i.get('priority') == 'high',
    model_name='claude',
    priority=100
)
router.add_rule(
    condition_fn=lambda t, i: 'code' in t.name or 'code' in i.get('text', ''),
    model_name='deepseek',
    priority=80
)
router.add_rule(
    condition_fn=lambda t, i: t.name == 'summarizer' and len(i.get('text', '')) > 5000,
    model_name='gpt4o',
    priority=50
)
```

### Integration and Testing

Assemble the framework and run the test suite:

```python
# Full integration test
def test_framework_integration():
    """End-to-end test of framework lifecycle."""
    # 1.
Create template with schemas summarizer = PromptTemplate( name='document_summarizer', template=SUMMARIZER_TEMPLATE, input_schema=DocumentInput, output_schema=SummaryOutput ) summarizer.register_model('claude', CLAUDE_CONFIG) summarizer.register_model('gpt4o', GPT4O_CONFIG) summarizer.register_model('deepseek', DEEPSEEK_CONFIG) # 2. Add routing rules router.add_rule( condition_fn=lambda t, i: i.get('modality') == 'technical', model_name='deepseek', priority=70 ) # 3. Evaluate across models harness = EvaluationHarness() results = harness.evaluate(summarizer, model_registry) assert results['summary']['mean'] > 0.85, "Accuracy below threshold" assert results['summary']['p5'] > 0.70, "Bottom 5% below acceptable" # 4. Deploy router.register(summarizer) deployment = DeploymentManager(router, monitor) deployment.deploy('document_summarizer', canary=0.05) return True # Run full test test_framework_integration() ```30 min
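The harness's `_summarize` builds its 95% interval from the mean plus or minus 1.96 standard errors. When per-case scores are pass/fail and the fixture set is small, that normal approximation can produce bounds outside [0, 1]. The sketch below compares it with the Wilson score interval, a different technique swapped in here for illustration. The function names are hypothetical and the code assumes binary scores; it is not part of the framework above.

```python
import math


def normal_interval(passes, n, z=1.96):
    """Normal-approximation interval for a pass rate (can leave [0, 1])."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate, clamped to [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against floating-point overshoot at p = 0 or p = 1
    return max(0.0, center - half), min(1.0, center + half)
```

For 18 passes out of 20 cases, `wilson_interval(18, 20)` gives roughly 0.70 to 0.97, while `normal_interval(18, 20)` has an upper bound above 1. With a fixture set that small, a threshold such as `mean > 0.85` is probably better read against the lower bound than against the point estimate.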