19. Prompt Evaluation

Chapter 19 of 25 · 15 min

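Evaluating prompts requires metrics beyond accuracy. A prompt may produce correct answers on most runs and still be unreliable, slow, or brittle under input variation. Production evaluation therefore tracks several dimensions at once, such as correctness, consistency across repeated runs, latency, and robustness to varied inputs.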
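As a rough illustration of tracking more than one dimension, the sketch below runs each test case several times and records correctness, run-to-run consistency, and latency. It assumes exact-match scoring against an expected string. The names `evaluate`, `run_prompt`, and `fake_model` are hypothetical, and `fake_model` is a stand-in for a real model call rather than any particular API.

```python
import statistics
import time
from typing import Callable


def evaluate(run_prompt: Callable[[str], str],
             cases: list[tuple[str, str]],
             repeats: int = 3) -> dict:
    """Run every (input, expected) case `repeats` times and summarize."""
    per_case_correct = []   # fraction of correct runs for each case
    latencies = []          # wall-clock seconds for every single run
    consistent_cases = 0    # cases where all runs gave the same output

    for text, expected in cases:
        outputs = []
        for _ in range(repeats):
            start = time.perf_counter()
            outputs.append(run_prompt(text).strip())
            latencies.append(time.perf_counter() - start)
        # Exact match is an assumption; swap in a task-specific scorer.
        per_case_correct.append(sum(o == expected for o in outputs) / repeats)
        if len(set(outputs)) == 1:
            consistent_cases += 1

    return {
        "accuracy": statistics.mean(per_case_correct),
        "consistency": consistent_cases / len(cases),
        "mean_latency_s": statistics.mean(latencies),
        "per_case": per_case_correct,
    }


def fake_model(text: str) -> str:
    # Hypothetical stand-in for a model call: it just uppercases the input.
    return text.upper()


cases = [("hello", "HELLO"), ("  spaced ", "SPACED"), ("", "EMPTY")]
report = evaluate(fake_model, cases)
print(report)
```

Keeping the per-case scores in the report, rather than only the mean, makes it possible to see which inputs drag the average down.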

EXERCISE

Build an evaluation harness for your most-used prompt. Create 50 test cases that include edge cases, run the evaluation, and document which cases fail and why. Report p5 and p95 (5th and 95th percentile) correctness alongside the average.
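One possible way to produce the percentile report is sketched below. It assumes each test case already has a correctness score between 0 and 1 (for example, the fraction of repeated runs that passed). The `summarize` helper and the sample scores are made up for illustration. Percentiles come from the standard library's `statistics.quantiles` with 20 cut points, so the first cut is p5 and the last is p95.

```python
import statistics


def summarize(scores: dict[str, float]) -> dict:
    """Summarize per-case correctness scores (hypothetical helper)."""
    values = list(scores.values())
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "p5": cuts[0],
        "p95": cuts[-1],
        # Cases that never passed are listed for manual failure analysis.
        "always_failing": sorted(k for k, v in scores.items() if v == 0.0),
    }


# Illustrative scores for 50 cases: 45 always pass, 3 are flaky, 2 always fail.
scores = {f"case_{i:02d}": 1.0 for i in range(45)}
scores.update({f"case_{i:02d}": 0.5 for i in range(45, 48)})
scores.update({f"case_{i:02d}": 0.0 for i in range(48, 50)})

summary = summarize(scores)
print(summary)
```

A low p5 next to a high average points to a small group of cases that fail consistently, which is where the "why" part of the write-up is likely to be most useful.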