19. Prompt Evaluation
Chapter 19 of 25 · 15 min
Evaluating prompts requires metrics beyond accuracy. A prompt may score well on average accuracy while still being unreliable across repeated runs, slow, or brittle under input variation. Production evaluation therefore tracks multiple dimensions, such as correctness, consistency, latency, and robustness.
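As a rough illustration of tracking more than one dimension, the sketch below runs a single test case several times and records both how often the answer was correct and how long each call took. Everything here is an assumption for illustration: `run_prompt` stands for whatever function calls your model, `fake_model` is a deterministic stand-in so the example runs offline, and exact string matching is only one possible correctness check.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaseResult:
    name: str
    pass_rate: float        # fraction of repeated runs that matched the expected answer
    mean_latency_s: float   # average wall-clock seconds per run


def evaluate_case(
    run_prompt: Callable[[str], str],
    name: str,
    prompt_input: str,
    expected: str,
    repeats: int = 5,
) -> CaseResult:
    """Run one test case `repeats` times and summarize correctness and latency."""
    passes = 0
    total_seconds = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        output = run_prompt(prompt_input)
        total_seconds += time.perf_counter() - start
        # Hypothetical correctness check: case-insensitive exact match.
        if output.strip().lower() == expected.strip().lower():
            passes += 1
    return CaseResult(name, passes / repeats, total_seconds / repeats)


def fake_model(text: str) -> str:
    """Stand-in for a real model call (assumption): a case-sensitive keyword rule."""
    return "positive" if "good" in text else "negative"


# Two cases that differ only in capitalization, to probe brittleness.
baseline = evaluate_case(fake_model, "baseline", "good movie", "positive")
variant = evaluate_case(fake_model, "uppercase variant", "GOOD movie", "positive")

for result in (baseline, variant):
    print(f"{result.name}: pass_rate={result.pass_rate:.2f}, "
          f"mean_latency_s={result.mean_latency_s:.6f}")
```

With this stand-in, the baseline case passes on every run while the capitalized variant fails on every run. A single accuracy number over a narrow test set would hide that kind of brittleness.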
EXERCISE
Build an evaluation harness for your most-used prompt. Create 50 test cases, including edge cases, run the evaluation, and document which cases fail and why. Report p5 and p95 correctness alongside the average.
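One possible starting point for the reporting step is sketched below. It assumes each test case has already been reduced to a correctness score between 0 and 1 (for example, its pass rate over repeated runs), and it uses the nearest-rank definition of a percentile; other definitions interpolate and can give slightly different values. The names `percentile`, `summarize`, and the case labels are hypothetical.

```python
import math
from typing import Dict, List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(scores: Dict[str, float]) -> dict:
    """Summarize per-case correctness scores (assumed to lie in [0, 1])."""
    values = list(scores.values())
    return {
        "mean": sum(values) / len(values),
        "p5": percentile(values, 5),
        "p95": percentile(values, 95),
        # Cases that did not pass on every run; these are the ones to document.
        "failing": sorted(name for name, score in scores.items() if score < 1.0),
    }


# Made-up scores for 50 hypothetical cases: 47 always pass, 3 always fail.
scores = {f"case_{i:02d}": 1.0 for i in range(47)}
scores.update({"empty_input": 0.0, "very_long_input": 0.0, "mixed_language": 0.0})

report = summarize(scores)
print(report)
```

In this made-up data the average looks healthy, yet the p5 value exposes the cases that fail outright. That gap is why the exercise asks for the tail percentiles and not only the average.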