"It looks better to me" is not an engineering claim. Prompt evaluation is how you tell whether a change actually improved things, kept them the same, or quietly broke them. The teams that ship reliable AI are the teams that built an evaluation habit early.
Prompts are easy to write and surprisingly hard to measure. The same prompt can look perfect on the three examples you tried by hand and fail systematically on a fourth case you never thought of. Worse, the failure can be slow — a prompt that worked great in March quietly degrades by July because the kind of inputs users send has shifted. Without measurement, you discover the regression from a customer complaint.
This tutorial covers the four building blocks of prompt evaluation: the eval set, the metric, the scoring method, and the cadence at which you re-run everything. None of these are exotic. The bar for "good enough" is much lower than people expect — but having something is dramatically better than having nothing.
An evaluation is just running your prompt against a known set of inputs and scoring the outputs. The three big design choices are what to measure, who measures it, and what counts as good.
Without evaluation, every prompt change is an act of faith. You make a tweak that feels better on three examples, ship it, and either (a) something quietly regresses for a different class of input that you don't currently look at, or (b) it really did improve things, but you cannot prove it and the next reviewer rolls it back.
"Vibes-based" iteration
// Engineer in chat:
// "I changed the prompt to add 'be concise'. Looks better
// in the three tickets I tried. Shipping it."
Six weeks later, customer support notices the assistant has started cutting off important details in long tickets. Nobody connects it to the "be concise" change. The fix takes a week of detective work.
LLM-as-judge rubric (excerpt)
You are grading a customer-support assistant's replies.
For each (ticket, reply) pair, score on three axes:
1. Correctness (1-5)
   5 = facts and policy match the ticket exactly.
   1 = the reply contradicts the ticket or invents policy.
2. Completeness (1-5)
   5 = every customer question is addressed.
   1 = key questions are ignored.
3. Tone (1-5)
   5 = warm, professional, empathetic.
   1 = curt, dismissive, robotic.
Return JSON: {
  "correctness": int, "completeness": int, "tone": int,
  "reasoning": "one short paragraph"
}
Ticket: """{{ticket}}"""
Reply: """{{reply}}"""
Run this judge prompt across your eval set. Average the scores per axis. Compare to the baseline for the previous prompt version. Ship only when scores hold or improve.
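The averaging and baseline comparison might look something like the sketch below. It assumes the judge returns the JSON shape requested in the rubric; the function names (`parse_judge_reply`, `mean_scores`, `holds_or_improves`), the `tolerance` parameter, and all the sample scores are illustrative assumptions, not part of any particular library.

```python
import json
from statistics import mean

AXES = ("correctness", "completeness", "tone")

def parse_judge_reply(raw: str) -> dict:
    """Parse one judge reply and check each axis is an integer from 1 to 5."""
    data = json.loads(raw)
    for axis in AXES:
        score = data[axis]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{axis} must be an integer 1-5, got {score!r}")
    return data

def mean_scores(judgements: list) -> dict:
    """Average the judge scores per axis across the eval set."""
    return {axis: mean(j[axis] for j in judgements) for axis in AXES}

def holds_or_improves(candidate: dict, baseline: dict, tolerance: float = 0.0) -> bool:
    """True only if no axis drops below the baseline by more than `tolerance`."""
    return all(candidate[a] >= baseline[a] - tolerance for a in AXES)

# Made-up judge replies for two eval inputs (illustration only).
raw_replies = [
    '{"correctness": 5, "completeness": 4, "tone": 5, "reasoning": "..."}',
    '{"correctness": 4, "completeness": 2, "tone": 4, "reasoning": "..."}',
]
candidate = mean_scores([parse_judge_reply(r) for r in raw_replies])

# Made-up per-axis averages recorded for the previous prompt version.
baseline = {"correctness": 4.5, "completeness": 4.0, "tone": 4.0}

print(candidate)
print("ship" if holds_or_improves(candidate, baseline) else "hold")
```

Checking every axis separately, rather than one overall average, is a deliberate choice here: a gain in tone should not be allowed to hide a drop in completeness, which is exactly the kind of regression the "be concise" story describes.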
Tip: Keep the eval set small enough that you actually run it. A 50-input set that runs every PR beats a 5,000-input set that runs once a quarter. You can always grow the set later; you cannot retroactively add the months of discipline you skipped.
For one prompt in your stack, build a 30-input eval set by sampling real (anonymised) traffic. For each input, write down what the right output should look like. This act of labelling is half of evaluation.
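One way to draw the sample reproducibly is sketched below. It assumes the traffic has already been anonymised and loaded as a list of strings; `sample_eval_inputs`, `to_eval_rows`, the row layout, and the placeholder `traffic` list are hypothetical names chosen for illustration.

```python
import random

def sample_eval_inputs(records: list, k: int = 30, seed: int = 0) -> list:
    """Pick k distinct records; a fixed seed makes the sample repeatable."""
    if len(records) < k:
        raise ValueError(f"need at least {k} records, got {len(records)}")
    return random.Random(seed).sample(records, k)

def to_eval_rows(inputs: list) -> list:
    """Wrap each input in a row with an empty 'expected' field to label by hand."""
    return [{"id": i, "input": text, "expected": ""} for i, text in enumerate(inputs)]

# Placeholder standing in for real, already-anonymised traffic.
traffic = [f"ticket text {n}" for n in range(500)]

eval_rows = to_eval_rows(sample_eval_inputs(traffic))
print(len(eval_rows), eval_rows[0])
```

The empty `expected` field is the part that takes real effort: each row still needs a human to write down what a good output looks like.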
Write an LLM-as-judge prompt with a clear 1–5 rubric on at least one axis. Grade ten outputs by hand. Run the judge on the same ten. Compare. Tune the rubric until human and judge agree most of the time.
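The human-versus-judge comparison can be reduced to a simple agreement rate. The sketch below is one possible way to compute it; the `agreement` and `disagreements` helpers and the ten pairs of scores are made up for illustration, and "agree most of the time" is left for you to define as a threshold.

```python
def agreement(human: list, judge: list, within: int = 0) -> float:
    """Fraction of items where the two scores differ by at most `within` points."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(1 for h, j in zip(human, judge) if abs(h - j) <= within)
    return hits / len(human)

def disagreements(human: list, judge: list, within: int = 1) -> list:
    """Indices of the items worth re-reading when tuning the rubric."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if abs(h - j) > within]

# Made-up 1-5 scores for the same ten outputs.
human_scores = [5, 4, 2, 3, 5, 1, 4, 4, 3, 5]
judge_scores = [5, 4, 3, 3, 4, 1, 4, 5, 3, 2]

print("exact agreement:", agreement(human_scores, judge_scores))
print("within one point:", agreement(human_scores, judge_scores, within=1))
print("re-read items:", disagreements(human_scores, judge_scores))
```

Items flagged by `disagreements` are usually the most useful input for tuning: they show where the rubric wording lets the judge read a case differently from you.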
Build a script that runs your prompt against the eval set, scores each output, and prints a one-line summary ("accuracy 0.84, validity 1.00, p95 latency 1.7s, cost £0.04/run"). Aim for "one command, one line to look at": that is what gets used in practice.
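A minimal sketch of such a script follows. It assumes a classification-style prompt that should return JSON with a `category` field; `call_model` is a hypothetical stub standing in for your real prompt and model call, and the four-row `EVAL_SET` is invented for the example. Cost is left out because it depends on your provider's pricing.

```python
import json
import math
import time

def call_model(ticket: str) -> str:
    """Hypothetical stand-in for the real prompt + model call."""
    label = "refund" if "refund" in ticket.lower() else "other"
    return json.dumps({"category": label})

# Tiny made-up eval set; a real one would hold labelled production samples.
EVAL_SET = [
    {"input": "I want a refund for my order", "expected": "refund"},
    {"input": "How do I reset my password?", "expected": "other"},
    {"input": "Refund still not received", "expected": "refund"},
    {"input": "Please give me my money back", "expected": "refund"},
]

def p95(values: list) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def run_eval(eval_set: list, model=call_model) -> dict:
    correct = valid = 0
    latencies = []
    for row in eval_set:
        start = time.perf_counter()
        raw = model(row["input"])
        latencies.append(time.perf_counter() - start)
        try:
            category = json.loads(raw)["category"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # output is not valid JSON in the expected shape
        valid += 1
        correct += category == row["expected"]
    n = len(eval_set)
    return {
        "accuracy": correct / n,
        "validity": valid / n,
        "p95_latency_s": p95(latencies),
    }

summary = run_eval(EVAL_SET)
print(
    f"accuracy {summary['accuracy']:.2f}, "
    f"validity {summary['validity']:.2f}, "
    f"p95 latency {summary['p95_latency_s']:.4f}s"
)
```

With the stub above, the fourth ticket is misclassified because it never uses the word "refund", which is the kind of miss a hand-picked set of three examples tends not to surface.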