Prompt Evaluation: How to Measure and Score Prompt Quality

"It looks better to me" is not an engineering claim. Prompt evaluation is how you tell whether a change actually improved things, kept them the same, or quietly broke them. The teams that ship reliable AI are the teams that built an evaluation habit early.

1. Introduction

Prompts are easy to write and surprisingly hard to measure. The same prompt can look perfect on the three examples you tried by hand and fail systematically on a fourth case you never thought of. Worse, the failure can be slow — a prompt that worked great in March quietly degrades by July because the kind of inputs users send has shifted. Without measurement, you discover the regression from a customer complaint.

This tutorial covers the four building blocks of prompt evaluation: the eval set, the metric, the scoring method, and the cadence at which you re-run everything. None of these are exotic. The bar for "good enough" is much lower than people expect — but having something is dramatically better than having nothing.

2. The Concept Explained

An evaluation is just running your prompt against a known set of inputs and scoring the outputs. The three big design choices are what to measure, who measures it, and what counts as good.

The four pillars

Eval set. 30–200 representative inputs, ideally drawn from real production traffic. Each input is paired with the "right" answer or with the criteria for a right answer.
Metric. Exact match for classification, ROUGE/BLEU for text similarity, schema-validity for structured outputs, rubric scores for open-ended work.
Scorer. Either deterministic code (when the right answer is checkable), human graders (slow and expensive but the gold standard), or LLM-as-judge (cheap, fast, and reliable enough for most tasks once its scores have been calibrated against human grades).
Cadence. Run on every prompt change, on every model upgrade, and on a calendar (weekly or monthly) to catch drift.

Figure: A prompt evaluation grid. Each input flows through the prompt, the outputs are scored on multiple metrics, and the resulting grid is compared to a baseline.
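As a rough illustration of the deterministic end of this grid, the sketch below scores each output on two code-checkable metrics: schema validity and exact match. It assumes a classification prompt that returns a JSON object with a "label" key. The names `score_grid`, `has_required_keys`, and `fake_prompt` are hypothetical, and `fake_prompt` stands in for a real model call.

```python
import json
from typing import Callable


def exact_match(output: str, expected: str) -> bool:
    """Classification check: labels match after trimming and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


def has_required_keys(output: str, required_keys: set[str]) -> bool:
    """Schema-validity check: output parses as a JSON object with the keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def score_grid(cases: list[dict], run_prompt: Callable[[str], str]) -> list[dict]:
    """Build one row per eval input and one column per metric."""
    rows = []
    for case in cases:
        output = run_prompt(case["input"])
        valid = has_required_keys(output, {"label"})
        # Only read the label when the output passed the validity check.
        label = json.loads(output)["label"] if valid else ""
        rows.append({
            "input": case["input"],
            "valid": valid,
            "correct": exact_match(str(label), case["expected"]),
        })
    return rows


def fake_prompt(text: str) -> str:
    """Hypothetical stand-in for the real prompt plus model call."""
    if "refund" in text:
        return '{"label": "Billing"}'
    return "not json"


cases = [
    {"input": "I want a refund", "expected": "billing"},
    {"input": "App crashes on login", "expected": "bug"},
]
grid = score_grid(cases, fake_prompt)
for row in grid:
    print(row)
```

Storing one row per input, rather than only the averages, makes it possible to diff a new run against the baseline case by case.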

3. The Problem Without Evaluation

Without evaluation, every prompt change is an act of faith. You make a tweak that feels better on three examples, ship it, and either (a) something quietly regresses for a different class of input that you don't currently look at, or (b) it really did improve things, but you cannot prove it and the next reviewer rolls it back.

"Vibes-based" iteration

// Engineer in chat:
// "I changed the prompt to add 'be concise'. Looks better
//  in the three tickets I tried. Shipping it."

Six weeks later, customer support notices the assistant has started cutting off important details in long tickets. Nobody connects it to the "be concise" change. The fix takes a week of detective work.

4. The Solution: A Lightweight Eval Harness

LLM-as-judge rubric (excerpt)

You are grading a customer-support assistant's replies.
For each (ticket, reply) pair, score on three axes:

1. Correctness (1-5)
   5 = facts and policy match the ticket exactly.
   1 = the reply contradicts the ticket or invents policy.

2. Completeness (1-5)
   5 = every customer question is addressed.
   1 = key questions are ignored.

3. Tone (1-5)
   5 = warm, professional, empathetic.
   1 = curt, dismissive, robotic.

Return JSON: {
  "correctness": int, "completeness": int, "tone": int,
  "reasoning": "one short paragraph"
}

Ticket: """{{ticket}}"""
Reply:  """{{reply}}"""

Run this judge prompt across your eval set. Average the scores per axis. Compare to the baseline for the previous prompt version. Ship only when scores hold or improve.
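The averaging and comparison steps can be sketched as follows. This is a minimal example, assuming the judge returns exactly the JSON shape requested in the rubric. The function names, the `tolerance` parameter, and the sample replies are illustrative and not part of any library.

```python
import json
from statistics import mean

AXES = ("correctness", "completeness", "tone")


def parse_judge_reply(raw: str) -> dict[str, int]:
    """Parse one judge reply and check every axis is an integer from 1 to 5."""
    data = json.loads(raw)
    scores = {}
    for axis in AXES:
        value = data[axis]
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"bad score for {axis}: {value!r}")
        scores[axis] = value
    return scores


def average_per_axis(judge_replies: list[str]) -> dict[str, float]:
    """Average the judge's scores per axis across the whole eval set."""
    parsed = [parse_judge_reply(reply) for reply in judge_replies]
    return {axis: mean(p[axis] for p in parsed) for axis in AXES}


def holds_or_improves(candidate: dict, baseline: dict, tolerance: float = 0.0) -> bool:
    """Ship gate: no axis may fall below the baseline by more than tolerance."""
    return all(candidate[axis] >= baseline[axis] - tolerance for axis in AXES)


# Illustrative judge replies, not real data.
replies = [
    '{"correctness": 5, "completeness": 4, "tone": 4, "reasoning": "..."}',
    '{"correctness": 4, "completeness": 4, "tone": 5, "reasoning": "..."}',
]
candidate = average_per_axis(replies)
baseline = {"correctness": 4.0, "completeness": 4.0, "tone": 4.0}
print(candidate, holds_or_improves(candidate, baseline))
```

Rejecting malformed judge replies loudly, instead of silently skipping them, keeps a broken judge prompt from inflating the averages.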

5. Step-by-Step Breakdown

Build the eval set from real traffic. Sample 30–200 inputs you actually saw in production. Add a handful of "evil" inputs: malformed, multilingual, ambiguous. The richer the set, the more it catches.
Define the metric per task. Classification → accuracy. Extraction → field-level precision and recall. Open-ended writing → rubric scores. Many tasks need 2–3 metrics in combination.
Pick a scorer per metric. Use deterministic code wherever possible — it is free and exactly repeatable. Use LLM-as-judge for the parts code cannot grade. Reserve humans for the highest-stakes outputs and as a periodic sanity check on your LLM judge.
Calibrate the LLM judge. Grade 50 outputs by hand. Then run the LLM judge on the same outputs. Adjust the judge's prompt until its scores agree with yours roughly 80% of the time. A miscalibrated judge is worse than no judge.
Track multiple axes. A prompt that improves accuracy but doubles latency and triples token cost is not actually better. Always look at accuracy, validity, latency, and cost together.
Run on every change. Wire evaluation into your prompt-library workflow. No prompt change ships without a green eval run.
Re-run on a calendar. Model providers update underlying models. Customer inputs evolve. A scheduled weekly or monthly re-run catches drift before users do.
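For the extraction metric mentioned in the steps above, field-level precision and recall might be computed along these lines. This is a sketch under two assumptions: both the model output and the label are flat dictionaries of field name to value, and a field counts as correct only when its value matches exactly. The function name and the sample fields are hypothetical.

```python
def field_precision_recall(predicted: dict, expected: dict) -> tuple[float, float]:
    """Return (precision, recall) for one extraction output.

    Precision: share of predicted fields whose value is correct.
    Recall: share of expected fields that were recovered correctly.
    Empty inputs score 0.0 here, which is a convention chosen for this sketch.
    """
    correct = sum(
        1 for key, value in predicted.items()
        if key in expected and expected[key] == value
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Illustrative example: one right field, one wrong value, one invented field.
predicted = {"order_id": "A123", "email": "x@example.com", "phone": "555"}
expected = {"order_id": "A123", "email": "y@example.com"}
precision, recall = field_precision_recall(predicted, expected)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

Reporting the two numbers separately shows whether a prompt change made the model invent fields (precision drops) or miss them (recall drops).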

Tip: Keep the eval set small enough that you actually run it. A 50-input set that runs every PR beats a 5,000-input set that runs once a quarter. You can always grow the set later; you cannot retroactively add the months of discipline you skipped.

6. Practice Exercises

Exercise 1

For one prompt in your stack, build a 30-input eval set by sampling real (anonymised) traffic. For each input, write down what the right output should look like. This act of labelling is half of evaluation.

Exercise 2

Write an LLM-as-judge prompt with a clear 1–5 rubric on at least one axis. Grade ten outputs by hand. Run the judge on the same ten. Compare. Tune the rubric until human and judge agree most of the time.
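One possible starting point for the comparison step is a plain agreement rate between your grades and the judge's. What counts as "agree" is a choice: the sketch below treats it as an exact match by default, with an optional `tolerance` for counting scores within one point. The function name and the sample grades are illustrative.

```python
def agreement_rate(human: list[int], judge: list[int], tolerance: int = 0) -> float:
    """Share of outputs where the judge's score is within tolerance of the human's."""
    if len(human) != len(judge):
        raise ValueError("human and judge must grade the same outputs")
    if not human:
        raise ValueError("need at least one graded output")
    matches = sum(1 for h, j in zip(human, judge) if abs(h - j) <= tolerance)
    return matches / len(human)


# Illustrative grades for ten outputs on one 1-5 axis.
human_scores = [5, 4, 3, 5, 2, 4, 4, 1, 5, 3]
judge_scores = [5, 4, 4, 5, 2, 3, 4, 1, 5, 3]

exact = agreement_rate(human_scores, judge_scores)
within_one = agreement_rate(human_scores, judge_scores, tolerance=1)
print(f"exact agreement {exact:.2f}, within one point {within_one:.2f}")
```

Looking at the individual disagreements, not just the rate, usually shows which part of the rubric wording to tighten.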

Exercise 3

Build a script that runs your prompt against the eval set, scores each output, and prints a one-line summary ("accuracy 0.84, validity 1.00, p95 latency 1.7s, cost £0.04/run"). Aim for "one command, one line to look at", because that is what gets used in practice.
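A minimal skeleton for such a script might look like the following. It assumes you supply `run_prompt` (your model call), an `is_valid` check, and a cost figure worked out from your provider's pricing. All of these names are hypothetical, and the p95 uses the simple nearest-rank method.

```python
import time
from typing import Callable


def p95(values: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def run_eval(cases: list[dict], run_prompt: Callable[[str], str],
             is_valid: Callable[[str], bool]) -> list[dict]:
    """Run every case once, recording correctness, validity, and latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output = run_prompt(case["input"])
        latency_s = time.perf_counter() - start
        results.append({
            "correct": output.strip() == case["expected"],
            "valid": is_valid(output),
            "latency_s": latency_s,
        })
    return results


def summarise(results: list[dict], cost_per_run: float) -> str:
    """Collapse the per-case results into the one-line summary."""
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    validity = sum(r["valid"] for r in results) / n
    latency = p95([r["latency_s"] for r in results])
    return (f"accuracy {accuracy:.2f}, validity {validity:.2f}, "
            f"p95 latency {latency:.1f}s, cost £{cost_per_run:.2f}/run")


# Hand-built results so the example runs without a model call.
sample_results = [
    {"correct": True, "valid": True, "latency_s": 0.5},
    {"correct": True, "valid": True, "latency_s": 0.9},
    {"correct": True, "valid": True, "latency_s": 1.2},
    {"correct": False, "valid": True, "latency_s": 1.7},
]
summary = summarise(sample_results, cost_per_run=0.04)
print(summary)
```

Keeping `run_eval` separate from `summarise` means the same per-case results can later feed a baseline comparison without re-running the model.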

7. Key Takeaways

"It looks better" is not evidence. A real eval set, a real metric, and a real baseline are what tell you whether a prompt change is an improvement.
The four pillars are: eval set, metric, scorer, cadence. Skipping any of them undermines the whole thing.
LLM-as-judge is cheap, fast, and usable once calibrated against human grades. Calibration is the step most teams skip.
Always evaluate multiple axes — accuracy alone hides regressions in latency, cost, and tone.
Small eval sets that you actually run beat enormous ones that gather dust.
