Evaluation tells you how good one prompt is. A/B testing tells you which of two prompts is better. The technique borrows directly from web experimentation — but prompts have their own quirks, and several common A/B testing mistakes can leave you confidently shipping the worse prompt.
Imagine you have two versions of a customer-support prompt. Both score well in offline evaluation, but their scores are within 2 points of each other. Which one do you ship? Offline numbers cannot answer that — they were graded on a fixed, possibly stale set of inputs. The real test is which prompt produces better outcomes on live traffic, on real users, with real downstream effects (resolution rate, escalation rate, customer satisfaction).
This tutorial covers two complementary forms of A/B testing for prompts: offline paired tests on your eval set (fast, cheap, low risk), and online traffic splits on real users (slower, more expensive, but the only way to measure real-world outcomes).
An A/B test for prompts has the same four ingredients as any experiment: two variants (A and B), a population, a metric, and a stopping rule.
Most "A/B tests" in prompt work are actually casual eyeballing: an engineer runs both prompts on five examples and picks the one they like better. That is not an A/B test; it is taste-testing with a sample size of five. When the two prompts are genuinely equal, which one "wins" a five-example comparison is decided by noise.
Sample size of five
// Engineer:
// "Ran A and B on five tickets. B looks slightly better
// on two of them. Going with B."
With five samples, a 60/40 split means B won three examples and A won two, which is well within what you would expect from a coin flip. Worse, the five examples were probably the ones the engineer happened to remember, so the sample was biased before the comparison even started.
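To put a number on the coin-flip comparison, here is a small sketch that computes how often B would "win" a five-example comparison if the two prompts were truly identical. It assumes every example produces a clear winner (no ties) and that examples are independent; the helper name `prob_at_least` is made up for this illustration.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more wins in n independent trials with win chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that B wins a majority (3 or more) of 5 examples by luck alone.
majority_by_luck = prob_at_least(3, 5)
# Chance that B wins 4 or more of 5 examples by luck alone.
strong_by_luck = prob_at_least(4, 5)

print(f"P(B wins >= 3 of 5 | no real difference) = {majority_by_luck:.3f}")
print(f"P(B wins >= 4 of 5 | no real difference) = {strong_by_luck:.3f}")
```

Under these assumptions a 3-to-2 result happens one time in two with no real difference at all, and even a 4-to-1 result shows up by luck nearly one time in five.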
Two-stage A/B testing
# STAGE 1: OFFLINE PAIRED TEST (fast, cheap, low risk)
# Run BOTH prompts on the same 100-input eval set.
# Score with your judge. Look at the per-input delta.
# (model, judge, eval_set, and bootstrap are placeholders
# for whatever your own stack provides.)
deltas = []
for example in eval_set:
    out_a = model.run(prompt_a, example)
    out_b = model.run(prompt_b, example)
    score_a = judge(example, out_a)
    score_b = judge(example, out_b)
    # Paired difference: same input, two prompts.
    deltas.append(score_b - score_a)

mean_delta, ci = bootstrap(deltas)
print(f"B − A: {mean_delta:+.2f}  95% CI: {ci}")

# Decision rule: only promote B to online test if
# mean_delta is positive AND the 95% CI excludes zero.

# STAGE 2: ONLINE TRAFFIC SPLIT (slower, real users)
# Route 50% of users to A, 50% to B. Track the primary
# product metric (resolution rate, CSAT, conversion).
# Pre-register sample size and decision rule before
# starting. Don't peek and stop early.
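The Stage 1 sketch calls a `bootstrap` helper without defining it. One plausible way to fill it in is a percentile bootstrap over the paired deltas, shown below using only the standard library. This is an assumption about what the helper does, not the only valid choice; the resample count, the fixed seed, the `promote_to_online` function, and the example deltas are all made up for illustration.

```python
import random

def bootstrap(deltas, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap for the mean of paired per-input deltas.

    Returns (mean_delta, (lower, upper)).
    """
    if not deltas:
        raise ValueError("deltas must not be empty")
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    resampled_means = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = resampled_means[int(tail * n_resamples)]
    upper = resampled_means[int((1.0 - tail) * n_resamples) - 1]
    return sum(deltas) / n, (lower, upper)

def promote_to_online(mean_delta, ci):
    """Stage 1 decision rule: positive mean AND the interval excludes zero."""
    lower, _upper = ci
    return mean_delta > 0 and lower > 0

# Hypothetical deltas (score_b - score_a), purely for illustration.
example_deltas = [0.5, -0.2, 0.3, 0.0, 0.4, 0.1, -0.1, 0.6, 0.2, 0.3]
mean_delta, ci = bootstrap(example_deltas)
print(f"B - A: {mean_delta:+.2f}  95% CI: ({ci[0]:+.2f}, {ci[1]:+.2f})")
print("Promote B to online test:", promote_to_online(mean_delta, ci))
```

Resampling the deltas, rather than the A scores and B scores separately, is what keeps the test paired: each resample preserves the fact that both prompts saw the same input.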
As a rough rule of thumb, the paired offline test removes most of the risk for a small fraction of the time and cost. The online test then confirms (or refutes) the offline win in the real environment, where the downstream metrics that the business actually cares about live.
Tip: Reserve a small "holdout" segment (1–5% of traffic) that always runs the current production prompt without experimentation. It gives you a clean baseline for long-term drift detection and an instant rollback target if a winning prompt later goes sideways.
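A minimal sketch of how the 50/50 split and the holdout segment might be implemented with deterministic hashing. It assumes users are identified by a stable string ID; the function names, the 5% holdout size, and the experiment name `support-prompt-v2` are hypothetical. The holdout is keyed on the user ID alone so that the same users stay in it across experiments, which is one reasonable design and not the only one.

```python
import hashlib

def _bucket(key: str) -> float:
    """Map a string to a stable, roughly uniform position in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def assign_variant(user_id: str, experiment: str, holdout_pct: float = 0.05) -> str:
    """Deterministically assign a user to 'holdout', 'A', or 'B'."""
    # The holdout depends only on the user, so it is the same set of users
    # for every experiment and always sees the current production prompt.
    if _bucket(f"holdout:{user_id}") < holdout_pct:
        return "holdout"
    # Everyone else is split 50/50, salted with the experiment name so that
    # different experiments split users independently of each other.
    return "A" if _bucket(f"{experiment}:{user_id}") < 0.5 else "B"

# Simulate assignment for 10,000 made-up user IDs and count each group.
counts = {"holdout": 0, "A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "support-prompt-v2")] += 1
print(counts)
```

Because assignment is a pure function of the user ID and experiment name, a returning user always sees the same prompt, and no assignment table has to be stored.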
Take an existing prompt and a small wording change. Run both against your 50-input eval set as a paired test. Compute the mean delta and a simple bootstrap confidence interval. Decide based on the CI, not the point estimate.
Design — on paper — an online A/B test for the same change. Write down the win metric, the sample size, the duration, the randomisation key, and the stopping rule. This document is the experiment plan; without it, you are guessing.
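For the sample-size line of the experiment plan, the sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes the win metric is a binary rate such as resolution rate, a two-sided test, and equal traffic per arm; the 70% baseline and 73% target are invented numbers, and `sample_size_per_arm` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_a: float, p_b: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect p_a vs p_b (two-sided test)."""
    if p_a == p_b:
        raise ValueError("p_a and p_b must differ; a zero effect is undetectable")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)   # sum of per-arm Bernoulli variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)

# Hypothetical plan: baseline resolution rate 70%, hoping B reaches 73%.
n = sample_size_per_arm(0.70, 0.73)
print(f"Users needed per arm: {n}")
```

With these made-up numbers the answer lands in the low thousands per arm, and halving the effect you want to detect roughly quadruples it. That is why the sample size belongs in the plan before the test starts, not after the first encouraging dashboard.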
Pick a "winning" prompt change from the last six months in your project. Re-evaluate it today on current traffic. Was the win durable, or did it decay? Many teams discover at least one phantom win when they look back honestly.