Every chapter in this series has ended with some version of "measure it." Measure your recall before tuning chunk size. Measure faithfulness before trusting your grounding prompt. Measure the lift before adding a reranker. This is the chapter that finally shows you how — because without evaluation, every other decision in RAG is a guess dressed up as engineering. The teams that ship reliable RAG and the teams that ship mystery boxes are separated almost entirely by this one discipline, and it is the discipline most tutorials skip because it's the least glamorous and the most work. It is also the one that pays off every single day after you build it.
A RAG system can fail in two fundamentally different places, and you must measure them separately because the fixes are different. Either retrieval failed — the right evidence never reached the model — or generation failed — the model had good evidence and still produced a bad answer. Lump them into one "accuracy" number and you'll spend weeks tuning the generator when the problem was retrieval, or vice versa. Evaluation's first job is to tell you which half is broken.
Retrieval metrics need a golden set: questions paired with the chunk (or chunks) that genuinely answer them. With that, four metrics cover almost everything, building on the recall and precision from Chapter 01:
| Metric | Answers | Use when |
|---|---|---|
| Recall@k | Did the right chunk make it into the top k? | The headline retrieval metric. If this is low, nothing downstream can save you. |
| Precision@k | What fraction of the top k were relevant? | Measuring noise in the context — too much junk costs tokens and distracts. |
| MRR | How high up was the first relevant chunk? | When the rank of the best chunk matters (it usually does for generation). |
| NDCG | Are the most relevant chunks ranked highest, graded? | When relevance is a spectrum, not yes/no, and ordering matters. |
For most teams, recall@k is where you start and where you spend most of your attention. It's the metric that told you, back in Chapter 07, whether reranking would help — a high recall@50 with a low recall@5 is the signature of a ranking problem, not a finding problem.
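The other three metrics in the table are short enough to sketch per query. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, chunk ids are assumed to be unique within a result list, and NDCG here uses the linear-gain form (some libraries use an exponential gain instead).

```python
import math

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k slots filled by a relevant chunk (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids) / k

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved_ids, gains, k):
    """gains: dict of chunk id -> graded relevance (missing ids count as 0)."""
    dcg = sum(gains.get(cid, 0) / math.log2(i + 2)
              for i, cid in enumerate(retrieved_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up example: two relevant chunks, found at ranks 2 and 4.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(reciprocal_rank(retrieved, relevant))      # 0.5
```

Running the same functions at k=5 and k=50 on one retriever is a cheap way to see the ranking-versus-finding gap described above.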
Generation metrics are harder because there's rarely one correct answer string to match against. Instead of exact matching, you measure the three relationships in the RAG triad (context relevance, faithfulness, and answer relevance), and the dominant way to measure them in 2026 is to use an LLM as a judge, which we'll handle carefully in a moment. The key generation metric is faithfulness: the fraction of claims in the answer that are actually supported by the retrieved context. It is the measurable form of the grounding goal from Chapter 09. A low faithfulness score combined with high context relevance is the precise signature of a generation problem (good evidence, bad answer), and it tells you to fix the prompt, not the retriever.
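To make the retrieval-versus-generation split concrete, here is a small triage sketch. It rests on assumptions that are not from the chapter: a per-question retrieval hit stands in for context relevance, the 0.8 threshold is arbitrary, and `diagnose` is a hypothetical helper name.

```python
from collections import Counter

def diagnose(retrieval_hit, faithfulness, faith_threshold=0.8):
    """Attribute one question's outcome to the half of the system that failed.

    retrieval_hit: True if a gold chunk reached the context.
    faithfulness:  judge score in [0, 1], or None if the judge gave no score.
    """
    if not retrieval_hit:
        return "retrieval"    # evidence never arrived; fix the retriever first
    if faithfulness is None:
        return "unjudged"     # no usable score; don't guess
    if faithfulness < faith_threshold:
        return "generation"   # good evidence, bad answer; fix the prompt
    return "ok"

# Made-up per-question results: (retrieval_hit, faithfulness)
results = [(True, 0.95), (False, 0.90), (True, 0.40), (True, None), (False, 0.20)]
tally = Counter(diagnose(hit, score) for hit, score in results)
print(dict(tally))
```

Checking retrieval first is a deliberate choice in this sketch: a faithfulness score says little about the system when the right evidence was never in the context.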
You do not need a heavy framework to start. Here is a complete harness measuring recall@k for retrieval and faithfulness via an LLM judge for generation, over a golden set. This is the artifact that turns "it seems better" into a number.
```python
# A minimal RAG eval harness: retrieval recall + LLM-judged faithfulness.
# golden: list of {question, answer_chunk_ids}

def recall_at_k(golden, retrieve, k=5):
    if not golden:
        return 0.0  # empty golden set: nothing to measure, avoid dividing by zero
    hits = 0
    for item in golden:
        got = [c.id for c in retrieve(item["question"], k=k)]
        # a hit if ANY of the gold chunks for this question appears in top-k
        if any(gid in got for gid in item["answer_chunk_ids"]):
            hits += 1
    return hits / len(golden)

JUDGE = """You are evaluating whether an answer is FAITHFUL to its context.
An answer is faithful if every factual claim in it is supported by the
context. Reply with a single number 0.0–1.0: the fraction of claims that
are supported. Reply with the number only.

CONTEXT:
{context}

ANSWER:
{answer}"""

def faithfulness(question, answer, chunks, judge_llm):
    # `question` is kept in the signature for symmetry; the judge prompt
    # only needs the context and the answer.
    context = "\n\n".join(chunks)
    raw = judge_llm.complete(JUDGE.format(context=context, answer=answer))
    try:
        return max(0.0, min(1.0, float(raw.strip())))  # clamp to [0, 1]
    except ValueError:
        return None  # judge misbehaved; don't fake a score

def evaluate(golden, pipeline, judge_llm):
    r = recall_at_k(golden, pipeline.retrieve, k=5)
    faith_scores = []
    for item in golden:
        chunks = [c.text for c in pipeline.retrieve(item["question"], k=5)]
        ans = pipeline.generate(item["question"], chunks)
        f = faithfulness(item["question"], ans, chunks, judge_llm)
        if f is not None:
            faith_scores.append(f)
    # If every judge call failed, report None rather than crash or invent a number.
    mean_faith = round(sum(faith_scores) / len(faith_scores), 3) if faith_scores else None
    return {
        "recall@5": round(r, 3),
        "faithfulness": mean_faith,
        "n": len(golden),
    }

print(evaluate(golden, pipeline, judge_llm))
```

```
{'recall@5': 0.86, 'faithfulness': 0.91, 'n': 50}
```
Now you can change one thing — chunk size, reranker, prompt — re-run, and see the number move. That loop, run on every change, is the entire difference between improving your system and rearranging it. Notice the judge's malformed-output handling: when the judge returns something unparseable, the code records nothing rather than inventing a score. A judge you trust blindly is worse than no judge.
The golden set feels like the hard part, and teams put it off forever, which is exactly why their evaluation never happens. It's less work than it looks if you're pragmatic: fifty real questions, each tied to the chunk that answers it, are enough to start.
This is the fifty-question set first mentioned in Chapter 04, now grown into the backbone of your whole evaluation. Build it once; it serves you for the life of the system.
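One low-effort way to keep such a set is a JSONL file with one question per line. The sketch below assumes that layout; the `category` field, its labels, and the sample questions are hypothetical illustrations, and the field names `question` and `answer_chunk_ids` simply follow the harness.

```python
import json
from collections import Counter

REQUIRED = {"question", "answer_chunk_ids"}

def load_golden(lines):
    """Parse JSONL lines into golden items, failing loudly on malformed ones."""
    items = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        missing = REQUIRED - item.keys()
        if missing:
            raise ValueError(f"line {n}: missing fields {sorted(missing)}")
        if not isinstance(item["answer_chunk_ids"], list):
            raise ValueError(f"line {n}: answer_chunk_ids must be a list")
        items.append(item)
    return items

def coverage(items):
    """Count questions per (hypothetical) category label."""
    return Counter(item.get("category", "uncategorised") for item in items)

def answerable(items):
    """Items with at least one gold chunk; unanswerable ones have none."""
    return [item for item in items if item["answer_chunk_ids"]]

# Made-up sample; in practice these lines would come from open("golden.jsonl").
SAMPLE = [
    '{"question": "How do I rotate an API key?", "answer_chunk_ids": ["c12"], "category": "easy"}',
    '{"question": "What does error E4031 mean?", "answer_chunk_ids": ["c40"], "category": "exact_token"}',
    '{"question": "What is the refund policy on Mars?", "answer_chunk_ids": [], "category": "unanswerable"}',
]
items = load_golden(SAMPLE)
print(coverage(items))
print(len(answerable(items)), "answerable of", len(items))
```

Unanswerable questions carry no gold chunk, so a recall@k function that looks for any gold chunk in the top k would count them as misses. Filtering with something like `answerable` before computing recall keeps them from dragging the number down.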
Using an LLM to grade your RAG outputs is powerful and easy to do badly. The failure modes are well known, and avoiding them is most of the skill.
My take. The single most valuable hour in a RAG project is the one where you build the eval harness, even a crude one. Not because the first numbers are impressive — they usually aren't — but because from that hour on, every decision becomes a measurement instead of an argument. Teams without an eval debate chunk sizes in meetings; teams with one run the experiment and move on. The harness pays for itself the first afternoon you'd otherwise have spent arguing.
Offline evaluation on a golden set catches regressions before you ship. But production shows you things a fixed set never will — the queries you didn't anticipate. Capture real signals: thumbs-up/down on answers, whether the user rephrased (a sign the answer missed), whether they clicked the citation, whether they escalated to a human. These signals are noisier than a golden set but they're real, and they tell you where to expand the golden set next. Offline and online evaluation are partners: offline prevents known regressions, online surfaces unknown ones.
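As a rough sketch of how those signals could feed back into the golden set, the snippet below records one event per answered question and surfaces the questions that look like misses. The `Interaction` fields, the `needs_review` rule, and the sample events are all assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    question: str
    thumbs: Optional[int] = None      # +1, -1, or None if the user didn't vote
    rephrased: bool = False           # user asked again in different words
    clicked_citation: bool = False
    escalated: bool = False           # handed off to a human

def needs_review(event):
    """Any explicit or implicit sign that the answer missed."""
    return event.thumbs == -1 or event.rephrased or event.escalated

def golden_candidates(events, known_questions):
    """Questions that look like misses and are not yet in the golden set."""
    seen = {q.strip().lower() for q in known_questions}
    candidates = []
    for event in events:
        key = event.question.strip().lower()
        if needs_review(event) and key not in seen:
            seen.add(key)             # also de-duplicates repeated questions
            candidates.append(event.question)
    return candidates

# Made-up events for illustration.
events = [
    Interaction("How do I reset my password?", thumbs=1, clicked_citation=True),
    Interaction("Can I export audit logs?", thumbs=-1),
    Interaction("can i export audit logs?", rephrased=True),
    Interaction("What is the SLA for support?", escalated=True),
]
print(golden_candidates(events, known_questions=["What is the SLA for support?"]))
```

Each candidate still needs a human to attach the chunk that answers it (or to mark it unanswerable) before it becomes a golden item.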
If you've been putting off building your golden set through ten chapters of being told to, this is the moment. Fifty real questions, each tied to its answer chunk, covering easy cases, exact-token cases, compound cases, and a few unanswerable ones. This is the most valuable artifact you'll build in the whole project.
Drop the harness code onto your pipeline and get your first honest recall@5 and faithfulness numbers. They may be humbling. That's the point — now you have a baseline, and every change from here is measurable against it.
Pick any decision you've been making by intuition — chunk size, top-k, whether to add a reranker — and run it as an A/B against your golden set. Change one variable, measure, decide on the number. Feel the difference between engineering and guessing. That feeling is what this entire series has been building toward.
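A minimal sketch of such an A/B on retrieval follows, assuming each variant exposes a `retrieve(question, k)` callable returning objects with an `id` attribute. The toy retrievers, the `Chunk` tuple, and the helper names are hypothetical stand-ins for a real pipeline.

```python
from collections import namedtuple

Chunk = namedtuple("Chunk", "id text")

def per_question_hits(golden, retrieve, k=5):
    """Map each question to True/False: did any gold chunk reach the top k?"""
    hits = {}
    for item in golden:
        got = {c.id for c in retrieve(item["question"], k=k)}
        hits[item["question"]] = any(g in got for g in item["answer_chunk_ids"])
    return hits

def ab_report(hits_a, hits_b):
    """Compare two variants on the same golden set, question by question."""
    n = len(hits_a)
    return {
        "recall_a": round(sum(hits_a.values()) / n, 3) if n else 0.0,
        "recall_b": round(sum(hits_b.get(q, False) for q in hits_a) / n, 3) if n else 0.0,
        "gained": sorted(q for q in hits_a if not hits_a[q] and hits_b.get(q, False)),
        "lost": sorted(q for q in hits_a if hits_a[q] and not hits_b.get(q, False)),
    }

def make_retriever(index):
    """Toy retriever backed by a dict of question -> ranked chunk ids."""
    def retrieve(question, k=5):
        return [Chunk(cid, "") for cid in index.get(question, [])][:k]
    return retrieve

golden = [
    {"question": "q1", "answer_chunk_ids": ["a"]},
    {"question": "q2", "answer_chunk_ids": ["b"]},
    {"question": "q3", "answer_chunk_ids": ["c"]},
]
variant_a = make_retriever({"q1": ["a", "x"], "q2": ["y"], "q3": ["c"]})
variant_b = make_retriever({"q1": ["a"], "q2": ["b", "y"], "q3": ["z"]})

report = ab_report(per_question_hits(golden, variant_a),
                   per_question_hits(golden, variant_b))
print(report)
```

In this toy run both variants score the same recall while one question is gained and another lost, which is why the sketch reports per-question flips alongside the aggregate.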
Wave 1 complete. You now have the whole core of RAG — from why it exists, through data prep, chunking, embeddings, the vector index, retrieval, reranking, query understanding, generation, conversation, and finally the evaluation discipline that holds it all honest. With these eleven chapters you can build a real RAG system and understand every decision inside it.
What's next: Wave 2 goes beyond the core — the specialised verticals (multi-modal RAG, RAG for code, graph RAG), production hardening (latency, cost, caching, security and compliance), an honest tour of the 2026 tooling, real-world case studies, and a full capstone that composes everything here into a working, deployable documentation Q&A system. The foundation you've built in Wave 1 is what everything in Wave 2 stands on.