RAG: A Field Manual for Building LLM Systems That Use Your Data

Evaluation — the part most series skip

Every chapter in this series has ended with some version of "measure it." Measure your recall before tuning chunk size. Measure faithfulness before trusting your grounding prompt. Measure the lift before adding a reranker. This is the chapter that finally shows you how — because without evaluation, every other decision in RAG is a guess dressed up as engineering. The teams that ship reliable RAG and the teams that ship mystery boxes are separated almost entirely by this one discipline, and it is the discipline most tutorials skip because it's the least glamorous and the most work. It is also the one that pays off every single day after you build it.

What you'll take away from this chapter

Why "it seems better" is the most expensive sentence in RAG, and what replaces it
The two halves of RAG evaluation — retrieval metrics and generation metrics
The RAG triad: faithfulness, answer relevance, context relevance — and what each catches
How to build a golden eval set without losing your weekend
How to use an LLM as a judge correctly, and the ways it goes wrong

Two halves, because RAG has two failure points

A RAG system can fail in two fundamentally different places, and you must measure them separately because the fixes are different. Either retrieval failed — the right evidence never reached the model — or generation failed — the model had good evidence and still produced a bad answer. Lump them into one "accuracy" number and you'll spend weeks tuning the generator when the problem was retrieval, or vice versa. Evaluation's first job is to tell you which half is broken.

Three relationships, three metrics. Context relevance tests retrieval (did we fetch the right chunks?). Faithfulness tests grounding (does the answer stick to them?). Answer relevance tests the whole (did we actually address the question?). Together they localise any failure.

Retrieval metrics — did the right evidence arrive?

Retrieval metrics need a golden set: questions paired with the chunk (or chunks) that genuinely answer them. With that, four metrics cover almost everything, building on the recall and precision from Chapter 01:

Metric	Answers	Use when
Recall@k	Did the right chunk make it into the top k?	The headline retrieval metric. If this is low, nothing downstream can save you.
Precision@k	What fraction of the top k were relevant?	Measuring noise in the context — too much junk costs tokens and distracts.
MRR	How high up was the first relevant chunk?	When the rank of the best chunk matters (it usually does for generation).
NDCG	Are the most relevant chunks ranked highest, graded?	When relevance is a spectrum, not yes/no, and ordering matters.

For most teams, recall@k is where you start and where you spend most of your attention. It's the metric that told you, back in Chapter 07, whether reranking would help — a high recall@50 with a low recall@5 is the signature of a ranking problem, not a finding problem.
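The other three metrics in the table are only a few lines each. What follows is a minimal sketch, assuming each retrieval result has been reduced to an ordered list of unique chunk ids, that gold labels are a set of ids for the binary metrics, and that NDCG receives a dict mapping chunk id to a graded relevance score. The function names are illustrative and not taken from any library, and the NDCG shown uses the linear-gain form; other formulations exist.

```python
import math

def precision_at_k(ranked_ids, gold_ids, k=5):
    """Fraction of the top-k slots filled by a gold chunk (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for cid in ranked_ids[:k] if cid in gold_ids) / k

def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, gold_ids) pairs, one per golden question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k=5):
    """gains: dict of chunk id -> graded relevance; missing ids count as 0."""
    dcg = sum(gains.get(cid, 0) / math.log2(i + 2)
              for i, cid in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Average each per-question value over the golden set, exactly as recall@k is averaged, and report the results side by side rather than blending them.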

Generation metrics — was the answer good?

Generation metrics are harder because there's rarely one correct answer string to match against. Instead of exact matching, you measure the three relationships in the triad, and the dominant way to measure them in 2026 is to use an LLM as a judge — which we'll handle carefully in a moment. The key generation metric is faithfulness: the fraction of claims in the answer actually supported by the retrieved context, the measurable form of the grounding goal from Chapter 09. A low faithfulness score with high context relevance is the precise signature of a generation problem — good evidence, bad answer — and tells you to fix the prompt, not the retriever.
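The signatures described so far reduce to a small decision rule. The sketch below is one hypothetical way to encode it: it uses a wide recall (for example recall@50) and a narrow recall (for example recall@5) as the retrieval signals, and the thresholds are arbitrary placeholders, not recommended values. Tune them against your own baseline.

```python
def localise_failure(recall_wide, recall_narrow, faithfulness,
                     recall_floor=0.8, faith_floor=0.85):
    """Name the stage to look at first. All thresholds are illustrative assumptions."""
    if recall_wide < recall_floor:
        # The right chunk is not found even in a generous top-k.
        return "retrieval: finding"
    if recall_narrow < recall_floor:
        # Found in the wide set but not near the top: a ranking problem.
        return "retrieval: ranking"
    if faithfulness < faith_floor:
        # Good evidence arrived and the answer still drifted from it.
        return "generation"
    return "ok"
```

The order of the checks matters: retrieval is tested first because a generation score means little when the evidence never arrived.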

A small eval harness

You do not need a heavy framework to start. Here is a complete harness measuring recall@k for retrieval and faithfulness via an LLM judge for generation, over a golden set. This is the artifact that turns "it seems better" into a number.

# A minimal RAG eval harness: retrieval recall + LLM-judged faithfulness.
# golden: list of {question, answer_chunk_ids}
def recall_at_k(golden, retrieve, k=5):
    hits = 0
    for item in golden:
        got = [c.id for c in retrieve(item["question"], k=k)]
        # a hit if ANY of the gold chunks for this question appears in top-k
        if any(gid in got for gid in item["answer_chunk_ids"]):
            hits += 1
    return hits / len(golden)

JUDGE = """You are evaluating whether an answer is FAITHFUL to its context.
An answer is faithful if every factual claim in it is supported by the
context. Reply with a single number 0.0–1.0: the fraction of claims that
are supported. Reply with the number only.

CONTEXT:
{context}

ANSWER:
{answer}"""

def faithfulness(question, answer, chunks, judge_llm):
    context = "\n\n".join(chunks)
    raw = judge_llm.complete(JUDGE.format(context=context, answer=answer))
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return None                       # judge misbehaved; don't fake a score

def evaluate(golden, pipeline, judge_llm):
    r = recall_at_k(golden, pipeline.retrieve, k=5)
    faith_scores = []
    for item in golden:
        chunks = [c.text for c in pipeline.retrieve(item["question"], k=5)]
        ans = pipeline.generate(item["question"], chunks)
        f = faithfulness(item["question"], ans, chunks, judge_llm)
        if f is not None:
            faith_scores.append(f)
    # If the judge never returned a parseable score, report None
    # instead of dividing by zero.
    mean_faith = round(sum(faith_scores) / len(faith_scores), 3) if faith_scores else None
    return {
        "recall@5": round(r, 3),
        "faithfulness": mean_faith,
        "n": len(golden),
    }

print(evaluate(golden, pipeline, judge_llm))

{'recall@5': 0.86, 'faithfulness': 0.91, 'n': 50}

Now you can change one thing — chunk size, reranker, prompt — re-run, and see the number move. That loop, run on every change, is the entire difference between improving your system and rearranging it. Notice the judge's malformed-output handling: when the judge returns something unparseable, the code records nothing rather than inventing a score. A judge you trust blindly is worse than no judge.

Building a golden set without losing your weekend

The golden set feels like the hard part, and teams put it off forever, which is exactly why their evaluation never happens. It's less work than it looks if you're pragmatic:

Fifty questions is enough to start. Not five thousand. Fifty real questions, each tied to its answer chunk, will catch most regressions and guide most decisions. You can grow it later.
Mine real sources. Support tickets, search logs, the questions colleagues actually ask. Real questions beat invented ones because they carry the messiness of Chapter 08.
Let an LLM draft, then have a human verify. Have a model propose question-and-answer-chunk pairs from your corpus, then have a human confirm each one. Drafting by hand is the slow part, so that is what the model takes over; verification is fast, and the human check is what makes it golden.
Cover the failure modes deliberately. Include exact-token queries (for hybrid), compound questions (for decomposition), and questions your corpus doesn't answer (for refusal). A golden set of only easy questions measures only easy performance.

This is the fifty-question set first mentioned in Chapter 04, now grown into the backbone of your whole evaluation. Build it once, refresh it as production surfaces new kinds of query, and it serves you for the life of the system.
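A golden set is easier to keep honest if every item records what kind of question it is and who verified it. The following is a sketch of one possible shape, not a prescribed schema: the `GoldenItem` fields, the `kind` labels, and `audit_golden_set` are all hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical labels for the failure modes the set should cover.
REQUIRED_KINDS = {"plain", "exact_token", "compound", "unanswerable"}

@dataclass
class GoldenItem:
    question: str
    answer_chunk_ids: list = field(default_factory=list)  # empty if unanswerable
    kind: str = "plain"
    verified_by: str = ""  # the human who confirmed the pair

def audit_golden_set(items):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    counts = Counter(item.kind for item in items)
    for kind in sorted(REQUIRED_KINDS - set(counts)):
        problems.append(f"no '{kind}' questions")
    for i, item in enumerate(items):
        if not item.verified_by:
            problems.append(f"item {i} has not been verified by a human")
        if item.kind == "unanswerable" and item.answer_chunk_ids:
            problems.append(f"item {i} is unanswerable but lists gold chunks")
        if item.kind != "unanswerable" and not item.answer_chunk_ids:
            problems.append(f"item {i} has no gold chunk")
    return problems
```

One consequence worth noting: unanswerable items have no gold chunk, so a recall function that looks for any gold chunk in the top k will always count them as misses. Score those items on whether the system refuses, and leave them out of the recall average.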

LLM-as-judge, done correctly

Using an LLM to grade your RAG outputs is powerful and easy to do badly. The failure modes are well-known, and avoiding them is most of the skill:

Score one narrow thing at a time. A judge asked for a single "quality" score gives noise. A judge asked "what fraction of claims are supported by the context?" gives signal. Decompose into faithfulness, relevance, and so on — never one vague number.
Give the judge a rubric, not a vibe. Define exactly what each score means. "1.0 = every claim supported; 0.5 = half supported" beats "rate the quality."
Beware position and verbosity bias. Judges tend to favour an option because of where it appears (often the first one shown) and to favour longer answers. Randomise order when comparing; control for length.
Validate the judge against humans. Periodically have a human grade a sample and check the judge agrees. A judge that diverges from human judgement is measuring something, but not what you think.
Use a strong model as judge. Judging is harder than answering. A weak judge gives confident, wrong grades.
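Two of these checks are cheap to automate. The sketch below assumes you have a sample of answers scored by both the judge and a human on the same 0.0 to 1.0 scale, and, for pairwise comparisons, a hypothetical `judge_pair(first, second)` callable that returns `"first"`, `"second"`, or `"tie"`. Note that the pairwise helper runs both orders and only accepts a consistent verdict, which is a stricter variant of randomising the order; the tolerance value is an arbitrary placeholder.

```python
from statistics import mean

def judge_agreement(judge_scores, human_scores, tolerance=0.2):
    """Compare judge and human scores for the same answers, skipping judge failures."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    pairs = [(j, h) for j, h in zip(judge_scores, human_scores) if j is not None]
    if not pairs:
        raise ValueError("no usable judge scores to compare")
    diffs = [abs(j - h) for j, h in pairs]
    return {
        "n": len(pairs),
        "mean_abs_error": round(mean(diffs), 3),
        "within_tolerance": round(sum(d <= tolerance for d in diffs) / len(pairs), 3),
        # Positive means the judge is more generous than the human.
        "judge_bias": round(mean([j - h for j, h in pairs]), 3),
    }

def compare_both_orders(judge_pair, answer_a, answer_b):
    """Ask the judge twice with the order swapped; keep only a consistent verdict."""
    forward = judge_pair(answer_a, answer_b)    # A shown first
    backward = judge_pair(answer_b, answer_a)   # B shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "inconclusive"
```

A judge that always prefers whichever answer it sees first will come back `"inconclusive"` on every pair, which is exactly the signal you want before trusting its comparisons.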

My take. The single most valuable hour in a RAG project is the one where you build the eval harness, even a crude one. Not because the first numbers are impressive — they usually aren't — but because from that hour on, every decision becomes a measurement instead of an argument. Teams without an eval debate chunk sizes in meetings; teams with one run the experiment and move on. The harness pays for itself the first afternoon you'd otherwise have spent arguing.

Evaluating in production

Offline evaluation on a golden set catches regressions before you ship. But production shows you things a fixed set never will — the queries you didn't anticipate. Capture real signals: thumbs-up/down on answers, whether the user rephrased (a sign the answer missed), whether they clicked the citation, whether they escalated to a human. These signals are noisier than a golden set but they're real, and they tell you where to expand the golden set next. Offline and online evaluation are partners: offline prevents known regressions, online surfaces unknown ones.
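Turning those signals into golden-set candidates can start as a simple ranking over your interaction log. This is a sketch under stated assumptions: the `Interaction` record and its field names are hypothetical, and the weights in `miss_score` are placeholders with no empirical backing. Every candidate still needs a human to attach the gold chunk before it joins the set.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    query: str
    thumbs: int = 0                 # +1, -1, or 0 when no rating was given
    rephrased: bool = False         # user asked again in different words
    clicked_citation: bool = False  # logged for later analysis, not scored here
    escalated: bool = False         # user was handed to a human

def miss_score(x):
    """Higher means stronger evidence that the answer missed. Weights are assumptions."""
    return 2 * x.escalated + 2 * (x.thumbs < 0) + 1 * x.rephrased

def golden_set_candidates(interactions, limit=20):
    """Return the queries most likely to have been answered badly, worst first."""
    flagged = [x for x in interactions if miss_score(x) > 0]
    flagged.sort(key=miss_score, reverse=True)  # stable: ties keep log order
    return [x.query for x in flagged[:limit]]
```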

When this fails

No golden set, ever. The most common failure: evaluation stays a someday task and the system is tuned by vibes forever. Fifty questions this week beats five thousand never.
One blended "accuracy" number. Merging retrieval and generation into a single score hides which half is broken. Always measure the two stages separately.
Trusting the judge blindly. An unvalidated LLM judge can drift from human judgement and quietly mislead every decision. Validate against humans periodically; handle malformed judge output rather than faking scores.
A too-easy golden set. If your eval contains only questions the system answers well, it measures nothing useful. Deliberately include hard cases, exact-token queries, and unanswerable questions.
Optimising the metric instead of the system. Chasing a number until you've overfit your golden set produces a system that aces the eval and fails users. Refresh the golden set with real production queries to keep it honest.
Measuring once. An eval run once and forgotten is a snapshot, not a discipline. The value is in running it on every change — that's what makes it a ratchet rather than a photograph.
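Running the eval on every change is easiest to enforce when a drop fails loudly. The sketch below assumes the harness returns a dict keyed by `"recall@5"` and `"faithfulness"`, as in the example output earlier, and that a baseline dict of the same shape has been saved from a known-good run. `regression_gate` is a hypothetical helper, and the tolerance is a placeholder.

```python
def regression_gate(baseline, current, tolerance=0.02,
                    metrics=("recall@5", "faithfulness")):
    """Return the metrics that dropped by more than `tolerance` against the baseline."""
    failures = {}
    for name in metrics:
        if current.get(name) is None:
            # A metric that could not be computed is a failure, not a pass.
            failures[name] = "missing"
            continue
        drop = baseline[name] - current[name]
        if drop > tolerance:
            failures[name] = round(drop, 3)
    return failures
```

An empty dict means the change is safe to ship on these metrics. Keep in mind that on a fifty-question set a single question is worth 0.02 of recall, so a tight tolerance will flag one-question swings.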

Practice — graduating from this chapter

Build the fifty-question golden set

If you've been putting it off through ten chapters of being told to, this is the moment. Fifty real questions, each tied to its answer chunk, covering easy cases, exact-token cases, compound cases, and a few unanswerable ones. This is the most valuable artifact you'll build in the whole project.

Run the harness on your system

Drop the harness code onto your pipeline and get your first honest recall@5 and faithfulness numbers. They may be humbling. That's the point — now you have a baseline, and every change from here is measurable against it.

Run one real experiment

Pick any decision you've been making by intuition — chunk size, top-k, whether to add a reranker — and run it as an A/B against your golden set. Change one variable, measure, decide on the number. Feel the difference between engineering and guessing. That feeling is what this entire series has been building toward.
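For the A/B itself, a per-question comparison tells you more than two averages, because it shows which questions the change fixed and which it broke. Here is a minimal sketch, assuming both variants expose the same `retrieve(question, k=...)` interface used by the harness and return objects with an `id` attribute. The fake retrievers at the bottom are stand-ins for illustration only.

```python
from types import SimpleNamespace

def per_question_hits(golden, retrieve, k=5):
    """One boolean per golden item: did any gold chunk land in the top k?"""
    hits = []
    for item in golden:
        got = {c.id for c in retrieve(item["question"], k=k)}
        hits.append(any(gid in got for gid in item["answer_chunk_ids"]))
    return hits

def compare_variants(golden, retrieve_a, retrieve_b, k=5):
    """Compare two retrievers question by question on the same golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    a = per_question_hits(golden, retrieve_a, k)
    b = per_question_hits(golden, retrieve_b, k)
    rows = list(zip(golden, a, b))
    return {
        "recall_a": round(sum(a) / len(a), 3),
        "recall_b": round(sum(b) / len(b), 3),
        "fixed_by_b": [g["question"] for g, x, y in rows if y and not x],
        "broken_by_b": [g["question"] for g, x, y in rows if x and not y],
    }

# Tiny demonstration with fake retrievers standing in for two pipeline variants.
golden = [
    {"question": "q1", "answer_chunk_ids": ["c1"]},
    {"question": "q2", "answer_chunk_ids": ["c2"]},
]

def retrieve_a(question, k=5):
    return [SimpleNamespace(id="c1")][:k]

def retrieve_b(question, k=5):
    return [SimpleNamespace(id="c1"), SimpleNamespace(id="c2")][:k]

report = compare_variants(golden, retrieve_a, retrieve_b)
```

A variant that raises the average while adding entries to `broken_by_b` deserves a look at those questions before you adopt it.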

Takeaways

"It seems better" is the most expensive sentence in RAG. Replace it with a number from a golden set.
Measure retrieval and generation separately — they fail differently and are fixed differently. The RAG triad (context relevance, faithfulness, answer relevance) localises the failure.
Recall@k is the headline retrieval metric; faithfulness is the headline generation metric. Start there.
Fifty real, verified questions is enough to begin. Mine real sources, let an LLM draft and a human verify, and cover the hard cases deliberately.
LLM-as-judge works if you score one narrow thing at a time, give it a rubric, control for bias, and validate against humans. Never trust an unvalidated judge.
Pair offline evaluation (catches known regressions) with production signals (surface unknown ones). Run it on every change, not once.

Wave 1 complete. You now have the whole core of RAG: why it exists, then data prep, chunking, embeddings, the vector index, retrieval, reranking, query understanding, generation, conversation, and finally the evaluation discipline that keeps it all honest. With these eleven chapters you can build a real RAG system and understand every decision inside it.

What's next: Wave 2 goes beyond the core — the specialised verticals (multi-modal RAG, RAG for code, graph RAG), production hardening (latency, cost, caching, security and compliance), an honest tour of the 2026 tooling, real-world case studies, and a full capstone that composes everything here into a working, deployable documentation Q&A system. The foundation you've built in Wave 1 is what everything in Wave 2 stands on.
