RAG in the wild — three case studies

Principles are easy to nod along to and hard to apply, because real systems force trade-offs that no single chapter prepares you for. So before the capstone, three case studies — composite systems built from the patterns this series has covered, each facing a different shape of problem, each with a decision that went wrong and a fix that teaches. These aren't success stories; the interesting part of every real build is the thing that broke. Read them as rehearsals for the judgement calls you'll make on your own system.

What you'll take away from this chapter

How the same toolbox produces three very different systems depending on the problem shape
The specific decision each team got wrong, and the symptom that revealed it
Why the fix was usually a chapter they'd read and underweighted, not a cleverer technique
The cross-cutting lessons that hold across all three

Case 1 — The support assistant

Shape: 40,000 help-centre articles, changing daily, high query volume, conversational. The canonical RAG problem from Chapter 00. The team built sensibly: hybrid retrieval, a reranker, conversational reformulation, a licensed refusal, a semantic cache.

What went wrong: Users complained the assistant gave confidently outdated answers — quoting a refund window that had changed weeks earlier. Retrieval was fine; the cache was the culprit. They'd built the semantic cache from Chapter 15 for cost and speed, and it worked beautifully — too beautifully. A cached answer from before the policy change kept being served to every rephrasing of the question, long after the source article was updated. The cache had no invalidation tied to document changes.

The fix: Invalidate cached answers when their source documents change — exactly the warning from Chapter 15, learned the hard way. The lesson isn't "caching is dangerous"; it's that a cache without invalidation is a staleness bug wearing a performance win's clothing. The symptom (outdated answers) pointed at retrieval, but the cause was a production decision two layers away.
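A minimal sketch of that fix, under stated assumptions: the class and method names below are hypothetical, and an exact string key stands in for the embedding-similarity lookup a real semantic cache would use. The idea it illustrates is only the bookkeeping: record which documents each cached answer was built from, so that a document update can evict every answer that depends on it.

```python
from __future__ import annotations

from collections import defaultdict


class InvalidatingAnswerCache:
    """Answer cache that records which source documents each entry depends on.

    Hypothetical sketch: a real semantic cache would match keys by
    embedding similarity; a plain string key stands in for that here.
    """

    def __init__(self) -> None:
        self._answers: dict[str, tuple[str, frozenset[str]]] = {}
        # Reverse index: document id -> cache keys built from that document.
        self._keys_by_doc: dict[str, set[str]] = defaultdict(set)

    def put(self, key: str, answer: str, source_doc_ids: list[str]) -> None:
        self._drop(key)  # clear any older entry and its reverse-index links
        sources = frozenset(source_doc_ids)
        self._answers[key] = (answer, sources)
        for doc_id in sources:
            self._keys_by_doc[doc_id].add(key)

    def get(self, key: str) -> str | None:
        entry = self._answers.get(key)
        return entry[0] if entry else None

    def invalidate_document(self, doc_id: str) -> int:
        """Evict every answer built from doc_id; return how many were evicted."""
        keys = list(self._keys_by_doc.get(doc_id, ()))
        for key in keys:
            self._drop(key)
        return len(keys)

    def _drop(self, key: str) -> None:
        entry = self._answers.pop(key, None)
        if entry is None:
            return
        for doc_id in entry[1]:
            self._keys_by_doc[doc_id].discard(key)


# Illustrative values only.
cache = InvalidatingAnswerCache()
cache.put("how long is the refund window?", "old policy answer", ["doc-refunds"])
cache.invalidate_document("doc-refunds")  # call this from the ingestion path
```

The design choice worth noticing is where `invalidate_document` gets called: from the same ingestion job that re-indexes an updated article, so the cache cannot outlive the content it was built from.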

Case 2 — Internal engineering knowledge

Shape: A company's internal docs plus its codebase, searched by engineers across several teams with different access. This system leaned on Chapter 13 (AST chunking for the code) and hybrid search (engineers search exact symbol names constantly).

What went wrong: Two failures, both instructive. First, early on, code retrieval returned garbage — half-functions, unrunnable fragments — because the first version chunked code by character count like the prose. AST chunking fixed it overnight, the single largest quality jump in the project. Second, and more serious: an engineer on one team retrieved a chunk from another team's restricted design doc. The access filter was applied after retrieval, so the restricted chunk was fetched, then usually filtered out — but a caching path served it before the filter ran.
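To make the first failure concrete, here is a rough sketch of AST-based chunking. It assumes the code being indexed is Python and uses the standard-library `ast` module; the function name `chunk_python_source` is made up for illustration, and a multi-language codebase would need a parser for each language. Each chunk is a whole top-level function or class, never a slice cut at a character count.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition.

    Sketch only: module-level statements (imports, constants) are skipped,
    and decorator lines are not included in the returned segments.
    """
    tree = ast.parse(source)
    chunks: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


sample = (
    "import os\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Parser:\n"
    "    def parse(self, text):\n"
    "        return text.split()\n"
)

for chunk in chunk_python_source(sample):
    print(chunk)
    print("---")
```

Because every chunk is a complete definition, each one parses on its own, which is the property the character-count chunker could not guarantee.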

The fix: Move the access filter into the retrieval query — pre-filter, per Chapter 16 — so restricted chunks are never candidates, and scope the cache by permission. The character-chunking miss cost weeks of mediocre results; the post-filter miss was a genuine data leak. Both were chapters the team had read and underweighted until the symptom forced the issue.
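Both halves of that fix can be sketched in a few lines. This is an in-memory illustration, not how any particular vector store does it: `Chunk`, `retrieve_prefiltered`, and `cache_key` are hypothetical names, the scores are pretend similarity values, and in a real system the group filter would be expressed in the store's own query filter. The point is the ordering: permissions narrow the candidate set before ranking, and the cache key includes the caller's permission scope.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    allowed_groups: frozenset[str]
    score: float  # stand-in for a similarity score from the retriever


def retrieve_prefiltered(
    chunks: list[Chunk], user_groups: frozenset[str], top_k: int = 3
) -> list[Chunk]:
    """Filter by access first, then rank: restricted chunks are never candidates."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


def cache_key(query: str, user_groups: set[str]) -> tuple[frozenset[str], str]:
    """Scope cache entries by permission so one team's answer is never served to another."""
    return (frozenset(user_groups), query)


# Illustrative corpus: the best-scoring chunk is restricted to another team.
corpus = [
    Chunk("platform-design-doc", frozenset({"platform"}), 0.95),
    Chunk("shared-runbook", frozenset({"platform", "payments"}), 0.80),
    Chunk("payments-faq", frozenset({"payments"}), 0.60),
]

results = retrieve_prefiltered(corpus, frozenset({"payments"}), top_k=2)
print([c.chunk_id for c in results])
```

A payments engineer never sees `platform-design-doc`, even though it scores highest, because it was excluded before ranking rather than removed afterwards.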

Case 3 — Contract and policy Q&A

Shape: A legal team querying a corpus of contracts and policies. Low volume, but every answer is high-stakes and must be checkable. This system cared most about Chapter 09 — faithfulness, citations, refusal — and about the evaluation rigour of Chapter 11. Notably, the team considered graph RAG for "which clauses reference which" and correctly decided against it — their questions were overwhelmingly about clause content, not multi-hop relationships, so vectors plus decomposition sufficed.

What went wrong: The system occasionally answered with a confidently wrong number: a liability cap of "$1 million" when the contract said "$10 million." The contracts contained tables of figures, and ingestion had linearised them into word soup, exactly the failure from Chapter 02. The retrieved chunk said something like "Liability cap 1 10 million parties", and the model guessed wrong about which number went with which clause.

The fix: Serialise tables into self-describing rows at ingestion. Suddenly the chunk read "Liability cap for Party A: $10 million" and the answers became correct and citable. For a high-stakes legal system, the data-prep chapter — the least glamorous in the series — was the one that mattered most. Their evaluation discipline is what caught it: a faithfulness eval flagged the wrong-number answers before a lawyer relied on one.
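One way to picture the serialisation step, as a sketch rather than the team's actual ingestion code: the function name `serialise_table` and the table contents are illustrative (the Party B figure in particular is invented for the example). Each cell becomes its own line that repeats the row label and column header, so no number is ever separated from what it describes.

```python
def serialise_table(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table cell into a self-describing line.

    Assumes the first column holds the row label and the remaining
    columns hold values named by their headers.
    """
    lines: list[str] = []
    for row in rows:
        row_label, *values = row
        for header, value in zip(headers[1:], values):
            lines.append(f"{row_label} for {header}: {value}")
    return lines


# Hypothetical table, shaped like the one that went wrong.
headers = ["Term", "Party A", "Party B"]
rows = [["Liability cap", "$10 million", "$1 million"]]

for line in serialise_table(headers, rows):
    print(line)
```

The first printed line is "Liability cap for Party A: $10 million", which is both answerable and citable on its own, unlike the linearised fragment it replaces.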

The same set of techniques, weighted differently by problem shape. Notice the pattern in red: every failure traced to a chapter the team had read but not taken seriously enough, usually an unglamorous one (cache invalidation, data prep, access filtering).

The cross-cutting lessons

Three systems, three domains, and yet the failures rhyme:

The unglamorous chapters are the ones that bite. Data prep, cache invalidation, access filtering — not embeddings or rerankers — caused every failure here. The boring stages are where production breaks.
The symptom rarely points at the cause. "Outdated answers" looked like retrieval but was caching. "Wrong numbers" looked like the model but was data prep. Diagnose across the whole pipeline, not just the stage that's visibly misbehaving.
Evaluation is what catches problems early. The legal team's faithfulness eval flagged wrong numbers before harm; the support team, with nothing measuring freshness, learned about its stale cache from user complaints. The discipline from Chapter 11 is the safety net under everything else.
Restraint was a feature. The legal team's best decision was not building a graph. Knowing which techniques to skip is as valuable as knowing which to use.
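The evaluation lesson can be illustrated with a very small regression check. This is a simpler cousin of a faithfulness eval, not the legal team's actual harness: it assumes a hand-labelled set of questions with the figure each answer must contain, and `run_number_eval` and `answer_fn` are hypothetical names standing in for your eval runner and your RAG pipeline.

```python
import re
from typing import Callable

_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Pull out the numeric tokens in a string, with thousands separators removed."""
    return {match.replace(",", "") for match in _NUMBER.findall(text)}


def run_number_eval(
    cases: list[tuple[str, str]], answer_fn: Callable[[str], str]
) -> list[tuple[str, str, str]]:
    """Return (question, expected, answer) for every answer missing its expected figure."""
    failures = []
    for question, expected in cases:
        answer = answer_fn(question)
        if expected not in extract_numbers(answer):
            failures.append((question, expected, answer))
    return failures


# Hand-labelled case; the two lambdas stand in for the pipeline before and after the fix.
cases = [("What is the liability cap for Party A?", "10")]

before = run_number_eval(cases, lambda q: "The liability cap is $1 million.")
after = run_number_eval(cases, lambda q: "The liability cap for Party A is $10 million.")
print(len(before), len(after))
```

A check this coarse only confirms the expected figure appears somewhere in the answer; it says nothing about whether the cited chunk supports it. It is enough, though, to turn "a lawyer noticed" into "the nightly run failed".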

My take. If these cases share one moral, it's that RAG failures are almost never where the excitement is. Nobody's system broke because they picked the wrong embedding model or didn't have a fancy enough reranker. They broke on a stale cache, a linearised table, a filter in the wrong place. The glamorous decisions get all the attention and cause almost none of the pain. Spend your worry budget on the plumbing.

When this fails

Diagnosing only the visible stage. The symptom misleads. Build observability across the whole pipeline — what was retrieved, what was cached, what the prompt contained — so you can find the real cause, not the apparent one.
Skipping the boring chapters. Teams pour attention into retrieval tuning and underinvest in data prep, caching, and access control — then break on exactly those. Give the unglamorous stages their due before launch.
No eval to catch regressions. Of the failures above, only the legal team's was caught by measurement before anyone relied on a wrong answer. Without an eval, you ship the bug and learn about it from users, or from an auditor.
Adding techniques to look sophisticated. The legal team would have wasted a quarter building a graph they didn't need. Match technique to problem shape; restraint is an engineering decision.
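The observability point in the first item can be sketched as a single trace record per query. The field names and the `QueryTrace` class are assumptions for illustration, not a standard schema; what matters is that the record captures what was retrieved, whether the cache answered, and what the prompt contained, so a symptom can be traced to its stage.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class QueryTrace:
    """One record per query, covering every stage that could cause a bad answer."""

    query: str
    retrieved_chunk_ids: list[str]
    cache_hit: bool
    prompt: str
    answer: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_log_line(trace: QueryTrace) -> str:
    """Serialise a trace as one JSON line, ready for any log pipeline."""
    return json.dumps(asdict(trace), sort_keys=True)


# Illustrative values: a cache hit with no retrieval is exactly the
# pattern that would have exposed the stale answer in Case 1.
trace = QueryTrace(
    query="how long is the refund window?",
    retrieved_chunk_ids=[],
    cache_hit=True,
    prompt="",
    answer="cached answer text",
)
print(to_log_line(trace))
```

With a record like this, "the answer was outdated" stops being a retrieval question by default: the `cache_hit` field shows immediately that retrieval never ran.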

Practice — before the capstone

Predict your own failure

For the system you're building or imagining, ask: which unglamorous stage am I underweighting? Data prep, cache invalidation, access filtering, freshness? Be honest — the one you're least excited about is the most likely to bite. Write it down; it's your pre-mortem.

Trace a symptom to its cause

Take a hypothetical complaint — "the answer was outdated" — and list every stage that could cause it: stale index, stale cache, retrieval missing the fresh chunk, generation ignoring it. Practising this decoupling of symptom from cause is the core debugging skill these cases demanded.

Justify one omission

Name one technique from this series you will deliberately not build, and write the one-sentence reason. The legal team's "no graph, because our questions aren't multi-hop" is the model. Deliberate omission, justified by the problem shape, is mature engineering.

Takeaways

The same RAG toolbox produces very different systems depending on problem shape — support leaned on freshness and conversation, engineering on AST chunking and access control, legal on faithfulness and data prep.
Every failure traced to an unglamorous chapter the team underweighted: stale cache, character-chunked code, post-filtered access, linearised tables.
The symptom rarely points at the cause. Diagnose across the whole pipeline with real observability.
Evaluation is the safety net: the legal team's eval caught wrong numbers before a lawyer relied on one, while the support team learned about its stale cache from users.
Deliberately not building a technique (the legal team's graph) can be the best decision. Restraint matched to problem shape is mature engineering.

Next chapter: Capstone — a documentation QA system, end to end. The finale. We build one complete system from scratch — a documentation question-answering service — composing every stage of this series into a working whole, in three honest versions: the naive baseline, the measured improvement, and the production build.
