Principles are easy to nod along to and hard to apply, because real systems force trade-offs that no single chapter prepares you for. So before the capstone, three case studies — composite systems built from the patterns this series has covered, each facing a different shape of problem, each with a decision that went wrong and a fix that teaches. These aren't success stories; the interesting part of every real build is the thing that broke. Read them as rehearsals for the judgement calls you'll make on your own system.
Shape: 40,000 help-centre articles, changing daily, high query volume, conversational. The canonical RAG problem from Chapter 00. The team built sensibly: hybrid retrieval, a reranker, conversational reformulation, a licensed refusal, a semantic cache.
What went wrong: Users complained the assistant gave confidently outdated answers — quoting a refund window that had changed weeks earlier. Retrieval was fine; the cache was the culprit. They'd built the semantic cache from Chapter 15 for cost and speed, and it worked beautifully — too beautifully. A cached answer from before the policy change kept being served to every rephrasing of the question, long after the source article was updated. The cache had no invalidation tied to document changes.
The fix: Invalidate cached answers when their source documents change — exactly the warning from Chapter 15, learned the hard way. The lesson isn't "caching is dangerous"; it's that a cache without invalidation is a staleness bug wearing a performance win's clothing. The symptom (outdated answers) pointed at retrieval, but the cause was a production decision two layers away.
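One way to picture this fix is a cache that remembers which documents each answer was built from, so that an article update can evict every answer that depended on it. The sketch below is a minimal illustration under stated assumptions, not the team's implementation: `InvalidatingCache`, `put`, `get`, and `invalidate_document` are hypothetical names, and the plain string key stands in for whatever a semantic cache actually matches on.

```python
from dataclasses import dataclass, field


@dataclass
class CachedAnswer:
    answer: str
    source_doc_ids: frozenset[str]


@dataclass
class InvalidatingCache:
    """Toy answer cache that tracks the source documents behind each answer."""

    _entries: dict[str, CachedAnswer] = field(default_factory=dict)
    # Reverse index: document id -> cache keys whose answers cite that document.
    _keys_by_doc: dict[str, set[str]] = field(default_factory=dict)

    def _remove(self, key: str) -> bool:
        """Delete one entry and clean up the reverse index. True if it existed."""
        entry = self._entries.pop(key, None)
        if entry is None:
            return False
        for doc_id in entry.source_doc_ids:
            keys = self._keys_by_doc.get(doc_id)
            if keys is not None:
                keys.discard(key)
                if not keys:
                    del self._keys_by_doc[doc_id]
        return True

    def put(self, key: str, answer: str, source_doc_ids) -> None:
        self._remove(key)  # avoid stale reverse-index rows when overwriting
        self._entries[key] = CachedAnswer(answer, frozenset(source_doc_ids))
        for doc_id in source_doc_ids:
            self._keys_by_doc.setdefault(doc_id, set()).add(key)

    def get(self, key: str):
        entry = self._entries.get(key)
        return entry.answer if entry else None

    def invalidate_document(self, doc_id: str) -> int:
        """Evict every answer derived from doc_id; return how many were evicted."""
        keys = list(self._keys_by_doc.get(doc_id, ()))
        return sum(self._remove(key) for key in keys)


cache = InvalidatingCache()
cache.put("refund window", "30 days", ["doc-refunds"])
cache.put("password reset", "Use the emailed link", ["doc-account"])
cache.invalidate_document("doc-refunds")  # call this from the ingestion pipeline
```

The design choice worth noticing is that invalidation is driven by the ingestion side: whenever an article is re-indexed, the same pipeline step calls `invalidate_document`, so the cache cannot outlive the content it summarises.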
Shape: A company's internal docs plus its codebase, searched by engineers across several teams with different access. This system leaned on Chapter 13 (AST chunking for the code) and hybrid search (engineers search exact symbol names constantly).
What went wrong: Two failures, both instructive. First, early on, code retrieval returned garbage — half-functions, unrunnable fragments — because the first version chunked code by character count like the prose. AST chunking fixed it overnight, the single largest quality jump in the project. Second, and more serious: an engineer on one team retrieved a chunk from another team's restricted design doc. The access filter was applied after retrieval, so the restricted chunk was fetched, then usually filtered out — but a caching path served it before the filter ran.
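The first of these failures is easy to reproduce. As a rough sketch, assuming the code being indexed is Python and using the standard library's `ast` module (the case study does not say which languages or tools the team used), a chunker can emit one chunk per top-level function or class instead of cutting at a character count. `chunk_python_source` is a hypothetical helper name.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition.

    Each chunk is a complete definition, so it never ends mid-function
    the way a fixed character window can.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            # (decorators are not included unless padded in separately).
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


example = (
    "import os\n"
    "\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def hi(self):\n"
    "        return 'hi'\n"
)
for chunk in chunk_python_source(example):
    print(chunk, end="\n---\n")
```

A production chunker would also need to handle module-level statements, very long classes, and other languages; this sketch only shows the core idea of splitting on syntax boundaries rather than character offsets.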
The fix: Move the access filter into the retrieval query (pre-filter, per Chapter 16) so restricted chunks are never candidates, and scope the cache by permission. The character-chunking miss cost weeks of mediocre results; the post-filter miss was a genuine data leak. Both lessons came from chapters the team had read and underweighted until the symptom forced the issue.
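Both halves of that fix can be sketched in a few lines. The example below is illustrative only: `Chunk`, `score`, `retrieve`, and `cached_retrieve` are hypothetical names, the word-overlap score is a stand-in for real hybrid retrieval, and in a real vector store the permission check would be expressed as a filter inside the query rather than a Python list comprehension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]


def score(query: str, chunk: Chunk) -> int:
    """Stand-in relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.text.lower().split()))


def retrieve(query: str, chunks, user_groups, k: int = 3) -> list[Chunk]:
    groups = frozenset(user_groups)
    # Pre-filter: chunks the user may not see are never scored or ranked.
    candidates = [c for c in chunks if c.allowed_groups & groups]
    scored = [(score(query, c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def cached_retrieve(cache: dict, query: str, chunks, user_groups, k: int = 3):
    # The cache key includes the caller's groups, so a result computed for
    # one permission set is never served to a different one.
    key = (query.strip().lower(), frozenset(user_groups))
    if key not in cache:
        cache[key] = retrieve(query, chunks, user_groups, k)
    return cache[key]


corpus = [
    Chunk("a", "payments service design doc", frozenset({"payments"})),
    Chunk("b", "payments api public overview", frozenset({"payments", "search"})),
]
cache: dict = {}
print([c.chunk_id for c in cached_retrieve(cache, "payments design", corpus, {"payments"})])
print([c.chunk_id for c in cached_retrieve(cache, "payments design", corpus, {"search"})])
```

With the filter inside retrieval and the permission set inside the cache key, there is no code path on which a restricted chunk exists in memory for a user who cannot read it, which is the property the post-filter design lacked.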
Shape: A legal team querying a corpus of contracts and policies. Low volume, but every answer is high-stakes and must be checkable. This system cared most about Chapter 09 — faithfulness, citations, refusal — and about the evaluation rigour of Chapter 11. Notably, the team considered graph RAG for "which clauses reference which" and correctly decided against it — their questions were overwhelmingly about clause content, not multi-hop relationships, so vectors plus decomposition sufficed.
What went wrong: The system occasionally answered with a confidently wrong number, such as a liability cap of "$1 million" when the contract said "$10 million." The contracts contained tables of figures, and ingestion had linearised them into word soup, exactly the failure from Chapter 02. The retrieved chunk said something like "Liability cap 1 10 million parties" and the model guessed wrong about which number went with which clause.
The fix: Serialise tables into self-describing rows at ingestion. Suddenly the chunk read "Liability cap for Party A: $10 million" and the answers became correct and citable. For a high-stakes legal system, the data-prep chapter — the least glamorous in the series — was the one that mattered most. Their evaluation discipline is what caught it: a faithfulness eval flagged the wrong-number answers before a lawyer relied on one.
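To make "self-describing rows" concrete, here is a minimal sketch of the idea, assuming the table has already been parsed into a header row and data rows whose first column is the row label. `serialise_table` is a hypothetical helper, and every figure in the sample table other than Party A's cap is invented purely for illustration.

```python
def serialise_table(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table cell into a sentence that carries its own context.

    The first column is treated as the row label; every other cell becomes
    "<column header> for <row label>: <value>".
    """
    value_headers = headers[1:]
    lines = []
    for row in rows:
        row_label, values = row[0], row[1:]
        for header, value in zip(value_headers, values):
            lines.append(f"{header} for {row_label}: {value}")
    return lines


# Illustrative table; only the Party A liability cap comes from the case study.
headers = ["Party", "Liability cap", "Notice period"]
rows = [
    ["Party A", "$10 million", "30 days"],
    ["Party B", "$1 million", "60 days"],
]
for line in serialise_table(headers, rows):
    print(line)
```

Each emitted line stays meaningful even if the chunker later separates it from its neighbours, which is what lets the model, and the lawyer checking the citation, tie a number to the right party and clause.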
Three systems, three domains, and yet the failures rhyme.
My take. If these cases share one moral, it's that RAG failures are almost never where the excitement is. Nobody's system broke because they picked the wrong embedding model or didn't have a fancy enough reranker. They broke on a stale cache, a linearised table, a filter in the wrong place. The glamorous decisions get all the attention and cause almost none of the pain. Spend your worry budget on the plumbing.
For the system you're building or imagining, ask: which unglamorous stage am I underweighting? Data prep, cache invalidation, access filtering, freshness? Be honest — the one you're least excited about is the most likely to bite. Write it down; it's your pre-mortem.
Take a hypothetical complaint — "the answer was outdated" — and list every stage that could cause it: stale index, stale cache, retrieval missing the fresh chunk, generation ignoring it. Practising this decoupling of symptom from cause is the core debugging skill these cases demanded.
Name one technique from this series you will deliberately not build, and write the one-sentence reason. The legal team's "no graph, because our questions aren't multi-hop" is the model. Deliberate omission, justified by the problem shape, is mature engineering.
Next chapter: Capstone — a documentation QA system, end to end. The finale. We build one complete system from scratch — a documentation question-answering service — composing every stage of this series into a working whole, in three honest versions: the naive baseline, the measured improvement, and the production build.