RAG: A Field Manual for Building LLM Systems That Use Your Data

Capstone — a documentation QA system, end to end

Nineteen chapters of parts. Now we assemble the whole machine. We're going to build one complete system — a documentation question-answering service, the most common RAG application there is — and we'll build it three times: a naive baseline that works and underwhelms, a measured improvement that earns every change with a number, and a production version hardened for the real world. Building it three times is the entire point. It's how you internalise the series' deepest lesson: RAG is not a technique you apply once, it's a loop you run — build, measure, improve — and the discipline of that loop matters more than any single trick inside it.

What you'll take away from this chapter

A complete, end-to-end RAG system you can read in one sitting
The naive baseline — the smallest thing that works — and its honest numbers
The measured improvement — each upgrade justified by a metric, not a vibe
The production hardening — what turns a good pipeline into a shippable service
The closing argument of the whole series, made concrete in one build

The progression

Three versions, each adding only what the previous version's measurements showed it needed. This is the shape of every real RAG project worth respecting.

V1 proves the concept and gives you a baseline. V2 spends effort where V1's metrics showed weakness: on the eval set used in this chapter, recall@5 rises from about 0.71 to about 0.91 and faithfulness from about 0.78 to about 0.94, and each gain is earned, not assumed. V3 adds nothing to quality; it adds everything to survivability. The numbers are illustrative, but the shape is universal.

Version 1 — the naive baseline

The smallest complete system. Resist every urge to be clever; you're building a reference point, and a baseline you over-engineer is no baseline at all. This is the straight line from Chapter 01: embed, retrieve, generate.

# V1: naive documentation QA. The whole thing, end to end.
# `llm` is a placeholder for your LLM client (anything with complete(prompt) -> str).
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")   # 768-dimensional vectors
conn = psycopg.connect("postgresql://localhost/docsqa")

def setup():
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)      # register only after the extension (and its type) exists
    conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY, doc_id text, content text,
        embedding vector(768))""")
    conn.execute("""CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks
        USING hnsw (embedding vector_cosine_ops)""")
    conn.commit()

def chunk(text, size=800):                       # recursive-ish, simplified
    # Pack consecutive paragraphs into chunks of roughly `size` characters.
    # A single paragraph longer than `size` becomes its own oversized chunk.
    paras, out, buf = text.split("\n\n"), [], ""
    for p in paras:
        if len(buf) + len(p) <= size: buf += "\n\n" + p
        else: out.append(buf.strip()); buf = p
    if buf: out.append(buf.strip())
    return [c for c in out if c]                 # drop empty chunks

def ingest(doc_id, text):
    for c in chunk(text):
        v = embedder.encode(c, normalize_embeddings=True)
        conn.execute("INSERT INTO chunks (doc_id, content, embedding) "
                     "VALUES (%s,%s,%s)", (doc_id, c, v))
    conn.commit()

def answer(question, k=5):
    q = embedder.encode(question, normalize_embeddings=True)
    # <=> is pgvector's cosine-distance operator, matching vector_cosine_ops above
    rows = conn.execute("SELECT content FROM chunks ORDER BY embedding <=> %s "
                        "LIMIT %s", (q, k)).fetchall()
    context = "\n\n".join(r[0] for r in rows)
    prompt = f"Answer using the context.\n\nContext:\n{context}\n\nQ: {question}"
    return llm.complete(prompt)

setup()
with open("guide.md", encoding="utf-8") as f:
    ingest("guide", f.read())
print(answer("how do I configure logging?"))

It works. It also disappoints, predictably: on a fifty-question eval set (built per Chapter 11) it scores recall@5 around 0.71 and faithfulness around 0.78 — it misses answers and occasionally drifts past the evidence. Do not skip measuring V1. Those mediocre numbers are the most valuable thing you have right now: they're the baseline that will tell you whether every later change actually helped.
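To make the baseline number reproducible, here is a minimal sketch of recall@k over an eval set. It assumes each eval item pairs a question with the ids of the chunks known to contain its answer, and it uses one common definition: a question counts as a hit if any relevant chunk appears in the top k. The names `EvalItem`, `recall_at_k` and `fake_retrieve` are illustrative and not taken from Chapter 11; in practice `retrieve` would wrap the V1 query and return chunk ids.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalItem:
    question: str
    relevant_ids: frozenset  # ids of chunks known to contain the answer


def recall_at_k(items: Sequence[EvalItem],
                retrieve: Callable[[str, int], Sequence],
                k: int = 5) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    if not items:
        raise ValueError("eval set is empty")
    hits = 0
    for item in items:
        retrieved = set(retrieve(item.question, k)[:k])
        if retrieved & item.relevant_ids:
            hits += 1
    return hits / len(items)


# Toy retriever standing in for the V1 query; it returns fixed ids per question.
_fake_index = {
    "how do I configure logging?": [3, 9, 12],
    "how do I rotate API keys?": [7, 8],
}


def fake_retrieve(question, k):
    return _fake_index.get(question, [])[:k]


eval_set = [
    EvalItem("how do I configure logging?", frozenset({9})),  # chunk 9 is retrieved
    EvalItem("how do I rotate API keys?", frozenset({1})),    # chunk 1 is missed
]
score = recall_at_k(eval_set, fake_retrieve, k=5)
print(f"recall@5 = {score:.2f}")
```

Swapping `fake_retrieve` for a function that runs the real query is the only change needed to score V1 itself, and the same function scores every later version.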

Version 2 — the measured improvement

Now we improve, and the rule is absolute: every change is justified by a number on the eval set, applied one at a time so you know what did what. Looking at V1's failures, four upgrades address what the metrics flagged:

Structural chunking (Chapter 03) — docs have headings; chunk on them so each chunk is a coherent section. Recall climbs.
Hybrid + RRF + reranking (Chapter 06, 07) — docs are full of exact config keys and error codes that pure vectors miss; the keyword side catches them, the reranker orders them. Recall and precision both climb.
Query reformulation (Chapter 10) — users ask follow-ups; reformulate to standalone questions so retrieval works mid-conversation.
Grounded generation with citations and refusal (Chapter 09) — the biggest faithfulness lever: bind every claim to a cited chunk and license "the docs don't cover this." Faithfulness jumps.
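The first upgrade in the list above, structural chunking, can be sketched as follows. This is one possible heading-based splitter for Markdown docs, not the Chapter 03 implementation: `chunk_by_headings` is a hypothetical name, and the choices to keep the heading path as metadata and to ignore `#` lines inside fenced code are assumptions.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading-delimited section."""
    chunks, path, body = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": " > ".join(path), "content": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence          # '#' inside code is not a heading
        m = None if in_fence else HEADING.match(line)
        if m:
            flush()                          # close the previous section
            level = len(m.group(1))
            del path[level - 1:]             # drop same-level and deeper headings
            path.append(m.group(2))
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Logging",
    "Intro text.",
    "## Levels",
    "Use DEBUG sparingly.",
    FENCE,
    "# not a heading, just a comment in code",
    FENCE,
    "## Handlers",
    "Attach a file handler.",
])
sections = chunk_by_headings(doc)
for s in sections:
    print(s["heading"], "|", len(s["content"]), "chars")
```

Carrying the heading path with each chunk is a design choice: it gives the embedder and the reader context ("Logging > Handlers") that a bare paragraph lacks. Very long sections would still need a size cap on top of this.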

# V2: the retrieve step, now hybrid + reranked (replaces V1's answer()).
# reformulate, hybrid_search, rerank and llm are the helpers from the chapters
# cited in the comments; rerank is assumed to wrap `reranker` and to return
# (chunk_text, score) pairs, which the unpacking below relies on.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

GROUNDED = """Answer only from the numbered context. Cite each claim as [n].
If the context lacks the answer, say "The docs don't cover this." Be concise."""

def answer_v2(question, history=None):
    q = reformulate(history, question) if history else question   # Ch10
    candidates = hybrid_search(q, top_n=40)        # Ch06: vector + BM25 + RRF
    top = rerank(q, candidates, final_k=5)         # Ch07: cross-encoder
    context = "\n\n".join(f"[{i+1}] {c}" for i, (c, _) in enumerate(top))
    return llm.complete(f"{GROUNDED}\n\nContext:\n{context}\n\nQ: {q}")
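The fusion step inside `hybrid_search` is not shown above. As a rough sketch of what reciprocal rank fusion does, the function below merges a vector ranking and a keyword ranking by summing 1 / (k + rank) for each chunk. `rrf_fuse` is a hypothetical name, k=60 is a commonly used default and not a value from this series, and the chunk ids are invented.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=40):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank_d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_hits = ["c12", "c7", "c3", "c9"]   # ordered by cosine distance
keyword_hits = ["c7", "c44", "c12"]       # ordered by BM25 score
fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalisation between the vector and keyword sides. Chunks that both retrievers agree on (here `c7` and `c12`) rise above chunks found by only one.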

Re-run the eval after each change and watch the numbers move — recall@5 to about 0.91, faithfulness to about 0.94. Crucially, you now know which change bought which gain, because you measured one at a time. If structural chunking had moved nothing, you'd have dropped it. That discipline — change, measure, keep or discard — is V2's real lesson, more than any individual upgrade.
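The change, measure, keep-or-discard discipline can be written down as a small loop. The sketch below is one hedged way to do it: `run_ablation`, the `min_gain` threshold and the config keys are all hypothetical, and `fake_evaluate` returns made-up scores purely so the loop has something to decide on. A real `evaluate` would rebuild the pipeline from the config and run the eval set.

```python
def run_ablation(baseline_config, changes, evaluate,
                 metric="recall@5", min_gain=0.01):
    """Try changes one at a time; keep each only if the metric improves enough."""
    config = dict(baseline_config)
    best = evaluate(config)
    log = [("baseline", best[metric], "kept")]
    for name, patch in changes:
        trial = {**config, **patch}          # current best plus exactly one change
        scores = evaluate(trial)
        if scores[metric] - best[metric] >= min_gain:
            config, best = trial, scores
            log.append((name, scores[metric], "kept"))
        else:
            log.append((name, scores[metric], "discarded"))
    return config, log


def fake_evaluate(config):
    # Invented numbers for illustration only; not measurements.
    score = 0.71
    if config.get("chunking") == "structural":
        score += 0.05
    if config.get("hybrid"):
        score += 0.10
    return {"recall@5": round(score, 2)}


final_config, log = run_ablation(
    {"chunking": "paragraph"},
    [
        ("structural chunking", {"chunking": "structural"}),
        ("bigger k", {"bigger_k": True}),    # moves nothing, so it is dropped
        ("hybrid search", {"hybrid": True}),
    ],
    fake_evaluate,
)
for name, value, decision in log:
    print(f"{name:20s} {value:.2f} {decision}")
```

The log is the useful artifact: it records which change bought which gain, and which changes were tried and discarded.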

Version 3 — the production build

V2 is a good system. V3 is a shippable one, and the difference is everything covered in the Production chapters: caching, freshness, access control, observability, and injection hygiene. Note that V3 adds essentially nothing to the quality metrics; its job is survivability, not accuracy:

Semantic cache (Chapter 15) — the repetitive majority of doc questions answered free and instant, with invalidation tied to document changes.
Incremental indexing (Chapter 15) — when a doc page changes, re-chunk and re-embed only that page, deleting its old chunks so stale and fresh never coexist.
Access pre-filter (Chapter 16) — if some docs are internal, filter by permission inside the query, and scope the cache by access.
Streaming + observability — stream the answer so perceived latency collapses, and log what was retrieved, cached, and generated so you can diagnose across the whole pipeline (the skill Chapter 18 demanded).
Injection hygiene (Chapter 16) — treat retrieved doc text as data, not instructions.
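Of the items above, incremental indexing is the easiest to get subtly wrong, so here is a minimal sketch of the idea under stated assumptions. It uses an in-memory `ChunkStore` as a stand-in for the `chunks` table, detects changes with a content hash, and replaces a page's chunks wholesale. `ChunkStore` and `reindex_page` are hypothetical names. With Postgres, the replace step would be a DELETE by `doc_id` plus the INSERTs inside one transaction.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for the chunks table, keyed by doc_id."""

    def __init__(self):
        self.chunks = {}   # doc_id -> list of chunk texts
        self.hashes = {}   # doc_id -> content hash at last index time


def reindex_page(store, doc_id, text, chunker):
    """Re-chunk one page only if its content changed. Returns True if reindexed."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if store.hashes.get(doc_id) == digest:
        return False                       # unchanged: skip chunking and embedding
    new_chunks = chunker(text)
    # Replace, never append: old chunks for this page must not survive the update.
    store.chunks[doc_id] = new_chunks
    store.hashes[doc_id] = digest
    return True


def split_paragraphs(text):
    return [p for p in text.split("\n\n") if p.strip()]


store = ChunkStore()
first = reindex_page(store, "guide", "Install.\n\nConfigure.", split_paragraphs)
again = reindex_page(store, "guide", "Install.\n\nConfigure.", split_paragraphs)
edited = reindex_page(store, "guide", "Install.\n\nConfigure logging.", split_paragraphs)
print(first, again, edited, store.chunks["guide"])
```

The hash check is what makes a nightly crawl cheap: unchanged pages cost one digest, and only edited pages pay for re-chunking and re-embedding.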

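# V3: the request path, composing cache + access + the V2 core.
def answer_v3(question, user, history=None, stream=True):
    # Reformulate first: the cache key must be the standalone question,
    # otherwise a follow-up like "and on Windows?" could hit an unrelated entry.
    q = reformulate(history, question) if history else question
    cache_key = (q, frozenset(user.roles))          # scope cache by access!
    if (hit := cache.get(cache_key)) is not None:
        return hit              # full cached text; skips retrieval and generation
    # access filter lives INSIDE retrieval (pre-filter, Ch16)
    candidates = hybrid_search(q, top_n=40, acl=user.roles)
    top = rerank(q, candidates, final_k=5)
    context = "\n\n".join(f"[{i+1}] {c}" for i, (c, _) in enumerate(top))
    prompt = f"{GROUNDED}\n\nContext:\n{context}\n\nQ: {q}"
    if not stream:
        answer = llm.complete(prompt)
        cache.put(cache_key, answer)
        return answer

    def stream_and_cache():
        # Assumes llm.complete(..., stream=True) yields text pieces. Cache the
        # joined text once the stream ends, never the one-shot stream object.
        parts = []
        for piece in llm.complete(prompt, stream=True):
            parts.append(piece)
            yield piece
        cache.put(cache_key, "".join(parts))
    return stream_and_cache()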

Same answers as V2, but now cheap, fast, fresh, secure, and debuggable. That gap between "good in a notebook" and "shippable to users" is the production wave in one diff — and it's the gap most demos never cross.
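The `cache` object in the request path is left abstract. As a rough illustration of what "semantic", "scoped by access" and "invalidation tied to document changes" can mean together, here is a small in-memory sketch. It is not the Chapter 15 implementation: `ScopedSemanticCache`, its method signatures (which take an embedding, not the tuple key used above), the 0.95 threshold and the toy two-dimensional vectors are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ScopedSemanticCache:
    """Answers keyed by question embedding, partitioned by the caller's roles."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []   # (scope, embedding, answer, source_doc_ids)

    def get(self, embedding, roles):
        scope = frozenset(roles)
        best, best_sim = None, self.threshold
        for entry_scope, entry_emb, answer, _ in self.entries:
            if entry_scope != scope:
                continue                      # never serve across access scopes
            sim = cosine(embedding, entry_emb)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best

    def put(self, embedding, roles, answer, doc_ids):
        self.entries.append(
            (frozenset(roles), list(embedding), answer, frozenset(doc_ids)))

    def invalidate_doc(self, doc_id):
        # Drop every cached answer that was built from the changed document.
        self.entries = [e for e in self.entries if doc_id not in e[3]]


cache = ScopedSemanticCache(threshold=0.95)
cache.put([1.0, 0.0], {"staff"}, "cached answer about logging", {"guide"})
same_scope = cache.get([0.99, 0.05], {"staff"})    # near-duplicate question
other_scope = cache.get([0.99, 0.05], {"public"})  # same question, other roles
cache.invalidate_doc("guide")                      # the guide page was edited
after_edit = cache.get([1.0, 0.0], {"staff"})
print(same_scope, other_scope, after_edit)
```

The linear scan is fine for a sketch; a real cache would use the same vector index as retrieval. The threshold is a trade-off to tune against the eval set: set it too low and users get answers to questions they did not ask.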

What you've actually built — across the whole series

Step back. Across twenty chapters you went from "why does my LLM not know my data" to a system that ingests messy real documents, chunks them intelligently, embeds and indexes them, retrieves with hybrid search and reranking, understands and reformulates queries, generates grounded and cited answers that refuse when the evidence is thin, holds a conversation, is measured at every step, and is hardened for cost, latency, freshness, security, and scale. You can also reason about the specialised cases — multi-modal, code, graph — and choose tools without the marketing. That's not a tutorial's worth of knowledge. That's a practitioner's.

My take — and the whole series in one paragraph. The techniques in this manual will date; the models will change, the tools will churn, a 2027 reader will smile at some 2026 detail. What won't date is the loop: build the simplest thing, measure it honestly, improve only what the measurement says is weak, and respect the unglamorous stages where production actually breaks. Every chapter was an instance of that loop. If you take nothing else, take the loop. It's the difference between people who have opinions about RAG and people who build RAG that works — and you're now firmly in the second group.

When this fails

Skipping V1. Teams that start at V2's sophistication never get a baseline and can't tell whether their complexity helped. Always build and measure the naive version first.
Adding V2 upgrades in a batch. Apply four changes at once and you learn nothing about which one mattered — and you'll carry dead weight forever. One change, one measurement.
Shipping V2 as if it were V3. A system with great metrics and no cache, no access control, no observability falls over or leaks on contact with real traffic. Quality and shippability are different axes.
Treating V3 as the end. Production is where measurement starts mattering most — real queries reveal what your eval set didn't. The loop keeps running after launch.

Practice — your own capstone

Build V1 for real

Take a documentation set you care about — a project's docs, an API reference — and build the V1 baseline. Get it answering questions, then build your fifty-question eval set and measure it. You now have a real system and a real number. Everything else is improvement you can prove.

Earn one V2 upgrade

Pick the single upgrade your V1 numbers most call for — usually hybrid search or grounded generation — apply just that one, and re-measure. Feel the difference between "I added a reranker" and "the reranker raised recall@5 from 0.71 to 0.79." The second sentence is engineering.

Find your V3 gap

For your system, list what stands between V2 and shippable: cache, freshness, access, observability, latency. Order them by what your actual deployment needs. That ordered list is your road to production — and you now have a chapter for every item on it.

Takeaways

Build the system three times: naive baseline (a reference point), measured improvement (each change earned by a metric), production hardening (survivability, not accuracy).
Never skip V1 or its measurement — the baseline is what makes every later gain provable.
In V2, change one thing at a time and re-measure. The discipline of the loop matters more than any single upgrade.
V3 adds little to quality and everything to shippability: cache, incremental freshness, access pre-filter, streaming, observability, injection hygiene.
Techniques date; the loop doesn't. Build simple, measure honestly, improve what's weak, respect the unglamorous stages. That's the whole field manual.

That's the manual. You started at "why RAG, and why it didn't die with long context," and you finish with a complete, measured, production-shaped system and the judgement to adapt it to any corpus you meet. The rest is practice — go build something, measure it honestly, and improve it one provable step at a time.

If you want the companion discipline of giving these systems the ability to act rather than only answer, the Agentic SDLC field manual picks up exactly there. And if you ever need to revisit a stage, the chapters are arranged to be re-read out of order — this was always meant to be a manual you keep on the shelf, not a book you read once. Thank you for building alongside it.
