Nineteen chapters of parts. Now we assemble the whole machine. We're going to build one complete system — a documentation question-answering service, the most common RAG application there is — and we'll build it three times: a naive baseline that works and underwhelms, a measured improvement that earns every change with a number, and a production version hardened for the real world. Building it three times is the entire point. It's how you internalise the series' deepest lesson: RAG is not a technique you apply once, it's a loop you run — build, measure, improve — and the discipline of that loop matters more than any single trick inside it.
Three versions, each adding only what the previous version's measurements showed it needed. This is the shape of every real RAG project worth respecting.
The smallest complete system. Resist every urge to be clever; you're building a reference point, and a baseline you over-engineer is no baseline at all. This is the straight line from Chapter 01: embed, retrieve, generate.
# V1: naive documentation QA. The whole thing, end to end.
from pathlib import Path

import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# `llm` is assumed to be defined elsewhere: any client exposing complete(prompt) -> str.
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dimensional embeddings
conn = psycopg.connect("postgresql://localhost/docsqa")

def setup():
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)  # must run after the extension exists, or the type lookup fails
    conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY, doc_id text, content text,
        embedding vector(768))""")
    conn.execute("""CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks
        USING hnsw (embedding vector_cosine_ops)""")
    conn.commit()

def chunk(text, size=800):  # recursive-ish, simplified
    paras, out, buf = text.split("\n\n"), [], ""
    for p in paras:
        if len(buf) + len(p) <= size:
            buf += "\n\n" + p        # paragraph still fits: keep packing
        else:
            out.append(buf.strip())  # flush the full buffer, start a new one
            buf = p
    if buf:
        out.append(buf.strip())
    return [c for c in out if c]     # drop empty chunks

def ingest(doc_id, text):
    for c in chunk(text):
        v = embedder.encode(c, normalize_embeddings=True)
        conn.execute("INSERT INTO chunks (doc_id, content, embedding) "
                     "VALUES (%s,%s,%s)", (doc_id, c, v))
    conn.commit()

def answer(question, k=5):
    q = embedder.encode(question, normalize_embeddings=True)
    # <=> is pgvector's cosine-distance operator: smallest distance first
    rows = conn.execute("SELECT content FROM chunks ORDER BY embedding <=> %s "
                        "LIMIT %s", (q, k)).fetchall()
    context = "\n\n".join(r[0] for r in rows)
    prompt = f"Answer using the context.\n\nContext:\n{context}\n\nQ: {question}"
    return llm.complete(prompt)

setup()
ingest("guide", Path("guide.md").read_text(encoding="utf-8"))
print(answer("how do I configure logging?"))
It works. It also disappoints, predictably: on a fifty-question eval set (built per Chapter 11) it scores recall@5 around 0.71 and faithfulness around 0.78 — it misses answers and occasionally drifts past the evidence. Do not skip measuring V1. Those mediocre numbers are the most valuable thing you have right now: they're the baseline that will tell you whether every later change actually helped.
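One way to put a number on V1 is sketched below. It assumes each eval item pairs a question with the ids of the chunks labelled relevant, and that a hypothetical `retrieve(question, k)` callable returns ranked chunk ids. Neither name comes from this chapter, and Chapter 11 may define recall@k differently (for example as a hit rate), so treat this as one reasonable reading rather than the reference implementation.

```python
from typing import Callable, Sequence

# (question, ids of the chunks a human labelled as relevant) -- assumed eval format
EvalItem = tuple[str, set[str]]


def recall_at_k(eval_set: Sequence[EvalItem],
                retrieve: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    """Mean over questions of (relevant ids found in the top k) / (relevant ids)."""
    if not eval_set:
        raise ValueError("eval set is empty")
    total = 0.0
    for question, relevant in eval_set:
        if not relevant:
            raise ValueError(f"no relevant ids labelled for: {question!r}")
        retrieved = set(retrieve(question, k)[:k])  # guard against over-long results
        total += len(retrieved & relevant) / len(relevant)
    return total / len(eval_set)
```

Passing the retriever in as a callable means the same function can score V1, V2, and V3 without modification, which is what makes the later comparisons fair.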
Now we improve, and the rule is absolute: every change is justified by a number on the eval set, applied one at a time so you know what did what. Looking at V1's failures, four upgrades address what the metrics flagged:
# V2: the retrieve step, now hybrid + reranked (replaces V1's answer()).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

GROUNDED = """Answer only from the numbered context. Cite each claim as [n].
If the context lacks the answer, say "The docs don't cover this." Be concise."""

def answer_v2(question, history=None):
    q = reformulate(history, question) if history else question  # Ch10
    candidates = hybrid_search(q, top_n=40)   # Ch06: vector + BM25 + RRF
    top = rerank(q, candidates, final_k=5)    # Ch07: cross-encoder
    context = "\n\n".join(f"[{i+1}] {c}" for i, (c, _) in enumerate(top))
    return llm.complete(f"{GROUNDED}\n\nContext:\n{context}\n\nQ: {q}")
Re-run the eval after each change and watch the numbers move — recall@5 to about 0.91, faithfulness to about 0.94. Crucially, you now know which change bought which gain, because you measured one at a time. If structural chunking had moved nothing, you'd have dropped it. That discipline — change, measure, keep or discard — is V2's real lesson, more than any individual upgrade.
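The `hybrid_search` call in V2 is built in Chapter 06 and is not repeated here. As a reminder of its fusion step only, here is a minimal sketch of reciprocal rank fusion (RRF). The function name `rrf_fuse` and its parameters are illustrative assumptions, not the chapter's code. The constant 60 is the value commonly used for RRF, not something tuned for this corpus.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]],
             rrf_k: int = 60,
             top_n: int = 40) -> list:
    """Fuse several ranked id lists: each list adds 1 / (rrf_k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_hits = ["a", "b", "c"]  # ids ranked by embedding distance
bm25_hits = ["c", "a", "d"]    # ids ranked by keyword score
fused = rrf_fuse([vector_hits, bm25_hits])
```

Because RRF works on ranks rather than raw scores, the vector distances and BM25 scores never need to be normalised onto a common scale.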
V2 is a good system. V3 is a shippable one, and the difference is everything from the production wave. Note that V3 adds essentially nothing to the quality metrics — its job is survivability, not accuracy:
# V3: the request path, composing cache + access + the V2 core.
def answer_v3(question, user, history=None, stream=True):
    # Reformulate first so the cache key reflects the conversation,
    # not just the raw text of a follow-up like "and for the second one?".
    q = reformulate(history, question) if history else question
    cache_key = (q, frozenset(user.roles))  # scope cache by access!
    if (hit := cache.get(cache_key)) is not None:
        return hit  # skips retrieval and generation entirely
    # access filter lives INSIDE retrieval (pre-filter, Ch16)
    candidates = hybrid_search(q, top_n=40, acl=user.roles)
    top = rerank(q, candidates, final_k=5)
    context = "\n\n".join(f"[{i+1}] {c}" for i, (c, _) in enumerate(top))
    answer = llm.complete(f"{GROUNDED}\n\nContext:\n{context}\n\nQ: {q}",
                          stream=stream)
    # If stream=True yields an iterator, buffer the full text before caching it.
    cache.put(cache_key, answer)
    return answer
Same answers as V2, but now cheap, fast, fresh, secure, and debuggable. That gap between "good in a notebook" and "shippable to users" is the production wave in one diff — and it's the gap most demos never cross.
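The `cache` object used by V3 is not defined in this chapter. A minimal in-memory sketch of something with the same `get` and `put` shape follows, assuming a fixed time-to-live is an acceptable freshness policy for your corpus. `TTLCache`, `make_key`, and the five-minute default are all hypothetical choices for illustration. A real deployment would more likely use a shared store and invalidate on re-ingest.

```python
import time
from typing import Any, Callable, Hashable, Iterable


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:  # stale: evict and report a miss
            del self._store[key]
            return None
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def make_key(question: str, roles: Iterable[str]) -> tuple[str, frozenset]:
    """Key on the question AND the caller's roles, so answers never cross access scopes."""
    return (question.strip(), frozenset(roles))
```

The `frozenset` makes the key independent of role order, and including the roles at all means a user with narrower access can never be served an answer that was generated from documents they cannot see.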
Step back. Across twenty chapters you went from "why does my LLM not know my data" to a system that ingests messy real documents, chunks them intelligently, embeds and indexes them, retrieves with hybrid search and reranking, understands and reformulates queries, generates grounded and cited answers that refuse when the evidence is thin, holds a conversation, is measured at every step, and is hardened for cost, latency, freshness, security, and scale. You can also reason about the specialised cases — multi-modal, code, graph — and choose tools without the marketing. That's not a tutorial's worth of knowledge. That's a practitioner's.
My take — and the whole series in one paragraph. The techniques in this manual will date; the models will change, the tools will churn, a 2027 reader will smile at some 2026 detail. What won't date is the loop: build the simplest thing, measure it honestly, improve only what the measurement says is weak, and respect the unglamorous stages where production actually breaks. Every chapter was an instance of that loop. If you take nothing else, take the loop. It's the difference between people who have opinions about RAG and people who build RAG that works — and you're now firmly in the second group.
Take a documentation set you care about — a project's docs, an API reference — and build the V1 baseline. Get it answering questions, then build your fifty-question eval set and measure it. You now have a real system and a real number. Everything else is improvement you can prove.
Pick the single upgrade your V1 numbers most call for — usually hybrid search or grounded generation — apply just that one, and re-measure. Feel the difference between "I added a reranker" and "the reranker raised recall@5 from 0.71 to 0.79." The second sentence is engineering.
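The keep-or-discard decision can be made mechanical. The sketch below assumes your eval run produces a dict of metric name to score. `evaluate_change` and the `min_gain` threshold of 0.01 are hypothetical, and on a fifty-question set you may well want a larger threshold to stay clear of noise.

```python
def evaluate_change(baseline: dict[str, float],
                    candidate: dict[str, float],
                    min_gain: float = 0.01) -> tuple[bool, dict[str, float]]:
    """Keep a change only if some metric gains at least min_gain and none loses that much."""
    deltas = {name: candidate[name] - baseline[name] for name in baseline}
    improved = any(d >= min_gain for d in deltas.values())
    regressed = any(d <= -min_gain for d in deltas.values())
    return improved and not regressed, deltas


# Illustrative numbers only, echoing the reranker example in the text.
v1 = {"recall@5": 0.71, "faithfulness": 0.78}
with_reranker = {"recall@5": 0.79, "faithfulness": 0.78}
keep, deltas = evaluate_change(v1, with_reranker)
```

Returning the per-metric deltas alongside the verdict keeps the evidence next to the decision, which is the habit the exercise is meant to build.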
For your system, list what stands between V2 and shippable: cache, freshness, access, observability, latency. Order them by what your actual deployment needs. That ordered list is your road to production — and you now have a chapter for every item on it.
That's the manual. You started at "why RAG, and why it didn't die with long context," and you finish with a complete, measured, production-shaped system and the judgement to adapt it to any corpus you meet. The rest is practice — go build something, measure it honestly, and improve it one provable step at a time.
If you want the companion discipline of giving these systems the ability to act rather than only answer, the Agentic SDLC field manual picks up exactly there. And if you ever need to revisit a stage, the chapters are arranged to be re-read out of order — this was always meant to be a manual you keep on the shelf, not a book you read once. Thank you for building alongside it.