Every six months since 2023, someone announces that RAG is dead. The context window got bigger, so now you can just paste everything into the prompt — no retrieval, no vector database, no pipeline. And every six months, the teams actually running these systems in production quietly keep their retrieval layer, because the announcement confuses a demo with an operation. This chapter is the honest accounting: where retrieval genuinely still earns its place, where it doesn't, and how to tell which situation you're in before you build anything.
It is the chapter to read if you are deciding whether the rest of this series is worth your time. There's no pitch here. Plenty of problems do not need RAG, and saying so is the only way to be trusted about the ones that do.
Retrieval-Augmented Generation is the practice of fetching relevant pieces of your own data at the moment a question is asked, and putting those pieces into the model's prompt as evidence before it answers. That's the whole idea. The model stays frozen — you never change its weights. What changes is the context you hand it, assembled fresh for each question from a store of your documents.
The word "augmented" is doing the load-bearing work. You are not teaching the model anything permanently. You are augmenting a single response with material it can read just this once. Next question, fresh material. The model is a brilliant reasoner with amnesia, and RAG is the briefing folder you slide across the table before each meeting.
If you have read the Prompt Engineering series, here is the one-line bridge: RAG is automated, dynamic, evidence-based prompt construction. Everything you know about writing a good prompt still applies — RAG just decides what goes in the prompt programmatically, per query.
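To make "programmatic prompt construction" concrete, here is a minimal sketch of the loop described above: pick the most relevant pieces for a question, then assemble them into a prompt as evidence. Everything in it is an illustrative assumption: the function names `tokenize`, `retrieve` and `build_prompt`, the sample documents, and above all the scoring, which is plain word overlap standing in for whatever a real retriever would use. No model is called; the point is only the shape of the prompt that retrieval produces.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set. A crude stand-in for a real relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, evidence: list[str]) -> str:
    """Assemble a fresh prompt for this one question from the retrieved evidence."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(evidence, start=1))
    return (
        "Answer using only the evidence below. Cite sources by number.\n\n"
        f"Evidence:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical three-document "knowledge base".
DOCS = [
    "Refunds for duplicate charges are issued within five business days.",
    "Our office is closed on public holidays.",
    "A duplicate charge can occur when a payment is retried after a timeout.",
]

question = "Why was I charged a duplicate payment?"
prompt = build_prompt(question, retrieve(question, DOCS, k=2))
print(prompt)
```

The model itself never changes here. Only the string handed to it does, and that string is rebuilt for every question.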
The argument deserves a fair hearing, because the people making it are not foolish. It goes like this: in 2023 a context window was 4,000 to 8,000 tokens, so you had no choice but to retrieve the few relevant snippets and discard the rest. By 2026 windows run from 200,000 tokens to over a million. A million tokens is roughly 750,000 words — several thick books. So why bother retrieving? Paste the whole knowledge base in and let the model sort it out.
For a small, static knowledge base, this is sometimes exactly right, and this series will tell you so plainly in a moment. But four things break the argument the instant you leave the demo and enter an operation.
Cost is the one that ends most arguments. You are billed per token on input as well as output. Suppose your knowledge base is 1,000,000 tokens and you stuff all of it into every prompt. At a representative 2026 input price of around $3 per million input tokens, every single question costs about three dollars before the model has written one word of answer. Ten thousand questions a day is thirty thousand dollars a day. Retrieval sends maybe 4,000 tokens of carefully chosen evidence instead, roughly a cent per query (4,000 × $3 ÷ 1,000,000 = $0.012). The same workload drops from $30,000 to about $120 a day. The gap is not a rounding error; it is the difference between a viable product and a bankrupt one.
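The arithmetic in this paragraph is easy to rerun with your own numbers. The sketch below uses the chapter's figures as defaults (a 1,000,000-token knowledge base, 4,000 tokens of retrieved evidence, 10,000 questions a day, $3 per million input tokens). The function name `daily_input_cost` is made up for illustration, and the calculation counts input tokens only, ignoring output tokens and any cost of running the retrieval layer itself.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD; the chapter's representative figure


def daily_input_cost(
    tokens_per_query: int,
    queries_per_day: int,
    price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS,
) -> float:
    """Daily spend on input tokens alone, in USD."""
    return tokens_per_query / 1_000_000 * price_per_million * queries_per_day


paste_everything = daily_input_cost(tokens_per_query=1_000_000, queries_per_day=10_000)
retrieval = daily_input_cost(tokens_per_query=4_000, queries_per_day=10_000)

print(f"paste everything: ${paste_everything:,.0f} per day")
print(f"retrieval:        ${retrieval:,.0f} per day")
print(f"ratio:            {paste_everything / retrieval:.0f}x")
```

Note that the ratio depends only on the two prompt sizes. Price and query volume cancel out, so the gap holds even if per-token prices fall.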
A model has to read the whole prompt before it produces the first token of output. Reading a million tokens takes real wall-clock time — often several seconds of "prefill" before anything appears. Users feel that. A retrieval step that adds 50 milliseconds and then sends a small prompt produces a visibly faster experience than a giant prompt that the model must wade through every time.
This one surprises people. Bigger context does not mean the model uses all of it equally well. A well-documented effect — often called "lost in the middle" — shows that models attend most reliably to the beginning and end of a long prompt and least reliably to the middle. Bury the one relevant paragraph at the 400,000-token mark and the model may simply not notice it, even though it technically "read" it. Sending five tightly relevant chunks beats sending a thousand chunks where the right one is drowned. We return to the mechanics of this in Chapter 09 on generation.
Real knowledge bases are not a tidy 750,000 words. They can run to tens of millions of documents, and they change while you are reading this sentence: a new support ticket, an edited policy, a fresh commit. You cannot paste fifty million documents into any window, and you certainly cannot re-paste them every time one changes. An index, by contrast, accepts an incremental update and is ready for the next query. Scale and freshness together are the reasons RAG is not going anywhere, regardless of how large windows grow.
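What "accepts an incremental update" means in practice can be shown with a toy. The class below is a deliberately tiny keyword index, not a vector index and not any real library's API; `TinyIndex`, `upsert`, `delete` and `search` are names invented for this sketch. The property to notice is that adding or editing one document touches only that document's entries, and the very next query sees the change.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: word -> ids of the documents containing it."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}
        self._postings: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def delete(self, doc_id: str) -> None:
        """Remove one document's entries; a no-op if the id is unknown."""
        old = self._docs.pop(doc_id, None)
        if old is None:
            return
        for word in self._words(old):
            self._postings[word].discard(doc_id)

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a new document or replace an edited one. Nothing else is rebuilt."""
        self.delete(doc_id)
        self._docs[doc_id] = text
        for word in self._words(text):
            self._postings[word].add(doc_id)

    def search(self, query: str) -> list[str]:
        """Ids of documents containing every query word, sorted for stable output."""
        words = self._words(query)
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits)


index = TinyIndex()
index.upsert("policy-1", "Refunds take ten days")
index.upsert("ticket-9", "Customer asked about refunds")
index.upsert("policy-1", "Refunds take five days")  # the policy was edited

print(index.search("refunds"))    # both documents mention refunds
print(index.search("five days"))  # only the edited policy matches
print(index.search("ten"))        # the old wording is gone
```

Production indexes are far more sophisticated, but they share this shape: updates are per document, whereas a pasted corpus has to be re-sent in full on every query.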
RAG is one of four genuine options for getting a model to answer using information it was not trained on. Choosing well starts with knowing the whole menu, not just the dish you came in wanting.
| Approach | What it does | Best when | Main cost |
|---|---|---|---|
| Plain search | Returns documents; a human reads them. No generation. | Users want sources, not synthesised answers; "find the document" is the job. | No synthesis; user does the reading. |
| Long context | Paste the whole corpus into every prompt. | Corpus is small (under ~100K tokens), static, and query volume is low. | Per-query token cost and latency; hits a hard ceiling on size. |
| RAG | Fetch relevant pieces per query; model answers from them. | Corpus is large, changing, or both; answers must cite fresh evidence. | A pipeline to build and operate (this whole series). |
| Fine-tuning | Adjust the model's weights on your data. | You need a behaviour or style change, not fresh facts. | Training cost; goes stale; can't cite; hard to update. |
The most common and most expensive mistake in this whole field is reaching for fine-tuning when the actual need is fresh facts. Fine-tuning teaches a model how to behave — a tone, a format, a domain's idiom. It is poor at teaching what is currently true, because the moment a fact changes, the fine-tune is stale and you cannot surgically edit one fact out of a set of weights. If the question is "answer using our latest documentation," the answer is almost never fine-tuning. It is RAG, possibly with a light fine-tune on top for tone.
My take. These approaches are not rivals so much as layers. A mature system often does plain search for "show me the doc," RAG for "answer my question with citations," and a small fine-tune for house style — all at once. Treating them as an either/or is what leads teams down a single expensive path when a blend was cheaper and better.
Before any architecture work, walk these questions in order. The first "yes" usually points at your answer.
Picture a support team with 40,000 help-centre articles that change daily. A customer asks, "Why was I charged twice this month?" A plain-search system returns ten articles and the customer sighs and starts reading. A long-context system is impossible — 40,000 articles will not fit, and they change too often to re-paste. A fine-tuned model answers in a confident voice using last quarter's billing policy, which changed in March, and is now wrong in a way nobody can see.
The RAG system retrieves the three articles actually relevant to double-charging under the current policy, hands them to the model as evidence, and the model answers in two sentences with a link to the exact article. Fresh, grounded, cheap, fast. That is the shape of a problem RAG was built for, and it is an extremely common shape.
Every chapter in this series ends with the honest failure modes, because knowing when an approach is wrong is worth more than knowing when it is right. For the decision itself, four ways teams get this wrong:
The one-line test. If the honest answer to "where does the right information live?" is "in documents that are large, numerous, or frequently changing," you want RAG. If it's "in how the model should phrase things," you want fine-tuning. If it's "the user just needs the document," you want search. Most teams need the first. Some need a blend. Almost nobody needs only fine-tuning.
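For readers who think better in code, the one-line test and the comparison table can be folded into a small helper. This is one possible reading rather than a definitive rule: the `Problem` fields, the order of the checks, and the use of the table's "under ~100K tokens" figure as a hard threshold are all assumptions made for the sketch, and it returns a single label even though a real system may blend several approaches.

```python
from dataclasses import dataclass

LONG_CONTEXT_CEILING = 100_000  # tokens; the table's "under ~100K tokens" figure


@dataclass
class Problem:
    needs_synthesised_answer: bool  # False: the user just needs the document
    needs_fresh_facts: bool         # False: only phrasing or style must change
    corpus_tokens: int
    corpus_changes_often: bool
    high_query_volume: bool


def suggest(p: Problem) -> str:
    """Map a problem description to one of the four approaches in the table."""
    if not p.needs_synthesised_answer:
        return "plain search"
    if not p.needs_fresh_facts:
        return "fine-tuning"
    small_and_static = (
        p.corpus_tokens < LONG_CONTEXT_CEILING and not p.corpus_changes_often
    )
    if small_and_static and not p.high_query_volume:
        return "long context"
    return "RAG"


# Hypothetical example: a large help centre that changes daily.
help_centre = Problem(
    needs_synthesised_answer=True,
    needs_fresh_facts=True,
    corpus_tokens=20_000_000,
    corpus_changes_often=True,
    high_query_volume=True,
)
print(suggest(help_centre))
```

The value of writing it down this way is less the function than the fields: if you cannot fill them in for your problem, you are not yet ready to choose an architecture.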
Take a real problem you're considering an LLM for. Answer the four decision-tree questions for it, out loud or on paper. Where does it land? If it lands on "long context is fine," be honest and save yourself this whole series for that problem. If it lands on RAG, write one sentence describing where the right information lives — you'll use it as you read on.
Estimate the size of your knowledge base in tokens (a rough rule: one page of text is about 500 tokens). Estimate your daily query volume. Multiply size × volume × $3 ÷ 1,000,000 (an input price of $3 per million tokens) to see what the "paste everything" approach would cost per day. Then do the same assuming retrieval sends 4,000 tokens per query. The ratio between the two numbers is, quite literally, the business case for this series.
If you have access to a long-context model, take a 30-page document, hide a single odd sentence (something like "the project codename is Powder Blue") near the exact middle, and ask the model to find it. Then move the same sentence to the first page and ask again. Notice whether reliability changes. You will develop an intuition that no amount of reading about "attention" can give you.
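If you would rather script this experiment than edit a document by hand, a small helper can place the odd sentence at any relative position. The sketch below only builds the prompts; it does not call a model, since which model and client you use is up to you. The name `build_haystack` and the synthetic filler sentences are assumptions for illustration, and a real 30-page document split into sentences would be a better haystack than repeated filler.

```python
def build_haystack(filler_sentences: list[str], needle: str, position: float) -> str:
    """Insert `needle` at a relative position: 0.0 is the start, 0.5 the middle, 1.0 the end."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = round(position * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


NEEDLE = "The project codename is Powder Blue."
QUESTION = "What is the project codename?"

# Synthetic filler; swap in the sentences of a real document for a fairer test.
filler = [f"Filler sentence number {i} says nothing of interest." for i in range(1000)]

prompts = {
    label: build_haystack(filler, NEEDLE, position) + "\n\n" + QUESTION
    for label, position in {"start": 0.0, "middle": 0.5}.items()
}

for label, prompt in prompts.items():
    print(label, "-> needle at character", prompt.index(NEEDLE), "of", len(prompt))
```

Send each prompt to the model several times and compare how often the codename comes back. A single run per position tells you very little.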
Next chapter: Foundations — the RAG vocabulary. The terms every later chapter assumes — chunk, embedding, vector index, retriever, reranker, recall, faithfulness — defined once, clearly, so the rest of the series reads at full speed.