RAG: A Field Manual for Building LLM Systems That Use Your Data

Why RAG, and why it didn't die with long context

Every six months since 2023, someone announces that RAG is dead. The context window got bigger, so now you can just paste everything into the prompt — no retrieval, no vector database, no pipeline. And every six months, the teams actually running these systems in production quietly keep their retrieval layer, because the announcement confuses a demo with an operation. This chapter is the honest accounting: where retrieval genuinely still earns its place, where it doesn't, and how to tell which situation you're in before you build anything.

It is the chapter to read if you are deciding whether the rest of this series is worth your time. There's no pitch here. Plenty of problems do not need RAG, and saying so is the only way to be trusted about the ones that do.

What you'll take away from this chapter

What RAG actually is, stripped of the acronym, in one honest sentence
Why a million-token context window did not make retrieval obsolete — the four reasons that survive scrutiny
The four ways to give a model knowledge it wasn't trained on, and what each one really costs
A decision tree you can run in your head before committing to an architecture
The honest cases where RAG is the wrong answer

RAG, in one sentence

Retrieval-Augmented Generation is the practice of fetching relevant pieces of your own data at the moment a question is asked, and putting those pieces into the model's prompt as evidence before it answers. That's the whole idea. The model stays frozen — you never change its weights. What changes is the context you hand it, assembled fresh for each question from a store of your documents.

The word "augmented" is doing the load-bearing work. You are not teaching the model anything permanently. You are augmenting a single response with material it can read just this once. Next question, fresh material. The model is a brilliant reasoner with amnesia, and RAG is the briefing folder you slide across the table before each meeting.

If you have read the Prompt Engineering series, here is the one-line bridge: RAG is automated, dynamic, evidence-based prompt construction. Everything you know about writing a good prompt still applies — RAG just decides what goes in the prompt programmatically, per query.
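To make "fetch, then put into the prompt" concrete, here is a minimal sketch of per-query prompt construction. Everything in it is an assumption for illustration: the toy `DOCUMENTS` corpus, the function names, and above all the scoring, which is plain word overlap standing in for the embedding-based retrieval a real system would more likely use. The model call itself is left out; the point is only that the prompt is assembled fresh for each question.

```python
import re

# Toy corpus; the ids and contents are invented for illustration.
DOCUMENTS = {
    "billing-duplicates": (
        "Duplicate charges usually happen when a payment retry succeeds "
        "after a timeout. Refunds are issued within five days."
    ),
    "password-reset": "To reset your password, open Settings and choose Reset password.",
    "shipping-times": "Standard shipping takes three to five business days.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Return up to k (doc_id, text) pairs ranked by word overlap with the question."""
    q_words = tokenize(question)
    scored = [
        (len(q_words & tokenize(text)), doc_id, text)
        for doc_id, text in documents.items()
    ]
    # Highest overlap first; ties broken by id so results are deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for score, doc_id, text in scored[:k] if score > 0]


def build_prompt(question: str, evidence: list[tuple[str, str]]) -> str:
    """Assemble a fresh prompt: instructions, retrieved evidence, then the question."""
    evidence_block = "\n".join(f"[{doc_id}] {text}" for doc_id, text in evidence)
    return (
        "Answer using only the evidence below. Cite the ids you used.\n\n"
        f"Evidence:\n{evidence_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = "Why are there duplicate charges on my account?"
    print(build_prompt(question, retrieve(question, DOCUMENTS)))
```

Nothing about the model changes between questions. Only the evidence block does, which is the sense in which the model stays frozen.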

The "long context killed RAG" argument, taken seriously

The argument deserves a fair hearing, because the people making it are not foolish. It goes like this: in 2023 a typical context window was 4,000 to 8,000 tokens, so you had no choice but to retrieve the few relevant snippets and discard the rest. By 2026 windows run from 200,000 tokens to over a million. A million tokens is roughly 750,000 words, or several thick books. So why bother retrieving? Paste the whole knowledge base in and let the model sort it out.

For a small, static knowledge base, this is sometimes exactly right, and this series will tell you so plainly in a moment. But four things break the argument the instant you leave the demo and enter an operation.

The first two, cost and latency, are about economics; the last two, accuracy and freshness at scale, are about quality and reality. Any one of the four is enough to keep retrieval in your stack; in production you usually hit all four.

Cost

This is the one that ends most arguments. You are billed per token on input as well as output. Suppose your knowledge base is 1,000,000 tokens and you stuff all of it into every prompt. At a representative 2026 input price of around $3 per million input tokens, every single question costs about three dollars before the model has written one word of answer. Ten thousand questions a day is thirty thousand dollars a day. Retrieval sends maybe 4,000 tokens of carefully chosen evidence instead, which is roughly a cent per query. The same workload drops from $30,000 to about $120 a day. The gap is not a rounding error; it is the difference between a viable product and a bankrupt one.
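The arithmetic is worth running yourself. The sketch below uses the figures from this section (a 1,000,000-token knowledge base, 4,000 tokens of retrieved evidence, 10,000 queries a day, $3 per million input tokens). The function name is an assumption, and the calculation ignores output tokens and any caching discounts a provider might offer.

```python
def daily_input_cost(tokens_per_query: int, queries_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """Daily spend on input tokens alone, in US dollars."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1_000_000


QUERIES_PER_DAY = 10_000
PRICE = 3.0  # USD per million input tokens; a representative figure, not a quote

paste_everything = daily_input_cost(1_000_000, QUERIES_PER_DAY, PRICE)
retrieval = daily_input_cost(4_000, QUERIES_PER_DAY, PRICE)

print(f"paste everything: ${paste_everything:,.0f} per day")  # $30,000 per day
print(f"retrieval:        ${retrieval:,.0f} per day")         # $120 per day
print(f"ratio:            {paste_everything / retrieval:.0f}x")  # 250x
```

The ratio is simply the knowledge-base size divided by the evidence size, so it grows with the corpus while the retrieval cost per query stays flat.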

Latency

A model has to read the whole prompt before it produces the first token of output. Reading a million tokens takes real wall-clock time — often several seconds of "prefill" before anything appears. Users feel that. A retrieval step that adds 50 milliseconds and then sends a small prompt produces a visibly faster experience than a giant prompt that the model must wade through every time.

Accuracy

This one surprises people. Bigger context does not mean the model uses all of it equally well. A well-documented effect — often called "lost in the middle" — shows that models attend most reliably to the beginning and end of a long prompt and least reliably to the middle. Bury the one relevant paragraph at the 400,000-token mark and the model may simply not notice it, even though it technically "read" it. Sending five tightly relevant chunks beats sending a thousand chunks where the right one is drowned. We return to the mechanics of this in Chapter 09 on generation.

Freshness and scale

Real knowledge bases are not a tidy 750,000 words. They are tens of millions of documents, and they change while you are reading this sentence — a new support ticket, an edited policy, a fresh commit. You cannot paste fifty million documents into any window, and you certainly cannot re-paste them every time one changes. An index, by contrast, accepts an incremental update and is ready for the next query. Scale and freshness together are the reasons RAG is not going anywhere, regardless of how large windows grow.
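A small sketch of what "accepts an incremental update" means in practice. The `TinyIndex` class below is a hypothetical in-memory inverted index (word to document ids), chosen because it fits in a few lines; a production system would more likely use a vector or hybrid index, but the property being shown is the same: changing one document touches only that document's entries.

```python
import re
from collections import defaultdict


class TinyIndex:
    """In-memory inverted index: word -> set of document ids (illustrative only)."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}
        self._postings: defaultdict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a document, or replace it if the id already exists."""
        self.delete(doc_id)  # drop stale entries for this id only
        self._docs[doc_id] = text
        for word in self._words(text):
            self._postings[word].add(doc_id)

    def delete(self, doc_id: str) -> None:
        """Remove a document and its postings; a no-op for unknown ids."""
        old_text = self._docs.pop(doc_id, None)
        if old_text is None:
            return
        for word in self._words(old_text):
            self._postings[word].discard(doc_id)
            if not self._postings[word]:
                del self._postings[word]

    def search(self, query: str) -> list[str]:
        """Return sorted ids of documents that contain every query word."""
        words = self._words(query)
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits)


index = TinyIndex()
index.upsert("refund-policy", "Refunds take ten days")
index.upsert("refund-policy", "Refunds take five days")  # the policy was edited
print(index.search("refunds five"))  # ['refund-policy']
print(index.search("ten"))           # []
```

The edited policy is searchable immediately and the old wording is gone, with no need to rebuild or re-send the rest of the corpus.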

The four ways to give a model knowledge

RAG is one of four genuine options for getting a model to answer using information it was not trained on. Choosing well starts with knowing the whole menu, not just the dish you came in wanting.

Approach	What it does	Best when	Main cost
Plain search	Returns documents; a human reads them. No generation.	Users want sources, not synthesised answers; "find the document" is the job.	No synthesis; user does the reading.
Long context	Paste the whole corpus into every prompt.	Corpus is small (under ~100K tokens), static, and query volume is low.	Per-query token cost and latency; hits a hard ceiling on size.
RAG	Fetch relevant pieces per query; model answers from them.	Corpus is large, changing, or both; answers must cite fresh evidence.	A pipeline to build and operate (this whole series).
Fine-tuning	Adjust the model's weights on your data.	You need a behaviour or style change, not fresh facts.	Training cost; goes stale; can't cite; hard to update.

The most common and most expensive mistake in this whole field is reaching for fine-tuning when the actual need is fresh facts. Fine-tuning teaches a model how to behave — a tone, a format, a domain's idiom. It is poor at teaching what is currently true, because the moment a fact changes, the fine-tune is stale and you cannot surgically edit one fact out of a set of weights. If the question is "answer using our latest documentation," the answer is almost never fine-tuning. It is RAG, possibly with a light fine-tune on top for tone.

My take. These approaches are not rivals so much as layers. A mature system often does plain search for "show me the doc," RAG for "answer my question with citations," and a small fine-tune for house style — all at once. Treating them as an either/or is what leads teams down a single expensive path when a blend was cheaper and better.

A decision tree you can run in your head

Before any architecture work, walk these questions in order. The first "yes" usually points at your answer.

Run top to bottom. Notice how many paths lead to RAG — not because RAG is fashionable, but because "large, changing, high-volume" describes most real knowledge bases. The bottom box is the honest exception, and it is a real one.

What "good" looks like, concretely

Picture a support team with 40,000 help-centre articles that change daily. A customer asks, "Why was I charged twice this month?" A plain-search system returns ten articles and the customer sighs and starts reading. A long-context system is impossible — 40,000 articles will not fit, and they change too often to re-paste. A fine-tuned model answers in a confident voice using last quarter's billing policy, which changed in March, and is now wrong in a way nobody can see.

The RAG system retrieves the three articles actually relevant to double-charging under the current policy, hands them to the model as evidence, and the model answers in two sentences with a link to the exact article. Fresh, grounded, cheap, fast. That is the shape of a problem RAG was built for, and it is an extremely common shape.

When this fails

Every chapter in this series ends with the honest failure modes, because knowing when an approach is wrong is worth more than knowing when it is right. For the decision itself, four ways teams get this wrong:

RAG for a tiny, static FAQ. Forty questions that never change do not need an index, embeddings, or a vector database. Paste them into the system prompt and move on. Building a pipeline here is engineering theatre — real effort spent to look sophisticated, producing a slower and more fragile result than the trivial version.
RAG when the real need is reasoning, not facts. If the failing answers are wrong because the model reasons poorly about data it already has, retrieving more data won't help. You have a prompting or model-choice problem, not a retrieval problem.
Fine-tuning for fresh facts. The expensive classic. Teams fine-tune monthly to "teach the model our docs," then discover the docs changed the day after training. Facts belong in retrieval; only behaviour belongs in weights.
RAG to paper over bad source data. Retrieval faithfully fetches whatever is in your corpus. If the corpus is contradictory, outdated, or wrong, RAG will ground the model in garbage with great confidence. Retrieval amplifies the quality of your data — in both directions. We confront this directly in Chapter 02 on data prep.

The one-line test. If the honest answer to "where does the right information live?" is "in documents that are large, numerous, or frequently changing," you want RAG. If it's "in how the model should phrase things," you want fine-tuning. If it's "the user just needs the document," you want search. Most teams need the first. Some need a blend. Almost nobody needs only fine-tuning.
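The one-line test and the "Best when" column of the comparison table can be folded into a few lines of code. This is a rough sketch, not the chapter's decision tree: the field names, the order of the checks, and the use of 100,000 tokens as a hard cut-off are all assumptions made for illustration, and it returns a single label even though real systems often blend approaches.

```python
from dataclasses import dataclass

LONG_CONTEXT_LIMIT = 100_000  # the table's "under ~100K tokens" guideline


@dataclass
class Problem:
    wants_documents_only: bool    # "find the document" is the whole job
    need_is_style_not_facts: bool  # tone, format, or idiom rather than fresh facts
    corpus_tokens: int
    corpus_changes_often: bool
    high_query_volume: bool


def recommend(problem: Problem) -> str:
    """Map a problem description to one of the four approaches."""
    if problem.wants_documents_only:
        return "plain search"
    if problem.need_is_style_not_facts:
        return "fine-tuning"
    small_static_quiet = (
        problem.corpus_tokens < LONG_CONTEXT_LIMIT
        and not problem.corpus_changes_often
        and not problem.high_query_volume
    )
    return "long context" if small_static_quiet else "RAG"


# A tiny FAQ that never changes (token count is a guess).
faq = Problem(False, False, corpus_tokens=4_000,
              corpus_changes_often=False, high_query_volume=False)
# The support-team example: 40,000 articles, assumed here to be ~500 tokens each.
help_centre = Problem(False, False, corpus_tokens=40_000 * 500,
                      corpus_changes_often=True, high_query_volume=True)

print(recommend(faq))          # long context
print(recommend(help_centre))  # RAG
```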

Practice — before you read the next chapter

Map your own problem

Take a real problem you're considering an LLM for. Answer the four decision-tree questions for it, out loud or on paper. Where does it land? If it lands on "long context is fine," be honest and save yourself this whole series for that problem. If it lands on RAG, write one sentence describing where the right information lives — you'll use it as you read on.

Price it out

Estimate the size of your knowledge base in tokens (a rough rule: one page of text is about 500 tokens). Estimate your daily query volume. Multiply size × volume × an input price of $3 per million tokens to see what the "paste everything" approach would cost per day. Then do the same assuming retrieval sends 4,000 tokens per query. The ratio between the two numbers is, quite literally, the business case for this series.

Find the lost-in-the-middle effect yourself

If you have access to a long-context model, take a 30-page document, hide a single odd sentence (something like "the project codename is Powder Blue") near the exact middle, and ask the model to find it. Then move the same sentence to the first page and ask again. Notice whether reliability changes. You will develop an intuition that no amount of reading about "attention" can give you.
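If you would rather script this experiment than edit a document by hand, a sketch like the following builds the test prompts. The helper names and the filler text are assumptions, and the step of sending each prompt to a model is deliberately left out, since it depends on whichever client you use.

```python
NEEDLE = "The project codename is Powder Blue."
QUESTION = "What is the project codename?"


def insert_needle(paragraphs: list[str], needle: str, position: float) -> list[str]:
    """Return a copy of paragraphs with the needle inserted at a relative position.

    position is a fraction in [0, 1]: 0.0 puts the needle first, 0.5 in the middle.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    index = round(position * len(paragraphs))
    return paragraphs[:index] + [needle] + paragraphs[index:]


def build_needle_prompt(paragraphs: list[str], needle: str,
                        position: float, question: str) -> str:
    """Join the document (with the needle hidden in it) and append the question."""
    document = "\n\n".join(insert_needle(paragraphs, needle, position))
    return f"{document}\n\nQuestion: {question}\nAnswer using only the document above."


# Stand-in filler; in a real test, use the paragraphs of your 30-page document.
filler = [f"Filler paragraph number {i}." for i in range(100)]

middle_prompt = build_needle_prompt(filler, NEEDLE, 0.5, QUESTION)
first_page_prompt = build_needle_prompt(filler, NEEDLE, 0.0, QUESTION)
```

Send both prompts to the same model several times and compare how often the codename comes back; a single trial per position tells you very little.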

Takeaways

RAG is dynamic, evidence-based prompt construction: fetch the relevant pieces of your data per query, hand them to a frozen model as context.
Long context did not kill retrieval. Cost, latency, accuracy ("lost in the middle"), and freshness-at-scale each independently keep retrieval in production stacks.
There are four ways to give a model new knowledge — plain search, long context, RAG, fine-tuning. Know all four before choosing.
Fresh facts belong in retrieval; behaviour and style belong in weights. Fine-tuning for facts is the field's most expensive recurring mistake.
Run the decision tree first. Some problems genuinely don't need RAG — a small, static, low-volume corpus is fine on long context, and saying so is how this series earns your trust on everything that follows.

Next chapter: Foundations — the RAG vocabulary. The terms every later chapter assumes — chunk, embedding, vector index, retriever, reranker, recall, faithfulness — defined once, clearly, so the rest of the series reads at full speed.
