RAG: A Field Manual for Building LLM Systems That Use Your Data

Generation — grounding, citation, refusal

Everything until now has been about finding the right chunks. This chapter is where they become an answer. It is tempting to think the hard part is over — you have the evidence, just ask the model to write it up. But generation is where a well-retrieved system still goes wrong in the ways users notice most: the model answers from its own training instead of your evidence, states things the chunks never said, cites nothing so nobody can check it, or confidently makes something up when the honest answer was "the documents don't cover this." Good generation is mostly about discipline — keeping the model tethered to the evidence and teaching it to admit when the evidence isn't there.

What you'll take away from this chapter

How to structure the generation prompt so the model answers from evidence, not memory
Why "only use the context" is necessary but not sufficient — and what to add
How to make the model cite its sources so every claim is checkable
Refusal discipline — teaching the model to say "I don't know" instead of inventing
Why a perfectly grounded prompt can still hallucinate, and what reduces it

The anatomy of a generation prompt

A RAG generation prompt has three parts, and each does a distinct job. The system instruction sets the rules: answer from the provided context, cite sources, refuse when the evidence is thin. The context is the retrieved chunks, clearly delimited and labelled so the model can cite them. The question is what the user actually asked. Assemble them carelessly and the model blurs its own knowledge into the answer; assemble them well and it stays on the evidence.

Three parts, three jobs. The labelled context is what makes citation possible — the model can only point at "[2]" if the chunks arrived numbered. This structure is the difference between an answer you can trust and one you have to take on faith.

Why "only use the context" isn't enough

The first instruction everyone writes is "answer only using the provided context." It's necessary, but on its own it leaks. The model has read much of the internet; when your context is silent or ambiguous on a point, it quietly fills the gap from training — and that filler is exactly the kind of plausible, outdated, or wrong material RAG exists to avoid. The fix is to make the instruction specific and behavioural rather than a single polite request:

Bind every claim to evidence. Not "use the context" but "every sentence in your answer must be supported by a specific numbered chunk, cited inline."
Give explicit permission to fail. "If the context does not contain the answer, say 'The documentation doesn't cover this' — do not guess." Models over-help by default; you have to license the refusal.
Forbid outside knowledge by name. "Do not use any information beyond the context, even if you believe you know the answer." Naming it works better than implying it.
Constrain the shape. Tell it to answer concisely from the evidence rather than write an essay, which reduces the room for unsupported elaboration.

This is the bridge from the Prompt Engineering series: RAG generation is a prompting problem with a context injected, and the same precision that makes any prompt good makes a grounding prompt reliable.

Citations — making every claim checkable

An answer without citations asks to be trusted blindly, which in any serious setting is unacceptable — the whole point of RAG over a bare LLM is that the answer is checkable. The mechanism is simple: label each chunk with a number in the context, instruct the model to cite the supporting chunk after each claim, and render those citations as links back to the source document using the metadata you captured in Chapter 02. Done well, a user reads "Digital goods can't be refunded once downloaded [2]" and clicks [2] to see the exact policy chunk. Trust, earned rather than demanded.
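The rendering step can be sketched in a few lines. This is a minimal illustration, not the series' own implementation: the function name render_citations, the Markdown-style link format, and the base URL https://docs.example.com/ are all assumptions made for the example, and the source labels stand in for whatever metadata you stored per chunk.

```python
import re

def render_citations(answer: str, sources: list[str],
                     base_url: str = "https://docs.example.com/") -> str:
    """Turn each inline [n] marker into a link to the n-th source label.

    sources is in the same order the chunks were numbered in the prompt.
    base_url is a placeholder (assumption); use your own document URLs.
    """
    def link(match: re.Match) -> str:
        n = int(match.group(1))
        if 1 <= n <= len(sources):
            return f"[{n}]({base_url}{sources[n - 1]})"
        # A marker with no matching chunk is left as-is so it stays visible.
        return match.group(0)

    return re.sub(r"\[(\d+)\]", link, answer)


sources = ["refund-policy#window", "refund-policy#digital"]
print(render_citations("Digital goods can't be refunded once downloaded [2].", sources))
```

Markers that point at no chunk are deliberately left unlinked rather than dropped, so a fabricated citation number remains easy to spot.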

A grounded generation step in code

# A grounded answer with inline citations and licensed refusal.
SYSTEM = """You answer strictly from the numbered context below.

Rules:
- Every claim must be supported by a specific chunk, cited inline as [n].
- If the context does not contain the answer, reply exactly:
  "The documentation doesn't cover this." Do not guess.
- Do not use knowledge beyond the context, even if you think you know.
- Be concise. Answer the question asked, nothing more."""

def build_prompt(question, chunks):
    """chunks: list of (text, source_label). Numbered so the model can cite."""
    context = "\n\n".join(
        f"[{i+1}] {text}\n(source: {src})"
        for i, (text, src) in enumerate(chunks))
    return f"{SYSTEM}\n\nCONTEXT:\n{context}\n\nQUESTION: {question}"

def answer(question, chunks, llm):
    """llm: any client object whose complete(prompt) returns the model's text."""
    prompt = build_prompt(question, chunks)
    return llm.complete(prompt)

retrieved = [
    ("Refunds are available within 30 days of purchase.", "refund-policy#window"),
    ("Digital goods are non-refundable once downloaded.", "refund-policy#digital"),
]
# `llm` is not defined in this snippet: substitute your own model client here.
print(answer("Can I refund a game I already downloaded?", retrieved, llm))

No. Digital goods are non-refundable once they have been
downloaded [2], even though most purchases are otherwise
refundable within 30 days [1].

Look at what the discipline produced. The model gave the correct, evidence-bound answer ("no, because downloaded"), cited the governing chunk [2], and contextualised it against the general policy [1] — all checkable, none invented. Now test the refusal: ask something the context doesn't cover.

Q: Can I refund a physical boxed copy?
A: The documentation doesn't cover this.

That refusal is a feature, not a failure. The chunks said nothing about physical goods, so the model declined rather than inventing a policy. A system that says "I don't know" when it doesn't know is far more valuable than one that always answers — because users can trust the answers it does give. Licensing that refusal in the system prompt is what made it possible.
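Because the system prompt asks for an exact refusal sentence, the application can detect it and treat it differently from a normal answer. The sketch below assumes that exact wording; the function names and the "no_answer" / "answered" status values are hypothetical and only illustrate the idea.

```python
# The exact sentence the system prompt tells the model to reply with.
REFUSAL = "The documentation doesn't cover this."

def is_refusal(answer: str) -> bool:
    """True when the model returned the licensed refusal sentence.

    Surrounding whitespace and stray double quotes are tolerated, since
    models sometimes echo the quotation marks from the instruction.
    """
    return answer.strip().strip('"') == REFUSAL

def handle(answer: str) -> dict:
    """Tag the model output so the UI can show a fallback on refusal.

    The status values are illustrative assumptions, not a standard.
    """
    if is_refusal(answer):
        return {"status": "no_answer", "text": REFUSAL}
    return {"status": "answered", "text": answer.strip()}


print(handle("The documentation doesn't cover this."))
print(handle("No. Digital goods are non-refundable once downloaded [2]."))
```

Counting how often the "no_answer" branch fires is also a cheap signal of coverage gaps in the underlying documents.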

Assembling context — stuffing, and when it breaks

The default way to give the model evidence is "stuffing": concatenate all retrieved chunks into one prompt and ask once. For the typical handful of chunks, stuffing is correct and you should not do anything fancier. Two alternatives exist for when stuffing breaks down:

Map-reduce — when you have too many chunks to fit or want each considered independently: ask the model about each chunk separately (map), then combine the partial answers (reduce). More calls, more cost, used when one prompt can't hold everything.
Refine — answer from the first chunk, then iteratively improve the answer with each subsequent chunk. Sequential and slow; occasionally useful for synthesis tasks. Rarely the right default.

Reach for these only when stuffing genuinely fails. Most systems never need them, and adding them prematurely is the same over-engineering reflex this series keeps warning against.
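For reference, map-reduce can be sketched as follows. This is one possible shape under stated assumptions: complete is any function that takes a prompt string and returns the model's text, and the prompt wording and the NO_INFO sentinel are invented for the example.

```python
from typing import Callable

# Sentinel the map step asks the model to return for an irrelevant chunk
# (an assumption of this sketch, not a standard).
NO_INFO = "NO RELEVANT INFORMATION"

def map_reduce_answer(question: str, chunks: list[str],
                      complete: Callable[[str], str]) -> str:
    """Ask about each chunk separately (map), then merge the notes (reduce)."""
    partials = []
    for i, text in enumerate(chunks, start=1):
        # Map: one call per chunk, each seeing only its own passage.
        prompt = (
            "Using only this passage, note anything that helps answer the "
            f"question. If nothing does, reply exactly: {NO_INFO}\n\n"
            f"PASSAGE [{i}]: {text}\n\nQUESTION: {question}"
        )
        note = complete(prompt).strip()
        if note != NO_INFO:
            partials.append(f"[{i}] {note}")

    if not partials:
        # No chunk contributed anything: return the licensed refusal.
        return "The documentation doesn't cover this."

    # Reduce: a single call that combines the surviving notes.
    reduce_prompt = (
        "Combine the numbered notes below into one concise answer, "
        "keeping the [n] citations.\n\nNOTES:\n" + "\n".join(partials)
        + f"\n\nQUESTION: {question}"
    )
    return complete(reduce_prompt)
```

The cost is visible in the structure: one model call per chunk plus one to combine, which is why it is worth reaching for only when a single prompt cannot hold the evidence.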

Why a grounded prompt still hallucinates

Here is the honest part most tutorials skip: even a perfectly instructed, well-grounded prompt will sometimes produce claims the context doesn't support. Grounding reduces hallucination; it does not eliminate it. Three mechanisms cause it, and knowing them helps you reduce each.

Mechanism	What happens	What reduces it
Lost in the middle	The answer is in a chunk buried mid-context; the model attends to the start and end and misses it.	Fewer, better chunks (reranking, Ch07); put the strongest chunk first or last.
Over-extension	The context supports part of a claim; the model extends past the evidence into a plausible guess.	Explicit "cite every claim" rule; lower temperature; concise-answer constraint.
Parametric override	The model's training "knows" a confident answer and overrides weak or conflicting context.	Strong "ignore outside knowledge" instruction; flag when context conflicts with itself.

My take. The "lost in the middle" effect is the quiet link back to Chapter 00: it's the same reason stuffing a million tokens of context underperforms five good chunks. More context is not more grounding. Past a handful of strong chunks, additional context tends to dilute attention and raise hallucination risk. When an answer is wrong, the instinct to "add more chunks" is usually exactly backwards: sending fewer, better-ranked chunks is the fix.
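One way to act on the "strongest chunk first or last" advice is to reorder a ranked list so the best chunks sit at the edges of the context and the weakest in the middle. The helper below is a sketch under that assumption; the name order_for_edges is hypothetical, and it expects the input already sorted best-first (for example by a reranker).

```python
def order_for_edges(ranked: list[str]) -> list[str]:
    """Place the strongest chunks at both ends, the weakest in the middle.

    ranked must be sorted best-first. Even-ranked items fill the front in
    order; odd-ranked items fill the back in reverse, so rank 1 ends up
    first and rank 2 ends up last.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(order_for_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Number the chunks for citation after reordering, not before, so the [n] labels the model sees match the order in the prompt. Truncating the ranked list to a small top-k before calling this keeps the "fewer, better chunks" rule in force.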

Faithfulness is measurable

Everything in this chapter aims at one quality: faithfulness — the fraction of claims in the answer actually supported by the context, defined back in Chapter 01. You should not be guessing whether your grounding prompt works; you should be measuring faithfulness on your eval set and watching it move when you change the prompt. That measurement is the entire subject of Chapter 11, two chapters from now. For now, hold the principle: every generation rule in this chapter is a hypothesis you can test, not an article of faith.
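As a preview of what that measurement looks like, the definition translates directly into a ratio. The sketch below assumes the answer has already been split into claims and that you supply a judge, is_supported, which decides whether the context backs a claim; both the claim splitting and the judge are the hard parts and are left as assumptions here. The substring judge in the usage lines is a deliberately naive stand-in.

```python
from typing import Callable

def faithfulness(claims: list[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of claims that the judge says the context supports."""
    if not claims:
        # Convention chosen for this sketch (an assumption): an answer that
        # makes no claims asserts nothing unsupported, so it scores 1.0.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


context = ("Refunds are available within 30 days of purchase. "
           "Digital goods are non-refundable once downloaded.")

# Naive stand-in judge: exact substring match, case-insensitive.
naive_judge = lambda claim, ctx: claim.lower() in ctx.lower()

claims = ["digital goods are non-refundable once downloaded",
          "physical goods ship free"]
print(faithfulness(claims, context, naive_judge))  # 0.5
```

Tracking this number on a fixed eval set before and after each prompt change is what turns the rules in this chapter into tested hypotheses.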

When this fails

No licensed refusal. If you never tell the model it's allowed to say "I don't know," it will answer everything, inventing policy when the context is silent. The permission to fail is the single most important line in a grounding prompt.
Unlabelled context. If chunks aren't numbered or delimited, the model can't cite them, and you lose checkability. Always structure the context so each chunk is individually addressable.
Citations the model fabricates. A model can cite [3] for a claim that chunk [3] doesn't actually support: a citation that looks rigorous and isn't. Spot-check that cited chunks actually contain the claim, and include citation accuracy in your eval.
Too much context. Stuffing twenty marginal chunks to "be safe" triggers lost-in-the-middle and raises hallucination. Trust your reranker (Ch07) and send few, strong chunks.
High temperature on a grounding task. Creative-writing settings invite the model to embellish past the evidence. Grounded generation wants a low temperature — you want fidelity, not flair.
Conflicting chunks, unflagged. When two retrieved chunks disagree (an outdated policy and a current one), a model with no instruction about conflicts tends to pick one silently. Instruct it to surface the conflict rather than resolve it invisibly, and fix the stale source upstream.
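Some of these failures can be caught mechanically before a human ever spot-checks. The sketch below covers only the structural cases: citation numbers that point at no chunk, and sentences that carry no citation at all. It does not verify that a cited chunk supports the claim, which still needs a spot-check or an eval. The function name and the naive sentence splitter are assumptions of this example.

```python
import re

def check_citations(answer: str, n_chunks: int) -> dict:
    """Flag structural citation problems in a generated answer.

    Returns citation numbers outside 1..n_chunks and sentences with no
    [n] marker. Sentence splitting is naive (split after . ! ?), so it
    expects the marker to sit before the closing punctuation.
    """
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    out_of_range = sorted(n for n in cited if not 1 <= n <= n_chunks)

    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", answer.strip())
                 if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[\d+\]", s)]

    return {"out_of_range": out_of_range, "uncited_sentences": uncited}


report = check_citations(
    "Digital goods are non-refundable once downloaded [2]. "
    "Refunds take a week. Shipping is free [5].",
    n_chunks=2,
)
print(report)
```

A short lead-in such as "No." will also be reported as uncited; whether to tolerate that is a policy choice for your eval.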

Practice — before you read the next chapter

Write and break your grounding prompt

Write a system prompt with the four behavioural rules above. Then try to break it: ask a question your context doesn't cover and see if it refuses, ask one it partly covers and see if it over-extends. Tighten the prompt against each failure. This adversarial loop is how grounding prompts actually get good.

Test the refusal explicitly

Put three questions through your system: one fully answerable, one not covered at all, one half-covered. The right behaviours are answer-with-citation, clean refusal, and partial-answer-that-flags-the-gap. If the not-covered question gets a confident answer, your refusal licensing is too weak.

Probe lost-in-the-middle

Take a question whose answer is in one chunk. Run generation with that chunk first, then with it buried among nine irrelevant chunks in the middle. Compare the answers. Watching the buried-chunk answer degrade builds the intuition that fewer, better chunks beat more chunks — the lesson that ties this chapter back to the very first one.
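Setting up the two runs for this probe takes only a small helper. The sketch below builds both chunk orders from one answer-bearing chunk and a list of distractors; the name probe_orders is hypothetical, and feeding each order through your own generation step is left to you.

```python
def probe_orders(answer_chunk: str, distractors: list[str]) -> tuple[list[str], list[str]]:
    """Return two chunk orders: answer first, and answer buried mid-list."""
    first = [answer_chunk] + distractors
    mid = len(distractors) // 2
    buried = distractors[:mid] + [answer_chunk] + distractors[mid:]
    return first, buried


distractors = [f"Irrelevant passage {i}." for i in range(1, 10)]
first, buried = probe_orders("Digital goods are non-refundable once downloaded.", distractors)
print(first[0])
print(buried.index("Digital goods are non-refundable once downloaded."))  # 4
```

Run the same question against both orders with identical settings so that chunk position is the only variable.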

Takeaways

A generation prompt has three parts — system rules, labelled context, question. The labelling is what makes citation possible.
"Only use the context" leaks. Add behavioural rules: bind every claim to a cited chunk, license refusal explicitly, forbid outside knowledge by name, constrain length.
Citations turn an answer you must trust into one you can check. They are the core value of RAG over a bare model.
A licensed "I don't know" is a feature. A system that refuses when the evidence is thin earns trust in the answers it does give.
Grounding reduces hallucination but doesn't eliminate it — lost-in-the-middle, over-extension, and parametric override remain. Fewer, better-ranked chunks beat more chunks.
Every rule here is a testable hypothesis. Measure faithfulness; don't assume it.

Next chapter: Conversational RAG — multi-turn, state, follow-ups. Real users don't ask one question — they have a conversation. "What about the Pro plan?" means nothing without the turn before it. We'll handle history, follow-ups, and the re-retrieval problem that single-shot RAG ignores.
