Everything until now has been about finding the right chunks. This chapter is where they become an answer. It is tempting to think the hard part is over — you have the evidence, just ask the model to write it up. But generation is where a well-retrieved system still goes wrong in the ways users notice most: the model answers from its own training instead of your evidence, states things the chunks never said, cites nothing so nobody can check it, or confidently makes something up when the honest answer was "the documents don't cover this." Good generation is mostly about discipline — keeping the model tethered to the evidence and teaching it to admit when the evidence isn't there.
A RAG generation prompt has three parts, and each does a distinct job. The system instruction sets the rules: answer from the provided context, cite sources, refuse when the evidence is thin. The context is the retrieved chunks, clearly delimited and labelled so the model can cite them. The question is what the user actually asked. Assemble them carelessly and the model blurs its own knowledge into the answer; assemble them well and it stays on the evidence.
The first instruction everyone writes is "answer only using the provided context." It's necessary, but on its own it leaks. The model has read much of the internet; when your context is silent or ambiguous on a point, it quietly fills the gap from training — and that filler is exactly the kind of plausible, outdated, or wrong material RAG exists to avoid. The fix is to make the instruction specific and behavioural rather than a single polite request:
This is the bridge from the Prompt Engineering series: RAG generation is a prompting problem with a context injected, and the same precision that makes any prompt good makes a grounding prompt reliable.
An answer without citations asks to be trusted blindly, which in any serious setting is unacceptable — the whole point of RAG over a bare LLM is that the answer is checkable. The mechanism is simple: label each chunk with a number in the context, instruct the model to cite the supporting chunk after each claim, and render those citations as links back to the source document using the metadata you captured in Chapter 02. Done well, a user reads "Digital goods can't be refunded once downloaded [2]" and clicks [2] to see the exact policy chunk. Trust, earned rather than demanded.
# A grounded answer with inline citations and licensed refusal.
SYSTEM = """You answer strictly from the numbered context below.
Rules:
- Every claim must be supported by a specific chunk, cited inline as [n].
- If the context does not contain the answer, reply exactly:
"The documentation doesn't cover this." Do not guess.
- Do not use knowledge beyond the context, even if you think you know.
- Be concise. Answer the question asked, nothing more."""
def build_prompt(question, chunks):
"""chunks: list of (text, source_label). Numbered so the model can cite."""
context = "\n\n".join(
f"[{i+1}] {text}\n(source: {src})"
for i, (text, src) in enumerate(chunks))
return f"{SYSTEM}\n\nCONTEXT:\n{context}\n\nQUESTION: {question}"
def answer(question, chunks, llm):
prompt = build_prompt(question, chunks)
return llm.complete(prompt)
retrieved = [
("Refunds are available within 30 days of purchase.", "refund-policy#window"),
("Digital goods are non-refundable once downloaded.", "refund-policy#digital"),
]
print(answer("Can I refund a game I already downloaded?", retrieved, llm))
No. Digital goods are non-refundable once they have been
downloaded [2], even though most purchases are otherwise
refundable within 30 days [1].
Look at what the discipline produced. The model gave the correct, evidence-bound answer ("no, because downloaded"), cited the governing chunk [2], and contextualised it against the general policy [1] — all checkable, none invented. Now test the refusal: ask something the context doesn't cover.
Q: Can I refund a physical boxed copy?
A: The documentation doesn't cover this.
That refusal is a feature, not a failure. The chunks said nothing about physical goods, so the model declined rather than inventing a policy. A system that says "I don't know" when it doesn't know is far more valuable than one that always answers — because users can trust the answers it does give. Licensing that refusal in the system prompt is what made it possible.
The default way to give the model evidence is "stuffing": concatenate all retrieved chunks into one prompt and ask once. For the typical handful of chunks, stuffing is correct and you should not do anything fancier. Two alternatives exist for when stuffing breaks down:
Reach for these only when stuffing genuinely fails. Most systems never need them, and adding them prematurely is the same over-engineering reflex this series keeps warning against.
Here is the honest part most tutorials skip: even a perfectly instructed, well-grounded prompt will sometimes produce claims the context doesn't support. Grounding reduces hallucination; it does not eliminate it. Three mechanisms cause it, and knowing them helps you reduce each.
| Mechanism | What happens | What reduces it |
|---|---|---|
| Lost in the middle | The answer is in a chunk buried mid-context; the model attends to the start and end and misses it. | Fewer, better chunks (reranking, Ch07); put the strongest chunk first or last. |
| Over-extension | The context supports part of a claim; the model extends past the evidence into a plausible guess. | Explicit "cite every claim" rule; lower temperature; concise-answer constraint. |
| Parametric override | The model's training "knows" a confident answer and overrides weak or conflicting context. | Strong "ignore outside knowledge" instruction; flag when context conflicts with itself. |
My take. The "lost in the middle" effect is the quiet link back to Chapter 00: it's the same reason stuffing a million tokens of context underperforms five good chunks. More context is not more grounding. Past a handful of strong chunks, additional context dilutes attention and increases hallucination risk. When an answer is wrong, the instinct to "add more chunks" is usually exactly backwards — fewer, better-ranked chunks is the fix.
Everything in this chapter aims at one quality: faithfulness — the fraction of claims in the answer actually supported by the context, defined back in Chapter 01. You should not be guessing whether your grounding prompt works; you should be measuring faithfulness on your eval set and watching it move when you change the prompt. That measurement is the entire subject of Chapter 11, two chapters from now. For now, hold the principle: every generation rule in this chapter is a hypothesis you can test, not an article of faith.
Write a system prompt with the four behavioural rules above. Then try to break it: ask a question your context doesn't cover and see if it refuses, ask one it partly covers and see if it over-extends. Tighten the prompt against each failure. This adversarial loop is how grounding prompts actually get good.
Put three questions through your system: one fully answerable, one not covered at all, one half-covered. The right behaviours are answer-with-citation, clean refusal, and partial-answer-that-flags-the-gap. If the not-covered question gets a confident answer, your refusal licensing is too weak.
Take a question whose answer is in one chunk. Run generation with that chunk first, then with it buried among nine irrelevant chunks in the middle. Compare the answers. Watching the buried-chunk answer degrade builds the intuition that fewer, better chunks beat more chunks — the lesson that ties this chapter back to the very first one.
Next chapter: Conversational RAG — multi-turn, state, follow-ups. Real users don't ask one question — they have a conversation. "What about the Pro plan?" means nothing without the turn before it. We'll handle history, follow-ups, and the re-retrieval problem that single-shot RAG ignores.
Sign in to join the discussion and post comments.
Sign in