Every example so far has assumed a single, self-contained question. Real users don't work that way. They ask "what's the refund window?", read the answer, then type "what about digital goods?" — and that second message is meaningless on its own. There's no subject, no verb that matters, nothing for a retriever to match. A human reads it effortlessly because they remember the previous turn. Your retrieval pipeline has no such memory unless you build it one. This chapter is about making RAG work in conversation, where most of the difficulty lives in three words: "what about it?"
The retriever from earlier chapters embeds the query and finds similar chunks. Feed it "what about digital goods?" and it embeds those four words — which are about nothing in particular — and retrieves a scattering of chunks that happen to mention digital goods in any context. The actual intent, "what is the refund window for digital goods," lived in the previous turn, and the retriever never saw it. The answer the user gets is confidently irrelevant.
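To see the failure concretely, here is a toy sketch. It uses plain word overlap as a stand-in for embedding similarity, which is an assumption for illustration only; the retriever from earlier chapters embeds the query instead. The corpus, the stopword list, and the helper names `tokens`, `score`, and `rank` are all invented for this example.

```python
import re

# Toy corpus; these chunk texts are invented purely for illustration.
CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "The refund window for digital goods is 14 days.",
    "Digital goods are delivered by email right after checkout.",
    "Gift cards for digital goods never expire.",
]

STOPWORDS = {"what", "is", "the", "for", "about", "of", "are", "by", "after", "within"}


def tokens(text):
    """Lowercased word set with a handful of stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def score(query, chunk):
    """Shared-word count: a crude stand-in for embedding similarity."""
    return len(tokens(query) & tokens(chunk))


def rank(query):
    """Chunks sorted from highest to lowest overlap with the query."""
    return sorted(CHUNKS, key=lambda c: score(query, c), reverse=True)


FOLLOW_UP = "what about digital goods?"
STANDALONE = "What is the refund window for digital goods?"

for query in (FOLLOW_UP, STANDALONE):
    print(query)
    for chunk in rank(query):
        print(f"  {score(query, chunk)}  {chunk}")
```

In this toy setup the bare follow-up gives the same score of 2 to every chunk that mentions digital goods, so nothing separates the refund chunk from the delivery or gift-card chunks. The standalone question scores the refund-window chunk at 4 and ranks it first on its own merits.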
This is the single biggest gap between a RAG demo and a RAG product. Demos ask one clean question. Products hold conversations. And the fix is not to make retrieval smarter; it is to repair the query before retrieval, using the conversation history. This is the query-understanding move from Chapter 08, now driven by history rather than by the query alone.
The fix is a step that runs before retrieval on every turn: take the conversation history and the new message, and ask a fast LLM to rewrite them into a single standalone question that carries all the context it needs. "What about digital goods?" plus the prior turn about refund windows becomes "What is the refund window for digital goods?" — a question that retrieves perfectly on its own. The retriever, the reranker, the generator all stay exactly as they were; you've only repaired the query using memory.
```python
# Reformulate a follow-up into a standalone question, then run normal RAG.
# llm, retrieve, and generate are the components built in earlier chapters.

REFORMULATE = """Given the conversation so far and the user's latest message,
rewrite the latest message as a standalone question that needs no prior
context to understand. If it is already standalone, return it unchanged.
Output only the rewritten question.
Conversation:
{history}
Latest message: {message}"""


def to_standalone(history, message, llm):
    """history: list of (role, text). Returns a self-contained question."""
    convo = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = REFORMULATE.format(history=convo, message=message)
    return llm.complete(prompt).strip()


def conversational_rag(history, message, llm, retrieve, generate):
    # 1. repair the query using history
    standalone = to_standalone(history, message, llm)
    # 2. the rest is ordinary single-shot RAG on the repaired query
    chunks = retrieve(standalone)          # hybrid + rerank, Ch06–07
    answer = generate(standalone, chunks)  # grounded generation, Ch09
    return standalone, answer


history = [
    ("User", "what's the refund window?"),
    ("Bot", "Refunds are available within 30 days of purchase. [1]"),
]
standalone, answer = conversational_rag(
    history, "what about digital goods?", llm, retrieve, generate)
print("reformulated:", standalone)
```

```
reformulated: What is the refund window for digital goods?
```
That one reformulation step is the difference between a chatbot that loses the thread on the second message and one that holds a coherent conversation. Notice it reuses the exact retrieval and generation pipeline from the previous chapters — conversation didn't require rebuilding anything, only adding a query-repair step in front.
Conversations grow without bound, and you cannot feed an ever-lengthening transcript into every reformulation and every generation forever — it gets slow, expensive, and eventually overflows the context window. So you manage what you keep. Three approaches, in increasing sophistication:
| Strategy | How it works | Best when |
|---|---|---|
| Windowing | Keep only the last N turns. | Most conversations — recent turns carry the live context. Simple and robust. |
| Summarisation | Compress older turns into a running summary, keep recent turns verbatim. | Long sessions where early context still matters but doesn't need full detail. |
| Key-fact extraction | Pull durable facts ("user is on the Pro plan, EU region") into a small state object. | Task-oriented assistants where specific user attributes recur across many turns. |
Start with windowing — the last few turns are almost always enough for reformulation, because follow-ups refer to recent context, not the conversation's distant past. Reach for summarisation only when you observe the system losing context that scrolled out of the window, and key-fact extraction only when durable user attributes genuinely drive answers. This is the same "start simple, add on measured need" discipline as everywhere else in the series.
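A minimal sketch of the first two strategies, assuming history is the same list of `(role, text)` pairs that `to_standalone` takes. The class name `WindowedHistory`, the default of six turns, and the `summarise` callback are hypothetical; in practice the callback would wrap an LLM call, and it is left to the caller here so the windowing logic stays testable on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (role, text), the shape to_standalone expects


@dataclass
class WindowedHistory:
    """Keeps the last max_turns turns verbatim.

    If a summarise callback is supplied, turns that fall out of the window
    are folded into a running summary instead of being discarded.
    """
    max_turns: int = 6  # illustrative default; set it from your own conversations
    summarise: Optional[Callable[[str, List[Turn]], str]] = None
    summary: str = ""
    turns: List[Turn] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        overflow = len(self.turns) - self.max_turns
        if overflow > 0:
            evicted = self.turns[:overflow]
            self.turns = self.turns[overflow:]
            if self.summarise is not None:
                # Hypothetical callback: (old_summary, evicted_turns) -> new_summary
                self.summary = self.summarise(self.summary, evicted)

    def as_history(self) -> List[Turn]:
        """Summary (if any) as a leading pseudo-turn, then the recent turns."""
        prefix = [("Summary", self.summary)] if self.summary else []
        return prefix + self.turns
```

With `summarise` left as `None` this is plain windowing. Passing a callback later upgrades it to summarisation without touching the code that calls `as_history()`, which keeps the "start simple" path cheap.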
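A subtler question: when a follow-up arrives, do you always run retrieval again, or sometimes reuse the chunks you already fetched? It depends on whether the follow-up drills into evidence you already hold or moves to something the fetched chunks don't cover.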
The honest default is to re-retrieve on every turn — it's simpler, and re-retrieval on the reformulated question is rarely wrong. Reuse is an optimisation for latency and cost, worth adding only once you've measured that a meaningful fraction of follow-ups drill into existing evidence. Premature reuse logic causes more bugs than it saves milliseconds.
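If you do add reuse later, one way to keep it from leaking bugs is to make re-retrieval the fallthrough path and require an explicit predicate before cached chunks are used. The sketch below assumes chunks are plain strings; `get_chunks`, `covers`, and the 0.8 threshold are hypothetical names and values, and the word-overlap check is only a placeholder for whatever signal your measurements justify.

```python
import re


def content_words(text):
    """Lowercased words longer than three characters (a crude content filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def covers(question, chunks, threshold=0.8):
    """Hypothetical heuristic: do the cached chunks mention most of the
    question's content words? The threshold is an arbitrary starting point."""
    wanted = content_words(question)
    if not wanted:
        return False
    seen = content_words(" ".join(chunks))
    return len(wanted & seen) / len(wanted) >= threshold


def get_chunks(standalone, retrieve, cached=None, can_reuse=None):
    """Re-retrieve unless a predicate explicitly approves the cached chunks.

    Returns (chunks, source) so logs show which path each turn took.
    """
    if cached and can_reuse is not None and can_reuse(standalone, cached):
        return cached, "reused"
    return retrieve(standalone), "retrieved"
```

Calling `get_chunks(standalone, retrieve)` with no predicate always re-retrieves, so the default behaviour matches the advice above. Returning the source label alongside the chunks gives you the data to measure how often reuse would actually fire.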
My take. Reformulation quality is the thing to obsess over in conversational RAG, and the thing most teams under-invest in. If the standalone question is wrong, every downstream stage — retrieval, reranking, generation — operates on a misunderstanding, and no amount of downstream quality recovers it. When a conversational system gives a baffling answer, check the reformulated question first. Nine times in ten, that's where it went wrong, not in retrieval.
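Checking the reformulated question first is only practical if every turn records it. A minimal sketch, assuming a `reformulate(history, message)` callable (for example, `to_standalone` with the LLM already bound) and the same `retrieve` and `generate` callables used elsewhere in this chapter. The function name `traced_turn` and the record fields are hypothetical.

```python
import json


def traced_turn(history, message, reformulate, retrieve, generate, log=print):
    """Run one conversational turn and log what the reformulation step did."""
    standalone = reformulate(history, message)
    chunks = retrieve(standalone)
    answer = generate(standalone, chunks)
    log(json.dumps({
        "message": message,        # what the user typed
        "standalone": standalone,  # what retrieval actually saw
        "changed": standalone.strip() != message.strip(),
        "n_chunks": len(chunks),
    }))
    return standalone, answer
```

When an answer looks baffling, the logged `message` and `standalone` pair shows at a glance whether the misunderstanding was introduced before retrieval ever ran.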
Take any single-shot RAG system and have a two-turn conversation with it: a question, then a follow-up like "what about X?" with no reformulation. Watch it retrieve nonsense on the second turn. This failure, seen once with your own eyes, is the most convincing argument for the reformulation step.
Add the reformulation step from the code above in front of your retriever. Run the same two-turn conversation. The follow-up should now retrieve correctly. Then try to break it: send a fresh standalone question mid-conversation (does it leave it alone?), then a message with an ambiguous pronoun (does it guess or ask?).
Look at realistic conversations for your use case. How many turns back does context usually reach? That number sets your window size. Do durable user attributes (plan, region, role) change answers? If so, sketch the small state object you'd extract them into. Decide this deliberately rather than defaulting to "keep everything."
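As a starting point for that sketch, here is one possible shape for the state object. The field names come from the examples above (plan, region, role); the class name `UserFacts` and its methods are hypothetical, and the extraction step that fills it, typically an LLM call, is deliberately left out.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class UserFacts:
    """Durable attributes that change answers. Fields are illustrative."""
    plan: Optional[str] = None
    region: Optional[str] = None
    role: Optional[str] = None

    def merge(self, **updates):
        """Overwrite known fields with non-None values; ignore everything else."""
        for key, value in updates.items():
            if value is not None and key in self.__dataclass_fields__:
                setattr(self, key, value)
        return self

    def as_context(self):
        """Render the known facts as one line to place ahead of the history."""
        known = {k: v for k, v in asdict(self).items() if v is not None}
        return "; ".join(f"{k}={v}" for k, v in known.items())


facts = UserFacts().merge(plan="Pro", region="EU")
print(facts.as_context())  # plan=Pro; region=EU
```

Keeping the object this small is the point: it holds only attributes that recur across turns, and unknown keys are dropped rather than accumulated.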
Next chapter: Evaluation — the part most series skip. Every chapter has said "measure it." This is the chapter that shows you how — retrieval metrics, generation metrics, building a golden set, and using an LLM as a judge correctly. It's the difference between engineering and guessing, and it's where Wave 1 ends.