Conversational RAG — multi-turn, state, follow-ups

Every example so far has assumed a single, self-contained question. Real users don't work that way. They ask "what's the refund window?", read the answer, then type "what about digital goods?" — and that second message is meaningless on its own. There's no subject, no verb that matters, nothing for a retriever to match. A human reads it effortlessly because they remember the previous turn. Your retrieval pipeline has no such memory unless you build it one. This chapter is about making RAG work in conversation, where most of the difficulty lives in three words: "what about it?"

What you'll take away from this chapter

Why single-shot RAG breaks the moment a user asks a follow-up
Query reformulation — turning a context-dependent message into a standalone question
How to manage conversation history without it growing forever or going stale
When to re-retrieve for a new turn and when to reuse what you already have
The failure modes unique to conversation, and how to catch them

The follow-up problem

The retriever from earlier chapters embeds the query and finds similar chunks. Feed it "what about digital goods?" and it embeds those four words — which are about nothing in particular — and retrieves a scattering of chunks that happen to mention digital goods in any context. The actual intent, "what is the refund window for digital goods," lived in the previous turn, and the retriever never saw it. The answer the user gets is confidently irrelevant.
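
To see the problem concretely, here is a minimal sketch that swaps the embedding retriever for a toy word-overlap scorer. The four-chunk corpus, the `tokens` and `overlap` helpers, and the scoring rule are all invented for illustration; a real embedding model behaves differently in detail, but the raw follow-up gives it just as little to work with.

```python
import re

# Hypothetical four-chunk corpus, invented for this illustration only.
CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "The refund window for digital goods is shorter than for physical items.",
    "Digital goods are delivered by email as a download link.",
    "Gift cards for digital goods never expire.",
]

def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, chunk):
    """Toy relevance score: shared word count (a stand-in for embedding similarity)."""
    return len(tokens(query) & tokens(chunk))

raw = "what about digital goods?"
standalone = "What is the refund window for digital goods?"

raw_scores = [overlap(raw, c) for c in CHUNKS]
standalone_scores = [overlap(standalone, c) for c in CHUNKS]

print(raw_scores)         # [0, 2, 2, 2]: three chunks tie, nothing singles out refunds
print(standalone_scores)  # [0, 7, 2, 3]: the refund-window chunk now wins clearly
```

With the raw follow-up, every chunk that mentions digital goods looks equally relevant. Only the standalone question separates the one the user actually wanted.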

This is the single biggest gap between a RAG demo and a RAG product. Demos ask one clean question. Products hold conversations. The fix is not to make retrieval smarter. It is to repair the query before retrieval, using the conversation history. This is the same query-understanding move as in Chapter 08, now driven by history rather than by the query alone.

Query reformulation — the core technique

The fix is a step that runs before retrieval on every turn: take the conversation history and the new message, and ask a fast LLM to rewrite them into a single standalone question that carries all the context it needs. "What about digital goods?" plus the prior turn about refund windows becomes "What is the refund window for digital goods?", a question the retriever can match on its own. The retriever, the reranker, and the generator all stay exactly as they were; you've only repaired the query using memory.

The reformulation step is the whole trick. It converts a turn that depends on memory into a question that depends on nothing — which is exactly the kind of question the retriever was built for. Everything downstream stays the same.

# Reformulate a follow-up into a standalone question, then run normal RAG.
REFORMULATE = """Given the conversation so far and the user's latest message,
rewrite the latest message as a standalone question that needs no prior
context to understand. If it is already standalone, return it unchanged.
Output only the rewritten question.

Conversation:
{history}

Latest message: {message}"""

def to_standalone(history, message, llm):
    """history: list of (role, text). Returns a self-contained question."""
    convo = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = REFORMULATE.format(history=convo, message=message)
    return llm.complete(prompt).strip()

def conversational_rag(history, message, llm, retrieve, generate):
    # 1. repair the query using history
    standalone = to_standalone(history, message, llm)
    # 2. the rest is ordinary single-shot RAG on the repaired query
    chunks = retrieve(standalone)          # hybrid + rerank, Ch06–07
    answer = generate(standalone, chunks)  # grounded generation, Ch09
    return standalone, answer

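# llm, retrieve and generate are assumed to come from the earlier chapters;
# llm only needs a .complete(prompt) method that returns a string.
history = [
    ("User", "what's the refund window?"),
    ("Bot", "Refunds are available within 30 days of purchase. [1]"),
]
standalone, answer = conversational_rag(
    history, "what about digital goods?", llm, retrieve, generate)
print("reformulated:", standalone)

# Typical output (the exact wording depends on the model):
# reformulated: What is the refund window for digital goods?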

That one reformulation step is the difference between a chatbot that loses the thread on the second message and one that holds a coherent conversation. Notice it reuses the exact retrieval and generation pipeline from the previous chapters — conversation didn't require rebuilding anything, only adding a query-repair step in front.

Managing history — what to keep

Conversations grow without bound, and you cannot feed an ever-lengthening transcript into every reformulation and every generation forever — it gets slow, expensive, and eventually overflows the context window. So you manage what you keep. Three approaches, in increasing sophistication:

Strategy	How it works	Best when
Windowing	Keep only the last N turns.	Most conversations — recent turns carry the live context. Simple and robust.
Summarisation	Compress older turns into a running summary, keep recent turns verbatim.	Long sessions where early context still matters but doesn't need full detail.
Key-fact extraction	Pull durable facts ("user is on the Pro plan, EU region") into a small state object.	Task-oriented assistants where specific user attributes recur across many turns.

Start with windowing — the last few turns are almost always enough for reformulation, because follow-ups refer to recent context, not the conversation's distant past. Reach for summarisation only when you observe the system losing context that scrolled out of the window, and key-fact extraction only when durable user attributes genuinely drive answers. This is the same "start simple, add on measured need" discipline as everywhere else in the series.
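
The three strategies can be sketched in a few small functions. This is one possible shape, not a prescribed design: the function names, the default of six turns, and the `summarize` callable (which in practice would wrap an LLM call) are all assumptions made for the example.

```python
def window_history(history, max_turns=6):
    """Windowing: keep only the last max_turns (role, text) pairs."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    return history[-max_turns:]

def compact(history, summary, summarize, max_turns=6):
    """Summarisation: fold turns that fall out of the window into a running summary.

    summarize(summary, old_turns) -> str is supplied by the caller (hypothetical).
    Store the returned `recent` list as the new history, otherwise the same
    old turns would be summarised again on the next call.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    overflow, recent = history[:-max_turns], history[-max_turns:]
    if overflow:
        summary = summarize(summary, overflow)
    return recent, summary

def render_context(recent, summary="", facts=None):
    """Build the history text for the reformulation prompt.

    facts is the small key-fact state object, e.g. {"plan": "Pro", "region": "EU"}.
    """
    parts = []
    if facts:
        parts.append("Known facts: " + "; ".join(f"{k}: {v}" for k, v in facts.items()))
    if summary:
        parts.append("Earlier in the conversation: " + summary)
    parts.extend(f"{role}: {text}" for role, text in recent)
    return "\n".join(parts)
```

Starting with `window_history` alone matches the advice above. `compact` and the `facts` argument are there to add later, once you have seen the window lose something that mattered.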

Re-retrieve, or reuse?

A subtler question: when a follow-up arrives, do you always run retrieval again, or sometimes reuse the chunks you already fetched? Two cases:

The follow-up shifts topic — "what about digital goods?" after a refund-window question. The needed evidence is different, so you must re-retrieve on the reformulated question. This is the common case.
The follow-up drills into the same evidence — "can you explain that second point more simply?" The chunks you already have contain the answer; re-retrieving is wasted work and might even fetch worse chunks. Reuse what's in context.

The honest default is to re-retrieve on every turn — it's simpler, and re-retrieval on the reformulated question is rarely wrong. Reuse is an optimisation for latency and cost, worth adding only once you've measured that a meaningful fraction of follow-ups drill into existing evidence. Premature reuse logic causes more bugs than it saves milliseconds.
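
A minimal sketch of that default, assuming a hypothetical `choose_chunks` helper and an optional caller-supplied `is_drill_down` check (which could be a small classifier or an LLM call; none is provided here). With no check supplied, it always re-retrieves. Returning the decision alongside the chunks makes it easy to measure how often reuse would actually apply.

```python
from collections import Counter

def choose_chunks(message, standalone, cached_chunks, retrieve, is_drill_down=None):
    """Re-retrieve by default; reuse cached chunks only when explicitly told to.

    is_drill_down(message) -> bool is a hypothetical, caller-supplied check that
    the follow-up asks about evidence already in context.
    """
    if is_drill_down is not None and cached_chunks and is_drill_down(message):
        return cached_chunks, "reused"
    return retrieve(standalone), "retrieved"

# Tally decisions over a session to see whether reuse is worth keeping.
decisions = Counter()
chunks, decision = choose_chunks(
    "what about digital goods?",
    "What is the refund window for digital goods?",
    cached_chunks=["old chunk"],
    retrieve=lambda q: ["fresh chunk for: " + q],  # stub retriever for the example
)
decisions[decision] += 1
print(decisions)
```

If the tally shows that "reused" would rarely fire on real traffic, that is evidence for leaving the reuse branch out entirely.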

My take. Reformulation quality is the thing to obsess over in conversational RAG, and the thing most teams under-invest in. If the standalone question is wrong, every downstream stage — retrieval, reranking, generation — operates on a misunderstanding, and no amount of downstream quality recovers it. When a conversational system gives a baffling answer, check the reformulated question first. Nine times in ten, that's where it went wrong, not in retrieval.
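
Checking the reformulated question first is only possible if you recorded it. One way to do that is a per-turn trace, sketched below; the `trace_turn` name, the record fields, and the chunk ids in the usage lines are assumptions for the example, not part of the pipeline from earlier chapters.

```python
import json

def trace_turn(message, standalone, chunk_ids, sink=print):
    """Record what the user typed, what was actually searched, and what came back."""
    record = {
        "message": message,
        "standalone": standalone,
        # True when the reformulator changed the text: the first thing to inspect.
        "rewritten": standalone.strip() != message.strip(),
        "chunk_ids": list(chunk_ids),
    }
    sink(json.dumps(record, ensure_ascii=False))
    return record

lines = []
record = trace_turn(
    "what about digital goods?",
    "What is the refund window for digital goods?",
    ["c12", "c40"],          # hypothetical chunk ids
    sink=lines.append,       # swap in a logger or file writer in practice
)
```

Reading `message` and `standalone` side by side is usually enough to tell whether a bad answer started at the reformulation step or further downstream.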

When this fails

Skipping reformulation entirely. Feeding the raw follow-up to the retriever is the default failure — "what about it?" retrieves noise. Reformulation is not optional in a conversational system; it's the load-bearing step.
Over-reformulating a standalone question. If the user asks a fresh, self-contained question mid-conversation, an eager reformulator can wrongly fold in irrelevant prior context, dragging retrieval toward the old topic. Instruct it to leave already-standalone questions unchanged, and verify it does.
Unbounded history. Feeding the entire transcript into every turn slows the system and eventually overflows the window. Window or summarise from the start.
Losing a key fact that scrolled away. Pure windowing forgets that the user said, ten turns ago, that they're in the EU — which changes every refund answer. If durable attributes matter, extract them into state rather than relying on the window.
Pronoun ambiguity the reformulator guesses wrong. "Can I return it?" after discussing two products — which one? A reformulator forced to guess may pick wrong silently. When genuinely ambiguous, the better behaviour is to ask the user, not to guess.
Stale reused chunks. If you reuse prior chunks for a follow-up that actually shifted topic, you answer the new question from the old evidence. When in doubt, re-retrieve.
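
For the ambiguous-pronoun case, one option is to give the reformulator an explicit way out instead of forcing a guess. The sketch below is an assumption-laden variant of the earlier prompt: the `CLARIFY:` marker, the prompt wording, and the `reformulate_or_ask` function are invented here, and `llm` is assumed to expose the same `.complete(prompt)` method as in the code above.

```python
CLARIFY_PREFIX = "CLARIFY:"  # hypothetical marker, chosen for this example

REFORMULATE_OR_ASK = """Given the conversation so far and the user's latest message,
rewrite the latest message as a standalone question that needs no prior
context to understand. If it is already standalone, return it unchanged.
If the message is ambiguous, for example a pronoun that could refer to more
than one thing, output CLARIFY: followed by a short question for the user.
Output only the rewritten question or the CLARIFY line.

Conversation:
{history}

Latest message: {message}"""

def reformulate_or_ask(history, message, llm):
    """Return ("search", question) or ("ask", clarifying_question)."""
    convo = "\n".join(f"{role}: {text}" for role, text in history)
    out = llm.complete(REFORMULATE_OR_ASK.format(history=convo, message=message)).strip()
    if out[:len(CLARIFY_PREFIX)].upper() == CLARIFY_PREFIX:
        # Skip retrieval this turn and put the question to the user instead.
        return "ask", out[len(CLARIFY_PREFIX):].strip()
    return "search", out
```

On "ask", the caller would send the clarifying question back to the user and wait for the next turn; on "search", the pipeline continues as before. Whether a given model uses the marker reliably is something to verify on your own conversations.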

Practice — before you read the next chapter

Watch your system lose the thread

Take any single-shot RAG system and have a two-turn conversation with it: a question, then a follow-up like "what about X?" with no reformulation. Watch it retrieve nonsense on the second turn. This failure, seen once with your own eyes, is the most convincing argument for the reformulation step.

Add reformulation and re-test

Take the reformulation step from the code above and place it in front of your retriever. Run the same two-turn conversation. The follow-up should now retrieve correctly. Then try to break it with a fresh standalone question mid-conversation (does it leave it alone?) and an ambiguous pronoun (does it guess or ask?).
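
These break-it attempts are worth keeping as a small regression set so you can rerun them whenever the prompt or model changes. The harness below is a sketch: `run_cases`, the check helpers, and the two cases are invented for illustration, and `reformulate` is any callable taking `(history, message)`, for instance a thin wrapper around `to_standalone` with your `llm` bound in.

```python
def normalise(text):
    """Lowercase, collapse whitespace, and drop trailing punctuation for comparison."""
    return " ".join(text.lower().split()).rstrip("?.! ")

def mentions(*words):
    """Check that the output contains every expected word or phrase."""
    return lambda out: all(w in out.lower() for w in words)

def unchanged(message):
    """Check that an already-standalone message came back essentially as typed."""
    return lambda out: normalise(out) == normalise(message)

REFUND_HISTORY = [
    ("User", "what's the refund window?"),
    ("Bot", "Refunds are available within 30 days of purchase. [1]"),
]

# (label, history, message, check) tuples; extend with cases from your own logs.
CASES = [
    ("follow-up gets its subject back", REFUND_HISTORY,
     "what about digital goods?", mentions("refund", "digital goods")),
    ("fresh question is left alone", REFUND_HISTORY,
     "How do I reset my password?", unchanged("How do I reset my password?")),
]

def run_cases(reformulate, cases=CASES):
    """Return a list of (label, output) for every case whose check failed."""
    failures = []
    for label, history, message, check in cases:
        out = reformulate(history, message)
        if not check(out):
            failures.append((label, out))
    return failures
```

An empty list from `run_cases` means every case passed. Keyword checks like `mentions` are deliberately crude; they catch a reformulator that lost the subject, not one that phrased it poorly.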

Choose your history strategy

Look at realistic conversations for your use case. How many turns back does context usually reach? That number sets your window size. Do durable user attributes (plan, region, role) change answers? If so, sketch the small state object you'd extract them into. Decide this deliberately rather than defaulting to "keep everything."
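
If you hand-label a sample of follow-ups with how many turns back each one reaches, the window size falls out of a short calculation. The sketch below assumes such labels exist; the `window_for_coverage` helper, the coverage target, and the sample numbers are all made up for illustration.

```python
import math

def window_for_coverage(reaches, coverage=0.95):
    """Smallest window (in turns) that covers the given fraction of labelled follow-ups.

    reaches: for each follow-up, how many turns back the needed context was.
    """
    if not reaches:
        raise ValueError("need at least one labelled follow-up")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(reaches)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[index]

# Hypothetical labels from ten follow-ups in a transcript sample.
sample_reaches = [1, 1, 1, 2, 1, 3, 1, 2, 1, 6]
print(window_for_coverage(sample_reaches, coverage=0.5))  # 1
print(window_for_coverage(sample_reaches, coverage=1.0))  # 6
```

A long tail like the single reach of 6 here is the signal to look at: if those outliers are durable facts such as plan or region, a state object may serve them better than a larger window.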

Takeaways

Single-shot RAG breaks on follow-ups because context-dependent messages ("what about it?") carry no retrievable meaning on their own.
The fix is query reformulation: use the history to rewrite each turn into a standalone question, then run the unchanged retrieval-and-generation pipeline on it.
Manage history with windowing first; add summarisation or key-fact extraction only on measured need.
Re-retrieve by default; reuse prior chunks only as a measured optimisation for follow-ups that drill into the same evidence.
Reformulation quality dominates conversational RAG. When an answer is baffling, check the reformulated question before blaming retrieval.

Next chapter: Evaluation — the part most series skip. Every chapter has said "measure it." This is the chapter that shows you how — retrieval metrics, generation metrics, building a golden set, and using an LLM as a judge correctly. It's the difference between engineering and guessing, and it's where Wave 1 ends.
