RAG: A Field Manual for Building LLM Systems That Use Your Data

Query understanding — rewrite, decompose, route

Everything so far has improved how we search — better indexes, hybrid retrieval, reranking. This chapter improves what we search for. Because here is the uncomfortable reality of production: users do not type clean queries. They type "it's broken again" with no subject. They ask three questions in one sentence. They use your internal product name that appears nowhere in the docs. They paste an error and a paragraph of context and a "pls help." The query that arrives is rarely the query your retriever wants. Query understanding is the stage that sits between the user and the index, repairing and reshaping the query so that retrieval has a fighting chance.

What you'll take away from this chapter

The gap between what users type and what retrieval needs — and why it widens in conversation
Query rewriting and HyDE — reshaping one query into a more retrievable form
Multi-query and decomposition — turning one query into several when one isn't enough
Routing — sending different query types down different paths
The honest cost: every one of these adds an LLM call before retrieval, so when not to use them

The query gap

Retrieval works best when the query looks like the answer. Embedding models place text by meaning, so a query phrased like the chunk that answers it lands near that chunk. But users don't phrase queries like answers. They phrase them like questions, often terse and context-dependent ones. "How much?" is perfectly clear to a human who saw the previous message and meaningless to a retriever that sees only those two words. That is why the gap widens in conversation: each turn leans on the turns before it, and by default the retriever sees none of them. The query-understanding stage exists to close that gap before the query reaches the index.

Query understanding is a pre-processing stage. It changes only the query, leaving retrieval, reranking, and generation untouched — which makes it cheap to add to an existing system and easy to turn off when it isn't helping.

Reshaping one query — rewriting and HyDE

Query rewriting

The simplest move: use a small, fast LLM to rewrite the messy query into a clean, retrievable one before it hits the index. Expand abbreviations, resolve internal jargon to the words your docs use, strip the "pls help," and, crucially in conversation, fold in context from earlier turns so "how much?" becomes "how much does the Pro plan cost?" Rewriting is the highest-value query technique in conversational systems precisely because of that context-folding, which gets a chapter of its own later in this manual.
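A minimal sketch of a conversational rewriter, under stated assumptions: the `llm` object exposes the same `complete(prompt) -> str` method used elsewhere in this chapter, `CannedLLM` is a hypothetical offline stand-in, and the length guard is an arbitrary heuristic rather than a recommended threshold.

```python
from typing import Protocol, Sequence


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


def rewrite_query(query: str, history: Sequence[str], llm: LLM, max_turns: int = 4) -> str:
    """Rewrite a messy or context-dependent query into a standalone one.

    Falls back to the original query if the model returns nothing usable.
    """
    recent = "\n".join(history[-max_turns:]) or "(none)"
    prompt = (
        "Rewrite the user's latest message as one standalone search query.\n"
        "Resolve pronouns and missing subjects from the conversation, expand\n"
        "abbreviations, and drop filler. Do not add facts that are not in the\n"
        "conversation. Reply with the rewritten query only.\n\n"
        f"Conversation so far:\n{recent}\n\n"
        f"Latest message: {query}"
    )
    rewritten = llm.complete(prompt).strip().strip('"')
    # Heuristic guard (an assumption): empty or runaway output keeps the original.
    if not rewritten or len(rewritten) > 4 * max(len(query), 40):
        return query
    return rewritten


class CannedLLM:
    """Hypothetical stand-in for a real client so the sketch runs offline."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


history = ["User: Tell me about the Pro plan.", "Assistant: Pro adds SSO and audit logs."]
print(rewrite_query("how much?", history, CannedLLM("How much does the Pro plan cost?")))
```

The prompt tells the model not to add facts, and the fallback returns the user's own words, which keeps the rewrite on the conservative side.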

HyDE — the hypothetical answer trick

HyDE (Hypothetical Document Embeddings) is a clever inversion. Instead of searching with the question, you first ask an LLM to write a hypothetical answer to the question — even a wrong one — and then search with that. Why does guessing help? Because an answer looks like the chunks you're trying to retrieve far more than a question does. The hypothetical answer might get facts wrong, but it gets the shape and vocabulary of a good answer right, and that shape is what the embedding model matches on. You're using the LLM's guess as a better-shaped search probe, then retrieving the real, correct chunks it points toward.
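The idea can be sketched in a few lines. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, `generate` is any function that takes a prompt and returns text, and the chunks and the canned guess are made-up demo data.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def toy_embed(text: str) -> Counter:
    """Toy bag-of-words vector so the sketch runs offline (not a real embedding)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    chunks: Sequence[str],
    generate: Callable[[str], str],
    embed: Callable[[str], Counter] = toy_embed,
    top_k: int = 1,
) -> List[str]:
    """Search with an LLM-written hypothetical answer instead of the question."""
    hypothetical = generate(
        "Write a short passage that plausibly answers this question. "
        f"It is fine to guess.\n\nQuestion: {question}"
    ).strip()
    probe = embed(hypothetical or question)  # empty guess: fall back to the question
    ranked = sorted(chunks, key=lambda c: cosine(probe, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Refunds are issued within 14 days of purchase for unused items.",
    "Our offices are closed on public holidays.",
]
# The canned guess gets the number wrong (30 vs 14) but has the right shape.
guess = lambda prompt: "Refunds are usually issued within 30 days of purchase."
print(hyde_search("refund time limit?", chunks, guess))
```

In this toy setup the raw question shares no tokens with the refund chunk, while the wrong-but-well-shaped guess overlaps heavily with it. Only the retrieved chunk, never the guess, should reach the generator.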

Splitting one query into several — multi-query and decomposition

Multi-query

One phrasing of a question retrieves one neighbourhood of the vector space. But there are many ways to phrase the same question, each landing slightly differently. Multi-query generates several paraphrases of the user's query, retrieves for each, and fuses the results (with RRF from Chapter 06). It widens the net, catching relevant chunks that any single phrasing would have missed. Useful when recall is your problem and the query is genuinely answerable but phrased narrowly.
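A sketch of the retrieve-per-paraphrase-then-fuse loop. The assumptions: `paraphrase(query, n)` wraps an LLM call that returns up to `n` rewordings, `retrieve(q)` returns a ranked list of chunk ids, and `k=60` is the constant commonly used with reciprocal rank fusion (your Chapter 06 setup may differ). The in-memory `index` is made-up demo data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def multi_query_retrieve(
    query: str,
    paraphrase: Callable[[str, int], List[str]],
    retrieve: Callable[[str], List[str]],
    n: int = 3,
    top_k: int = 5,
) -> List[str]:
    """Retrieve for the original query plus its paraphrases, then fuse."""
    variants = [query] + [p for p in paraphrase(query, n) if p and p != query]
    rankings = [retrieve(v) for v in variants]
    return rrf_fuse(rankings)[:top_k]


# Demo data: a fake index keyed by exact query text.
index = {
    "reset password": ["doc_a", "doc_b"],
    "recover account access": ["doc_c", "doc_a"],
    "forgot login credentials": ["doc_a", "doc_d"],
}
fake_paraphrase = lambda q, n: ["recover account access", "forgot login credentials"][:n]
fake_retrieve = lambda q: index.get(q, [])

print(multi_query_retrieve("reset password", fake_paraphrase, fake_retrieve))
```

The original query always stays in the variant list, so a poor set of paraphrases can add noise but cannot remove what the plain query would have found.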

Decomposition

Some questions are secretly several questions. "Compare the refund policy for digital and physical goods" is two retrievals — digital refunds, physical refunds — and a comparison. A single search for the whole sentence retrieves chunks that are half-relevant to each half and fully relevant to neither. Decomposition uses an LLM to break the compound question into sub-questions, retrieves for each separately, and combines the evidence. This is the technique that rescues the multi-part questions naive RAG quietly fails.

# Decompose a compound question into answerable sub-questions and
# retrieve for each one separately, using any LLM client.
import json

def decompose(question, llm):
    """Ask the LLM to split a compound question into sub-questions.
    Returns the original wrapped in a list if it's already atomic
    or if the reply can't be parsed."""
    prompt = f"""Break this question into the minimal set of standalone
sub-questions needed to answer it fully. If it is already a single
question, return it unchanged. Reply with a JSON array of strings only.

Question: {question}"""
    raw = llm.complete(prompt).strip()
    # Models often wrap JSON in a code fence or add a preamble; keep only the array.
    start, end = raw.find("["), raw.rfind("]")
    if start != -1 and end > start:
        raw = raw[start:end + 1]
    try:
        subs = json.loads(raw)
    except json.JSONDecodeError:
        return [question]              # fall back to the original, never crash
    # Accept only a non-empty list of non-empty strings.
    if isinstance(subs, list) and subs and all(isinstance(s, str) and s.strip() for s in subs):
        return [s.strip() for s in subs]
    return [question]

def answer_compound(question, llm, retrieve):
    """retrieve(q) -> list[str] is your hybrid+rerank pipeline from Ch06–07."""
    sub_questions = decompose(question, llm)
    evidence = {}
    for sub in sub_questions:
        evidence[sub] = retrieve(sub)          # separate retrieval per sub-question
    return sub_questions, evidence

# `llm` and `retrieve` are assumed to be in scope: your LLM client and
# your retrieval function.
q = "Compare the refund policy for digital and physical goods."
subs, ev = answer_compound(q, llm, retrieve)
for s in subs:
    print("•", s)

• What is the refund policy for digital goods?
• What is the refund policy for physical goods?

Each sub-question now retrieves the chunk that actually answers it, cleanly. The generation step (next chapter) then composes the comparison from two solid pieces of evidence instead of one muddy retrieval. Note the defensive parsing — if the LLM returns malformed JSON, the code falls back to treating the question as atomic rather than crashing. That habit, treating every LLM call as fallible, is what separates a robust pipeline from a demo.
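Before generation, the per-sub-question evidence has to be turned into one context block. One possible way to do that is sketched below; the function name, the layout of the context string, and the sample chunks are all assumptions for illustration, not the format the next chapter prescribes.

```python
def merge_evidence(evidence: dict[str, list[str]], max_chunks_per_sub: int = 3) -> str:
    """Build one context block, grouped by sub-question, with duplicate chunks removed."""
    seen: set[str] = set()
    sections = []
    for sub_question, chunks in evidence.items():
        kept = []
        for chunk in chunks:
            if chunk in seen:
                continue  # already shown under an earlier sub-question
            seen.add(chunk)
            kept.append(chunk)
            if len(kept) == max_chunks_per_sub:
                break
        if kept:
            body = "\n".join(f"- {c}" for c in kept)
        elif chunks:
            body = "- (covered by chunks listed above)"
        else:
            body = "- (no evidence retrieved)"
        sections.append(f"Sub-question: {sub_question}\n{body}")
    return "\n\n".join(sections)


# Made-up chunks standing in for real retrieval results.
evidence = {
    "What is the refund policy for digital goods?": [
        "Digital goods: refunds within 14 days if not downloaded.",
    ],
    "What is the refund policy for physical goods?": [
        "Physical goods: refunds within 30 days with receipt.",
        "Digital goods: refunds within 14 days if not downloaded.",
    ],
}
print(merge_evidence(evidence))
```

Keeping the sub-question labels in the context lets the generator see which evidence answers which half of the comparison, and an explicit "no evidence retrieved" line gives it grounds to refuse that half.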

Step-back — zooming out for grounding

A subtler technique: for a very specific question, first ask a more general "step-back" question, retrieve the broad principle, then retrieve the specifics, and give the model both. "Can I get a refund on a gift card bought in March under promo code SPRING?" is so specific it may match nothing. The step-back question — "what is the refund policy for gift cards?" — retrieves the governing rule, which usually contains or implies the specific answer. You hand the model the general principle and the specific details together. It's decomposition's cousin, aimed at over-specific rather than compound queries.
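A rough sketch of step-back retrieval, assuming the same `llm.complete(prompt)` interface as the decomposition code and a `retrieve(q)` function that returns a list of chunks. The stub client and the one-entry index are hypothetical demo data.

```python
from typing import Callable, List, Tuple


def step_back_retrieve(
    question: str,
    llm,
    retrieve: Callable[[str], List[str]],
    top_k: int = 3,
) -> Tuple[str, List[str]]:
    """Retrieve for a generalized version of the question and for the question itself."""
    prompt = (
        "Rewrite this question as one more general question about the "
        "underlying policy or principle. Reply with the question only.\n\n"
        f"Question: {question}"
    )
    general = llm.complete(prompt).strip()
    general_hits = retrieve(general)[:top_k] if general else []
    specific_hits = retrieve(question)[:top_k]
    # General principle first, then specifics; dict.fromkeys dedupes in order.
    combined = list(dict.fromkeys(general_hits + specific_hits))
    return general, combined


class StubLLM:
    """Hypothetical offline stand-in that always returns one step-back question."""

    def complete(self, prompt: str) -> str:
        return "What is the refund policy for gift cards?"


# The specific question matches nothing; the general one finds the rule.
index = {"What is the refund policy for gift cards?": ["Gift cards are non-refundable once activated."]}
specific = "Can I get a refund on a gift card bought in March under promo code SPRING?"
general, chunks = step_back_retrieve(specific, StubLLM(), lambda q: index.get(q, []))
print(general)
print(chunks)
```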

Routing — different queries, different paths

Not every query wants the same treatment. A factual lookup ("what's the refund window?") wants straight hybrid retrieval. A compound question wants decomposition. A query that's actually about a specific document the user names wants a metadata filter, not a semantic search. Routing classifies the incoming query and sends it down the right path. The router can be a small classifier or a quick LLM call that picks among a fixed set of known routes.
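A small router along these lines might look like the sketch below. The three route names, the prompt wording, and the handlers are illustrative assumptions; the one design point worth copying is that any label outside the fixed set falls back to a default route instead of raising.

```python
from typing import Callable, Dict, Tuple

ROUTES = ("lookup", "compound", "document")  # hypothetical route names
DEFAULT_ROUTE = "lookup"


def classify(query: str, llm) -> str:
    """Ask the LLM for one label from ROUTES; anything else becomes the default."""
    prompt = (
        "Classify the query into exactly one label.\n"
        "lookup: a single factual question\n"
        "compound: several questions or a comparison\n"
        "document: asks about a specific named document\n"
        "Reply with the label only.\n\n"
        f"Query: {query}"
    )
    label = llm.complete(prompt).strip().lower()
    return label if label in ROUTES else DEFAULT_ROUTE


def route(query: str, llm, handlers: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Classify the query and dispatch it; returns (label, handler result)."""
    label = classify(query, llm)
    handler = handlers.get(label) or handlers[DEFAULT_ROUTE]
    return label, handler(query)


class CannedLLM:
    """Hypothetical stand-in for a real client so the sketch runs offline."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


handlers = {
    "lookup": lambda q: f"hybrid search: {q}",
    "compound": lambda q: f"decompose then search: {q}",
    "document": lambda q: f"metadata filter: {q}",
}
print(route("Compare digital and physical refunds", CannedLLM("Compound\n"), handlers))
print(route("what's the refund window?", CannedLLM("I think it is a lookup"), handlers))
```

The second call shows the fallback: the model ignored the "label only" instruction, so the query takes the default path rather than failing.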

My take. Routing is powerful and also where over-engineering creeps in fastest — it's the "modular RAG" temptation from Chapter 01 made concrete. Build it only once you have measured distinct query classes that genuinely need distinct handling, and keep the routes few and named. A router with eleven branches is a maintenance burden and a new failure surface — the classifier itself can misroute. Two or three obvious routes, added when the data demands them, is almost always the right amount.

The honest cost of all this

Almost every technique in this chapter adds at least one LLM call before retrieval even begins. That's latency the user feels and money you spend on every query. Query rewriting adds one fast call. HyDE adds an answer-generation call. Multi-query adds a call plus several extra retrievals. Decomposition adds a call plus a retrieval per sub-question. Step-back adds a call plus one extra retrieval. Routing adds a classification step, which is an LLM call unless you use a small classifier. None are free, and stacking all of them can turn a 200-millisecond retrieval into a multi-second pipeline.

Technique	Use when	Added cost
Rewriting	Queries are messy or conversational	One fast LLM call
HyDE	Queries are short; answers are verbose	One answer-generation call
Multi-query	Recall is low; query is narrowly phrased	One call + N extra retrievals
Decomposition	Questions are compound or multi-part	One call + one retrieval per sub-question
Step-back	Queries are over-specific	One call + one extra retrieval
Routing	You have measured distinct query classes	One classification step per query

The discipline: add these one at a time, measure the quality gain against the latency cost on your eval set, and keep only the ones that pay for themselves. A naive system with none of these is the correct starting point — exactly as Chapter 01 argued. Reach here only when you've measured a specific query problem these techniques solve.

When this fails

Rewriting away the meaning. An over-eager rewriter can "clean up" a query into something subtly different from what the user asked, retrieving confidently for the wrong question. Keep the rewrite conservative and, when in doubt, retrieve for both the original and the rewrite and fuse.
HyDE on factual pinpoint queries. For "what is error E-4021," HyDE's hypothetical answer can hallucinate a plausible-but-wrong explanation that drags retrieval toward the wrong chunks. HyDE helps verbose conceptual queries, not exact-token lookups — which is exactly where hybrid's keyword side (Chapter 06) already shines.
Decomposing atomic questions. Force-decomposing a simple question invents sub-questions that scatter retrieval and muddy the answer. Let the decomposer return the question unchanged when it's already atomic — and verify it does.
Multi-query that drowns precision. Five paraphrases times ten chunks each is fifty candidates, many marginal. Without reranking after fusion, multi-query can lower precision even as it raises recall. Pair it with the reranker from Chapter 07.
A router that misroutes silently. The classifier is itself a model that can be wrong, sending a compound question down the simple path. Log routing decisions and include misrouting in your evaluation, or the router becomes an invisible source of failure.
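The last failure mode above, the silent misroute, becomes measurable once you have a small labeled set of queries. The sketch below assumes a `router(query) -> label` callable and hand-labeled `(query, expected_route)` pairs; the keyword router and the three examples are made up for the demo.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def misroute_report(
    labeled: Iterable[Tuple[str, str]],
    router: Callable[[str], str],
) -> Tuple[float, Counter]:
    """Return (misroute rate, counts keyed by (expected, predicted))."""
    confusion: Counter = Counter()
    for query, expected in labeled:
        confusion[(expected, router(query))] += 1
    total = sum(confusion.values())
    wrong = sum(n for (expected, predicted), n in confusion.items() if expected != predicted)
    return (wrong / total if total else 0.0), confusion


def keyword_router(query: str) -> str:
    """Deliberately naive demo router: looks for ' and ' or a leading 'compare'."""
    q = query.lower()
    return "compound" if " and " in q or q.startswith("compare") else "lookup"


labeled = [
    ("What's the refund window?", "lookup"),
    ("Compare the refund policy for digital and physical goods.", "compound"),
    ("What are the refund rules for gift cards versus store credit?", "compound"),
]
rate, confusion = misroute_report(labeled, keyword_router)
print(f"misroute rate: {rate:.2f}")
for (expected, predicted), n in sorted(confusion.items()):
    print(f"expected={expected} predicted={predicted} count={n}")
```

The `(expected, predicted)` counts show which direction the router errs in. Here the "versus" question slips down the simple path, which is exactly the case the bullet warns about.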

Practice — before you read the next chapter

Audit your real queries

Collect fifty real user queries (or realistic ones). Sort them into buckets: clean and answerable as-is, messy/needs-rewriting, compound/needs-decomposition, over-specific/needs-step-back. The size of each bucket tells you which technique is worth building first — and how many of your queries need nothing at all.

Try decomposition by hand

Take your compound queries and run them through the decomposition code (or just an LLM prompt). Retrieve for the whole question and for each sub-question separately. Compare the chunks. The improvement on compound questions is usually dramatic and immediately convincing.

Measure the latency tax

For one technique you're considering, time a query end-to-end with and without it. Put that number next to the quality gain on your eval set. This quality-per-millisecond view is the only honest basis for deciding whether a query technique earns its place in your pipeline.
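A minimal timing harness for that comparison, using only the standard library. The two pipeline functions are placeholders that you would replace with your own. The `sleep` merely simulates the added LLM call, and the percentile uses a simple nearest-rank rule.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence


def time_pipeline(fn: Callable[[str], object], queries: Sequence[str], repeats: int = 3) -> Dict[str, float]:
    """Run fn over every query `repeats` times; return p50 and p95 latency in ms."""
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = math.ceil(0.95 * len(samples)) - 1  # nearest-rank percentile
    return {"p50_ms": statistics.median(samples), "p95_ms": samples[p95_index]}


def baseline(query: str) -> str:
    """Placeholder for your pipeline without the technique."""
    return query


def with_technique(query: str) -> str:
    """Placeholder for the same pipeline with one technique switched on."""
    time.sleep(0.02)  # simulated LLM call; replace with the real thing
    return query


queries = ["how much?", "it's broken again"]
before = time_pipeline(baseline, queries)
after = time_pipeline(with_technique, queries)
print(f"latency tax at p50: {after['p50_ms'] - before['p50_ms']:.1f} ms")
```

Report p95 alongside the median: an extra LLM call tends to hurt the slow tail more than the typical case, and the tail is what users complain about.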

Takeaways

Users don't type retrievable queries. Query understanding repairs the query before it reaches the index, leaving the rest of the pipeline untouched.
Rewriting reshapes a messy query (and folds in conversation context); HyDE searches with a hypothetical answer because answers look more like chunks than questions do.
Multi-query widens the net for narrowly-phrased questions; decomposition splits compound questions into separately-retrievable parts; step-back zooms out for over-specific ones.
Routing sends different query types down different paths — powerful, but build it only for measured, distinct query classes and keep the routes few.
Every technique adds an LLM call before retrieval. Add them one at a time, measure quality against latency, and keep only what pays for itself.

Next chapter: Generation — grounding, citation, refusal. We've found the right chunks. Now the model has to answer from them — faithfully, with citations, and with the discipline to say "I don't know" when the evidence isn't there. This is where retrieval becomes an answer.
