Everything so far has improved how we search — better indexes, hybrid retrieval, reranking. This chapter improves what we search for. Because here is the uncomfortable reality of production: users do not type clean queries. They type "it's broken again" with no subject. They ask three questions in one sentence. They use your internal product name that appears nowhere in the docs. They paste an error and a paragraph of context and a "pls help." The query that arrives is rarely the query your retriever wants. Query understanding is the stage that sits between the user and the index, repairing and reshaping the query so that retrieval has a fighting chance.
Retrieval works best when the query looks like the answer. Embedding models place text by meaning, so a query phrased like the chunk that answers it lands near that chunk. But users don't phrase queries like answers — they phrase them like questions, often terse and context-dependent ones. "How much?" is a perfectly clear question to a human who saw the previous message and is meaningless to a retriever. The query-understanding stage exists to close that gap before the query reaches the index.
The simplest move: use a small, fast LLM to rewrite the messy query into a clean, retrievable one before it hits the index. Expand abbreviations, resolve internal jargon to the words your docs use, strip the "pls help," and, crucially in conversation, fold in context from earlier turns so "how much?" becomes "how much does the Pro plan cost?" Rewriting is the highest-value query technique in conversational systems precisely because of that context-folding, which a later chapter treats in depth.
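A minimal sketch of such a rewriter, under stated assumptions: `llm` is any client exposing `complete(prompt) -> str`, and `history` is a list of `(speaker, text)` tuples. Both names and the prompt wording are illustrative, not a prescribed interface.

```python
def build_rewrite_prompt(query, history):
    """Build a prompt asking for a standalone, retrievable version of `query`.

    `history` is a list of (speaker, text) tuples from earlier turns
    (an assumed shape; adapt it to however you store conversations)."""
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return (
        "Rewrite the user's last message as one standalone search query.\n"
        "Expand abbreviations, resolve pronouns and references using the "
        "conversation, and drop pleasantries. Reply with the query only.\n\n"
        f"Conversation:\n{turns}\n\n"
        f"Last message: {query}"
    )


def rewrite_query(query, history, llm):
    """Return the rewritten query, or the original query if the LLM call
    fails or comes back empty."""
    try:
        rewritten = llm.complete(build_rewrite_prompt(query, history)).strip()
    except Exception:
        return query  # a failed rewrite should never block retrieval
    return rewritten or query
```

The fallback matters as much as the prompt: if the rewrite fails, the user's own words still reach the index.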
HyDE (Hypothetical Document Embeddings) is a clever inversion. Instead of searching with the question, you first ask an LLM to write a hypothetical answer to the question — even a wrong one — and then search with that. Why does guessing help? Because an answer looks like the chunks you're trying to retrieve far more than a question does. The hypothetical answer might get facts wrong, but it gets the shape and vocabulary of a good answer right, and that shape is what the embedding model matches on. You're using the LLM's guess as a better-shaped search probe, then retrieving the real, correct chunks it points toward.
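The idea can be sketched in a few lines. Everything here is an assumption made for illustration: `llm.complete(prompt)` returns text, `embed(text)` returns a vector, and `search(vector, k)` returns the top `k` chunks from your index. Substitute whatever your stack actually provides.

```python
def hyde_search(question, llm, embed, search, k=5):
    """Search with the embedding of a hypothetical answer, not the question.

    llm, embed and search are placeholders for your own components."""
    prompt = (
        "Write a short passage that plausibly answers the question below, "
        "in the style of product documentation. It does not need to be "
        "correct.\n\n"
        f"Question: {question}"
    )
    try:
        hypothetical = llm.complete(prompt).strip()
    except Exception:
        hypothetical = ""
    # Fall back to the raw question if the LLM gave us nothing usable.
    probe = hypothetical or question
    return search(embed(probe), k)
```

Note that the hypothetical answer is only a search probe. It is discarded after retrieval and should never be shown to the user or passed to generation as evidence.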
One phrasing of a question retrieves one neighbourhood of the vector space. But there are many ways to phrase the same question, each landing slightly differently. Multi-query generates several paraphrases of the user's query, retrieves for each, and fuses the results (with RRF from Chapter 06). It widens the net, catching relevant chunks that any single phrasing would have missed. Useful when recall is your problem and the query is genuinely answerable but phrased narrowly.
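A rough sketch of multi-query retrieval follows. The fusion step uses the standard reciprocal rank fusion formula with the commonly used constant 60, which may differ in detail from the Chapter 06 implementation. The `llm.complete` and `retrieve(q) -> list[str]` interfaces are assumptions.

```python
import json


def paraphrase(question, llm, n=3):
    """Ask the LLM for n paraphrases; the original question always comes first."""
    prompt = (
        f"Write {n} different phrasings of this question that keep its "
        "meaning. Reply with a JSON array of strings only.\n"
        f"Question: {question}"
    )
    try:
        variants = json.loads(llm.complete(prompt))
    except Exception:
        variants = []  # malformed reply or failed call: search the original only
    if not isinstance(variants, list):
        variants = []
    queries = [question]
    for v in variants:
        if isinstance(v, str) and v.strip() and v.strip() not in queries:
            queries.append(v.strip())
    return queries[: n + 1]


def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def multi_query_retrieve(question, llm, retrieve, n=3, top_k=5):
    """Retrieve once per phrasing, then fuse the rankings into one list."""
    queries = paraphrase(question, llm, n)
    return rrf_fuse([retrieve(q) for q in queries])[:top_k]
```

Chunks that appear in several rankings accumulate score, so a chunk that is merely decent for every phrasing can outrank one that is first for a single phrasing.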
Some questions are secretly several questions. "Compare the refund policy for digital and physical goods" is two retrievals — digital refunds, physical refunds — and a comparison. A single search for the whole sentence retrieves chunks that are half-relevant to each half and fully relevant to neither. Decomposition uses an LLM to break the compound question into sub-questions, retrieves for each separately, and combines the evidence. This is the technique that rescues the multi-part questions naive RAG quietly fails.
```python
# Decompose a compound question into answerable sub-questions,
# then retrieve for each one separately, using any LLM client.
import json


def decompose(question, llm):
    """Ask the LLM to split a compound question into sub-questions.

    Returns the original wrapped in a list if it's already atomic."""
    prompt = f"""Break this question into the minimal set of standalone
sub-questions needed to answer it fully. If it is already a single
question, return it unchanged. Reply with a JSON array of strings only.
Question: {question}"""
    raw = llm.complete(prompt).strip()
    try:
        subs = json.loads(raw)
    except json.JSONDecodeError:
        return [question]  # fall back to the original, never crash
    if not isinstance(subs, list):
        return [question]
    # Keep only non-empty strings; the LLM may return other JSON values.
    subs = [s.strip() for s in subs if isinstance(s, str) and s.strip()]
    return subs or [question]


def answer_compound(question, llm, retrieve):
    """retrieve(q) -> list[str] is your hybrid+rerank pipeline from Ch06–07."""
    sub_questions = decompose(question, llm)
    evidence = {}
    for sub in sub_questions:
        evidence[sub] = retrieve(sub)  # separate retrieval per sub-question
    return sub_questions, evidence


# `llm` and `retrieve` are assumed to be defined elsewhere in your pipeline.
q = "Compare the refund policy for digital and physical goods."
subs, ev = answer_compound(q, llm, retrieve)
for s in subs:
    print("•", s)

# Typical output (exact wording depends on the LLM):
# • What is the refund policy for digital goods?
# • What is the refund policy for physical goods?
```
Each sub-question now retrieves the chunk that actually answers it, cleanly. The generation step (next chapter) then composes the comparison from two solid pieces of evidence instead of one muddy retrieval. Note the defensive parsing — if the LLM returns malformed JSON, the code falls back to treating the question as atomic rather than crashing. That habit, treating every LLM call as fallible, is what separates a robust pipeline from a demo.
A subtler technique: for a very specific question, first ask a more general "step-back" question, retrieve the broad principle, then retrieve the specifics, and give the model both. "Can I get a refund on a gift card bought in March under promo code SPRING?" is so specific it may match nothing. The step-back question — "what is the refund policy for gift cards?" — retrieves the governing rule, which usually contains or implies the specific answer. You hand the model the general principle and the specific details together. It's decomposition's cousin, aimed at over-specific rather than compound queries.
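One way this might look in code, assuming the same hypothetical `llm.complete(prompt)` client and a `retrieve(q) -> list[str]` function. The returned dictionary shape is an illustrative choice, not a fixed convention.

```python
def step_back_retrieve(question, llm, retrieve, k=3):
    """Retrieve for a generalised 'step-back' question and for the original.

    Returns both evidence sets so generation sees the governing rule
    alongside any specific matches."""
    prompt = (
        "Rewrite the question below as one more general question about "
        "the underlying rule or policy. Drop dates, codes and other "
        "specifics. Reply with the question only.\n\n"
        f"Question: {question}"
    )
    try:
        general = llm.complete(prompt).strip()
    except Exception:
        general = ""
    specific = retrieve(question)[:k]
    if not general or general == question:
        # No usable step-back question: behave like plain retrieval.
        return {"step_back_question": None, "principle": [], "specific": specific}
    principle = retrieve(general)[:k]
    # Avoid passing the same chunk to the model twice.
    seen = set(principle)
    specific = [chunk for chunk in specific if chunk not in seen]
    return {"step_back_question": general, "principle": principle, "specific": specific}
```

Keeping the two evidence sets separate lets the generation prompt label them, for example as "general policy" and "specific matches", instead of presenting one undifferentiated pile.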
Not every query wants the same treatment. A factual lookup ("what's the refund window?") wants straight hybrid retrieval. A compound question wants decomposition. A query that's actually about a specific document the user names wants a metadata filter, not a semantic search. Routing classifies the incoming query and sends it down the right path. The router can be a small classifier or a quick LLM call that picks among a fixed set of known routes.
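A small router along these lines could look like the sketch below. The three route names, the prompt, and the `handlers` mapping are hypothetical. The LLM client is again assumed to expose `complete(prompt) -> str`.

```python
# Hypothetical route names; use whatever classes your own data shows.
ROUTES = ("lookup", "compound", "document")
DEFAULT_ROUTE = "lookup"


def classify(query, llm):
    """Ask the LLM to pick one route; fall back to the default on any doubt."""
    prompt = (
        "Classify the query into exactly one of: "
        + ", ".join(ROUTES)
        + ".\n"
        "lookup = a single factual question; compound = several questions "
        "in one; document = about a specific named document.\n"
        "Reply with the label only.\n\n"
        f"Query: {query}"
    )
    try:
        label = llm.complete(prompt).strip().strip(".\"'").lower()
    except Exception:
        return DEFAULT_ROUTE
    return label if label in ROUTES else DEFAULT_ROUTE


def route(query, llm, handlers):
    """Dispatch the query to the handler registered for its route.

    handlers maps route name -> callable(query); it must contain DEFAULT_ROUTE."""
    label = classify(query, llm)
    handler = handlers.get(label, handlers[DEFAULT_ROUTE])
    return label, handler(query)
```

Returning the chosen label alongside the result makes misroutes visible in logs, which is where you would check whether the router is earning its keep.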
My take. Routing is powerful and also where over-engineering creeps in fastest — it's the "modular RAG" temptation from Chapter 01 made concrete. Build it only once you have measured distinct query classes that genuinely need distinct handling, and keep the routes few and named. A router with eleven branches is a maintenance burden and a new failure surface — the classifier itself can misroute. Two or three obvious routes, added when the data demands them, is almost always the right amount.
Almost every technique in this chapter adds at least one LLM call before retrieval even begins. That's latency the user feels and money you spend on every query. Query rewriting adds one fast call. HyDE adds an answer-generation call. Multi-query adds a call plus several extra retrievals. Decomposition adds a call plus a retrieval per sub-question. Step-back adds a call plus one extra retrieval. Routing adds a classification step, which may itself be an LLM call. None are free, and stacking all of them turns a 200-millisecond retrieval into a multi-second pipeline.
| Technique | Use when | Added cost |
|---|---|---|
| Rewriting | Queries are messy or conversational | One fast LLM call |
| HyDE | Queries are short; answers are verbose | One answer-generation call |
| Multi-query | Recall is low; query is narrowly phrased | One call + N extra retrievals |
| Decomposition | Questions are compound or multi-part | One call + one retrieval per sub-question |
| Step-back | Questions are over-specific and match nothing | One call + one extra retrieval |
| Routing | You have measured distinct query classes | One classification step per query |
The discipline: add these one at a time, measure the quality gain against the latency cost on your eval set, and keep only the ones that pay for themselves. A naive system with none of these is the correct starting point — exactly as Chapter 01 argued. Reach here only when you've measured a specific query problem these techniques solve.
Collect fifty real user queries (or realistic ones). Sort them into buckets: clean and answerable as-is, messy/needs-rewriting, compound/needs-decomposition, over-specific/needs-step-back. The size of each bucket tells you which technique is worth building first — and how many of your queries need nothing at all.
Take your compound queries and run them through the decomposition code (or just an LLM prompt). Retrieve for the whole question and for each sub-question separately. Compare the chunks. The improvement on compound questions is usually dramatic and immediately convincing.
For one technique you're considering, time a query end-to-end with and without it. Put that number next to the quality gain on your eval set. This quality-per-millisecond view is the only honest basis for deciding whether a query technique earns its place in your pipeline.
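A small harness for that comparison might look like this. It is a sketch: `fn` stands for whichever pipeline variant you are timing, and the quality scores are whatever metric your eval set already produces.

```python
import statistics
import time


def time_pipeline(fn, queries, repeats=3):
    """Median wall-clock latency in milliseconds of fn(query) over all queries."""
    samples = []
    for q in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(q)
            samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def cost_benefit(baseline_ms, variant_ms, baseline_quality, variant_quality):
    """Put the latency cost next to the quality gain for one technique."""
    added_ms = variant_ms - baseline_ms
    gain = variant_quality - baseline_quality
    return {
        "added_ms": added_ms,
        "quality_gain": gain,
        # Undefined when the variant is not slower than the baseline.
        "gain_per_100ms": gain / added_ms * 100 if added_ms > 0 else None,
    }
```

Run `time_pipeline` once for the baseline and once for the variant on the same queries, then feed both medians and both eval scores to `cost_benefit`. The median is used instead of the mean so that one slow LLM call does not dominate the number.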
Next chapter: Generation — grounding, citation, refusal. We've found the right chunks. Now the model has to answer from them — faithfully, with citations, and with the discipline to say "I don't know" when the evidence isn't there. This is where retrieval becomes an answer.