RAG: A Field Manual for Building LLM Systems That Use Your Data

Reranking — the second-stage detail

Your hybrid retriever from the last chapter is fast and casts a wide net — it hands you, say, the fifty most promising chunks in a few milliseconds. But "promising" is not "best." The retriever was built for speed, and speed forces it to judge each chunk in isolation, never directly comparing the chunk against the actual question word for word. Reranking is the second stage that does exactly that: it takes the retriever's candidates and re-orders them with a slower, far sharper model that reads the query and each chunk together. It is one of the largest quality gains available for the least architectural disruption — a bolt-on, not a rebuild.

What you'll take away from this chapter

Why the retriever can't just be precise in the first place — the bi-encoder bottleneck
How a cross-encoder reranker reads query and chunk together, and why that's sharper
The retrieve-many-then-rerank-few pattern, and how to choose the two numbers
The three kinds of reranker — cross-encoder, LLM-as-reranker, listwise — and when each fits
The honest lift reranking adds and the latency it costs

Why retrieval can't already be this good

To understand reranking you have to understand the limitation it fixes. Your vector retriever uses what's called a bi-encoder: it embeds the query into one vector and every chunk into its own vector, ahead of time, completely independently. The chunk was turned into a vector months ago without ever having seen your query. At search time you just compare two pre-made vectors. That independence is precisely what makes retrieval fast — the chunk vectors are computed once and reused for every query forever — but it's also what limits its precision. The model never got to look at the query and the chunk side by side and ask "does this specific passage actually answer this specific question?"

A cross-encoder does exactly that. It takes the query and one chunk together, as a single joined input, and produces one number: how well this chunk answers this query. Because it sees both at once, it catches nuances the bi-encoder structurally cannot — negations, conditions, whether the passage is about the subject or merely mentions it. The catch: it has to run the model fresh for every query-chunk pair, so you cannot precompute anything. Running a cross-encoder over fifty million chunks per query is hopeless. Running it over the fifty candidates the retriever already narrowed things to is trivial. That asymmetry is the whole design.

The bi-encoder embeds query and chunk separately — fast, because chunks are pre-embedded, but it never compares them directly. The cross-encoder reads the pair together for a sharper score, at the cost of running per query-chunk pair. Use the fast one to narrow, the sharp one to order.
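To put rough numbers on that asymmetry, here is a back-of-envelope sketch. The 1 ms per query-chunk pair is an assumed figure chosen for illustration, not a measurement, and the arithmetic ignores batching; substitute timings from your own model and hardware.

```python
# Back-of-envelope cost of cross-encoder scoring.
# ASSUMPTION: 1 ms per (query, chunk) pair. Measure your own model and hardware.
MS_PER_PAIR = 1.0

def cross_encoder_seconds(num_chunks, ms_per_pair=MS_PER_PAIR):
    """Seconds to score one query against num_chunks chunks,
    at one model forward pass per (query, chunk) pair."""
    return num_chunks * ms_per_pair / 1000.0

whole_corpus = cross_encoder_seconds(50_000_000)   # every chunk, for every query
candidates_only = cross_encoder_seconds(50)        # only the retriever's shortlist

print(f"whole corpus:  {whole_corpus / 3600:.1f} hours per query")
print(f"50 candidates: {candidates_only * 1000:.0f} ms per query")
```

Whatever the real per-pair cost turns out to be, the ratio between the two lines stays the same: a factor of one million.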

The retrieve-many-then-rerank-few pattern

The pattern is two numbers. Retrieve a generous candidate set — call it the retrieval depth, often 25 to 100 chunks — with your fast hybrid retriever. Then rerank those candidates with the cross-encoder and keep the true top few — call it the final k, often 3 to 8 — to hand to the model. The retriever's job shifts subtly: it no longer has to get the best chunk into the top 5, only into the top 50. That is a far easier job, and the reranker handles the precise ordering the retriever was never good at.

Choosing the depth is the real decision. Too shallow and the right chunk may never reach the reranker — if it's the 60th result and you only retrieve 50, no reranker can save it. Too deep and you pay to rerank candidates that were never plausible. The sweet spot is wide enough that the right answer is almost always somewhere in the candidate set, which you verify by measuring recall at your retrieval depth: if recall@50 is high but recall@5 is mediocre, reranking is exactly your missing piece — the answer is being found but not ranked, and reranking fixes ranking.
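The recall check described above can be sketched in a few lines. This is a minimal version under two assumptions: each eval question has a known set of relevant chunk ids, and a question counts as a hit if any relevant chunk appears in the top k. The eval set here is made up and tiny, with k=2 and k=4 standing in for recall@5 and recall@50.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant chunk id appears in the top k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0

def mean_recall_at_k(eval_set, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in eval_set) / len(eval_set)

# Hypothetical eval set: retriever output (chunk ids, best first) and gold ids.
eval_set = [
    (["c7", "c2", "c9", "c4"], {"c7"}),   # answer ranked 1st
    (["c1", "c5", "c3", "c8"], {"c8"}),   # answer found, but ranked 4th
    (["c6", "c0", "c2", "c1"], {"c9"}),   # answer never retrieved
]

shallow = mean_recall_at_k(eval_set, k=2)   # stands in for recall@5
deep = mean_recall_at_k(eval_set, k=4)      # stands in for recall@50
print(f"shallow={shallow:.2f}  deep={deep:.2f}  gap={deep - shallow:.2f}")
```

The second question is the case reranking can fix: the answer is in the candidate set but ranked too low. The third is the case it cannot fix, because the answer was never retrieved.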

A cross-encoder reranker in code

Here is reranking bolted onto the hybrid retriever from Chapter 06. The retriever returns many candidates; the cross-encoder scores each against the query; we keep the best.

# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# a small, fast, well-regarded cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, final_k=5):
    """Re-order retriever candidates by true query-chunk relevance.
    `candidates` is a list of chunk strings from the fast retriever."""
    # the cross-encoder scores each (query, chunk) PAIR
    pairs = [(query, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)            # one score per pair
    # sort candidates by score, highest first, keep final_k
    ranked = sorted(zip(candidates, scores),
                    key=lambda x: x[1], reverse=True)
    return ranked[:final_k]

# retrieve 50 candidates from the hybrid retriever, then rerank down to 3
query = "what happens to my data if I close my account"
candidates = hybrid_search(query, top_n=50)     # from Chapter 06
for chunk, score in rerank(query, candidates, final_k=3):
    print(f"{score:+.2f}  {chunk[:70]}")

+8.41  When you close your account, all personal data is deleted within 30 ...
+2.07  Account closure is permanent and cannot be reversed once confirmed ...
-1.55  To close your account, go to Settings then Account then Close.

Notice that the reranker's scores aren't similarities between 0 and 1. They're raw relevance scores on the cross-encoder's own scale, which is fine because we only use them to sort. And notice the ordering: the chunk that directly answers "what happens to my data" scores far above the chunk that merely explains how to close the account, even though both are obviously about account closure and a bi-encoder might have rated them similarly. That discrimination, between "about the right topic" and "actually answers the question," is what the cross-encoder buys you.

Three kinds of reranker

Type	How it works	Trade-off
Cross-encoder	A dedicated small model scores each query-chunk pair.	Fast for a reranker, cheap, strong. The default. Self-hostable.
LLM-as-reranker	Ask a general LLM to score or order the candidates.	Flexible and can follow instructions ("prefer recent"), but slower and pricier per query. Reach for it when relevance is nuanced.
Listwise reranker	A model that orders the whole candidate list at once, not pair by pair.	Can reason about candidates relative to each other; more complex to run. Useful at the high end.

For most systems the dedicated cross-encoder is the right starting point: it's the cheapest, it self-hosts, and the quality is excellent. Consider an LLM-based reranker only when relevance depends on instructions a fixed model can't take — "weight newer documents higher," "prefer official sources." Listwise rerankers are a refinement for when you've exhausted the simpler options and measured a reason to go further.
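For a sense of what the LLM-based option involves, here is a minimal sketch. It is not tied to any provider: `call_llm` is a hypothetical callable (prompt string in, reply string out) that you would wire to your own model, and the prompt wording and the reply format are assumptions. Most of the work is in parsing defensively, because the model may skip, repeat, or invent passage numbers.

```python
import re

def build_rerank_prompt(query, candidates, instruction=""):
    """Number the candidates and ask for a comma-separated ordering."""
    lines = [f"[{i}] {chunk}" for i, chunk in enumerate(candidates, start=1)]
    return (
        "Rank the passages by how well they answer the question.\n"
        f"{instruction}\n"
        f"Question: {query}\n"
        + "\n".join(lines)
        + "\nReply with passage numbers only, best first, e.g. 2,1,3."
    )

def parse_ranking(reply, num_candidates):
    """Turn a reply like '3, 1, 2' into 0-based indices.
    Out-of-range or repeated numbers are dropped; passages the model
    omitted are appended in their original retriever order."""
    order = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        if 0 <= idx < num_candidates and idx not in order:
            order.append(idx)
    order += [i for i in range(num_candidates) if i not in order]
    return order

def llm_rerank(query, candidates, call_llm, final_k=5, instruction=""):
    """`call_llm` is a hypothetical callable: prompt string in, reply string out."""
    reply = call_llm(build_rerank_prompt(query, candidates, instruction))
    order = parse_ranking(reply, len(candidates))
    return [candidates[i] for i in order[:final_k]]

# Stand-in for a real model call, so the sketch runs offline.
fake_llm = lambda prompt: "2, 3, 1"
docs = ["how to close", "data is deleted in 30 days", "closure is permanent"]
top = llm_rerank("what happens to my data", docs, fake_llm, final_k=2,
                 instruction="Prefer official sources.")
print(top)
```

The `instruction` argument is where "weight newer documents higher" or "prefer official sources" would go, which is the flexibility a fixed cross-encoder cannot offer.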

The lift, and the cost

Recall the four-stack table from the last chapter. Reranking was the jump from hybrid's 0.87 to 0.92 — five points of recall@5 on top of an already-good system. Here's the fuller picture, including the cost side that the marketing leaves out:

Setup	Recall@5	Added latency per query
Hybrid, no rerank	0.87	—
Hybrid + cross-encoder (depth 50 → 5)	0.92	tens of milliseconds (self-hosted)
Hybrid + LLM reranker (depth 50 → 5)	~0.92	hundreds of ms + API cost

The honest reading: a self-hosted cross-encoder gives you most of the available lift for a modest, predictable latency cost, which is why it's the default. The LLM reranker reaches similar quality but adds real latency and per-query expense, so it is worth it only when its instruction-following flexibility solves a problem the cross-encoder can't. And the latency you add is partly tunable: reranking 30 candidates instead of 50 is faster and usually nearly as good. Measure on your traffic.

My take. Reranking is the second thing to add, right after hybrid search, and the order matters. Hybrid fixes recall — getting the right chunk into the candidate set. Reranking fixes precision — getting it to the top of that set. Adding a reranker on top of pure vector search, while skipping hybrid, leaves recall gains on the table: you're carefully re-ordering a candidate set that's still missing answers. Fix what's findable first, then fix the ordering.

When this fails

Rerank depth too shallow. If you retrieve only 10 candidates and rerank to 5, the reranker can only re-order what the retriever found. The chunk sitting at retrieval rank 15 never gets a chance. Set depth from your recall curve, not from habit.
Reranking without hybrid first. A reranker improves ordering, not recall. If the right chunk isn't in the candidate set, no reranker conjures it. Add hybrid search (recall) before reranking (precision).
Treating cross-encoder scores as probabilities. They're unbounded relevance scores on the model's own scale, useful only for sorting and rough thresholding. Don't display them as "92% confident" — they're not calibrated for that.
Reranking far more than you'll use. Reranking 200 candidates to return 5 mostly pays to score chunks that never had a chance. Depth should be wide enough to catch the answer and no wider; the cost is linear in depth.
Ignoring the latency budget. Reranking adds real time to every query. In an interactive product that has a perceptible-latency ceiling, an LLM reranker at depth 100 can blow it. Measure end-to-end latency, not just quality, and tune depth and reranker type to fit the budget from Chapter 05's operational concerns.
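On the point about scores: the "rough thresholding" use might look like the sketch below. The helper name `keep_above` and the cutoff of 0.0 are assumptions for illustration. The scale differs from model to model, so any cutoff has to be tuned on your own eval set, and the result is still a filter, not a confidence.

```python
def keep_above(ranked, min_score, final_k=5):
    """`ranked` is a list of (chunk, score) pairs sorted best first,
    as rerank() returns. Keep at most final_k pairs that clear min_score."""
    return [(chunk, score) for chunk, score in ranked[:final_k]
            if score >= min_score]

ranked = [("data deleted within 30 days", 8.41),
          ("closure is permanent", 2.07),
          ("go to Settings then Account", -1.55)]

# ASSUMPTION: 0.0 is an arbitrary cutoff, not a recommended value.
kept = keep_above(ranked, min_score=0.0)
print(kept)
```

If nothing clears the cutoff the function returns an empty list, which lets the caller say "no good passage found" instead of passing weak context to the model.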

Practice — before you read the next chapter

Measure your recall gap

On your eval set, measure recall at your retrieval depth (say recall@50) and recall at your final k (recall@5) without reranking. A large gap — answers found at 50 but not ranked into 5 — is a direct prediction of how much reranking will help. A small gap means your retriever already ranks well and reranking will add little.

Bolt on a reranker

Take the code above, feed it the candidates from your hybrid retriever, and measure recall@5 before and after reranking on your eval set. Time it too. You'll have a concrete quality-versus-latency number for your own system, which is the only number that should drive the decision.
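For the timing half of this exercise, a small harness like the following is enough to start. It is a sketch using only the standard library; `rerank_fn` stands for whichever reranker you are testing, and `toy_rerank` is a placeholder so the block runs without a model.

```python
import statistics
import time

def added_latency_ms(rerank_fn, workload):
    """Time rerank_fn(query, candidates) once per query.
    Returns (median_ms, worst_ms) over the workload."""
    # One untimed warm-up call so model loading does not skew the first sample.
    first_query, first_candidates = workload[0]
    rerank_fn(first_query, first_candidates)

    samples = []
    for query, candidates in workload:
        start = time.perf_counter()
        rerank_fn(query, candidates)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)

# Toy stand-in for the real reranker so the sketch runs anywhere.
def toy_rerank(query, candidates):
    return sorted(candidates, key=len)[:5]

workload = [("q1", ["a", "bbb", "cc"]), ("q2", ["dddd", "e"])]
median_ms, worst_ms = added_latency_ms(toy_rerank, workload)
print(f"median {median_ms:.3f} ms, worst {worst_ms:.3f} ms")
```

Report the worst case alongside the median: an interactive latency ceiling is broken by slow queries, not by typical ones.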

Sweep the depth

Rerank with depth 20, 50, and 100, holding final k at 5. Plot recall@5 and added latency against depth. You'll see recall rise then flatten while latency climbs linearly — your ideal depth is where the recall curve flattens. This is the same "find the plateau" discipline from Chapter 03, applied to a different knob.
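Once you have the sweep, "where the curve flattens" can be made explicit with a simple rule. The sketch below is one possible rule, not the only one: it picks the smallest depth after which the next step adds less than `min_gain` recall. The function name, the 0.01 threshold, and the sweep numbers are all hypothetical.

```python
def plateau_depth(recall_by_depth, min_gain=0.01):
    """Smallest depth after which going deeper adds less than min_gain recall.
    `recall_by_depth` maps retrieval depth -> measured recall@5 after reranking."""
    depths = sorted(recall_by_depth)
    for shallower, deeper in zip(depths, depths[1:]):
        if recall_by_depth[deeper] - recall_by_depth[shallower] < min_gain:
            return shallower
    return depths[-1]   # still climbing: the sweep did not reach the plateau

# Hypothetical sweep results, for illustration only (not measurements).
sweep = {20: 0.88, 50: 0.92, 100: 0.925}
best = plateau_depth(sweep)
print(f"chosen depth: {best}")
```

If the function returns the deepest value you tried, recall was still rising at the end of the sweep, so extend it before settling on a depth.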

Takeaways

The retriever uses a bi-encoder — fast because chunks are pre-embedded, but it never compares query and chunk directly. That's its precision ceiling.
A cross-encoder reranker reads query and chunk together for a sharper relevance score, at the cost of running per pair — so you run it only on the retriever's candidates.
The pattern is retrieve-many (depth 25–100) then rerank-few (final k 3–8). Set depth from your recall curve.
Cross-encoder is the default reranker; LLM-as-reranker buys instruction-following flexibility at a latency and cost premium; listwise is a high-end refinement.
Reranking adds meaningful precision for modest latency — but add hybrid search first. Recall before precision; you can't re-order an answer that was never retrieved.

Next chapter: Query understanding — rewrite, decompose, route. So far we've improved how we search; now we improve what we search for. Real user queries are messy, ambiguous, and sometimes several questions at once. We'll fix the query before it ever hits the index.
