Your hybrid retriever from the last chapter is fast and casts a wide net — it hands you, say, the fifty most promising chunks in a few milliseconds. But "promising" is not "best." The retriever was built for speed, and speed forces it to judge each chunk in isolation, never directly comparing the chunk against the actual question word for word. Reranking is the second stage that does exactly that: it takes the retriever's candidates and re-orders them with a slower, far sharper model that reads the query and each chunk together. It is one of the largest quality gains available for the least architectural disruption — a bolt-on, not a rebuild.
To understand reranking you have to understand the limitation it fixes. Your vector retriever uses what's called a bi-encoder: it embeds the query into one vector and every chunk into its own vector, ahead of time, completely independently. The chunk was turned into a vector months ago without ever having seen your query. At search time you just compare two pre-made vectors. That independence is precisely what makes retrieval fast — the chunk vectors are computed once and reused for every query forever — but it's also what limits its precision. The model never got to look at the query and the chunk side by side and ask "does this specific passage actually answer this specific question?"
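To make the data flow concrete, here is a minimal sketch of the bi-encoder pattern. The `toy_embed` function is a hypothetical stand-in (a hashed bag-of-words, not a real embedding model), and `build_index` and `search` are illustrative names rather than anything from the previous chapter. Only the shape of the pipeline matters: chunks are embedded once offline, and the query is embedded alone at search time.

```python
import math
import zlib
from collections import Counter

def toy_embed(text, dim=64):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised.
    A real bi-encoder is a neural model; this only mimics 'text in, vector out'."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(chunks):
    """Offline step: embed every chunk once, with no query in sight."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def search(query, index, top_n=2):
    """Online step: embed the query alone, then compare pre-made vectors."""
    q = toy_embed(query)
    scored = [(chunk, sum(a * b for a, b in zip(q, vec))) for chunk, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

chunks = [
    "when you close your account all personal data is deleted",
    "to close your account go to settings",
    "refunds are issued within 30 days",
]
index = build_index(chunks)  # computed once, reused for every query
for chunk, score in search("what happens to my data", index):
    print(f"{score:.2f}  {chunk}")
```

Nothing in `search` ever reads the query and a chunk together; the only interaction is a dot product between two finished vectors.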
A cross-encoder does exactly that. It takes the query and one chunk together, as a single joined input, and produces one number: how well this chunk answers this query. Because it sees both at once, it catches nuances the bi-encoder structurally cannot — negations, conditions, whether the passage is about the subject or merely mentions it. The catch: it has to run the model fresh for every query-chunk pair, so you cannot precompute anything. Running a cross-encoder over fifty million chunks per query is hopeless. Running it over the fifty candidates the retriever already narrowed things to is trivial. That asymmetry is the whole design.
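A back-of-envelope calculation shows how lopsided that asymmetry is. The per-pair cost below is a hypothetical figure chosen for illustration, not a benchmark; substitute a number measured on your own model and hardware.

```python
# Hypothetical per-pair scoring cost; measure your own model and hardware.
MS_PER_PAIR = 0.5

def rerank_cost_ms(num_pairs, ms_per_pair=MS_PER_PAIR):
    """Cross-encoder cost grows linearly: one forward pass per (query, chunk) pair."""
    return num_pairs * ms_per_pair

whole_corpus_ms = rerank_cost_ms(50_000_000)  # score every chunk in the corpus
shortlist_ms = rerank_cost_ms(50)             # score only the retriever's candidates

print(f"whole corpus: {whole_corpus_ms / 3_600_000:.1f} hours per query")
print(f"shortlist:    {shortlist_ms:.0f} ms per query")
# whole corpus: 6.9 hours per query
# shortlist:    25 ms per query
```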
The pattern is two numbers. Retrieve a generous candidate set — call it the retrieval depth, often 25 to 100 chunks — with your fast hybrid retriever. Then rerank those candidates with the cross-encoder and keep the true top few — call it the final k, often 3 to 8 — to hand to the model. The retriever's job shifts subtly: it no longer has to get the best chunk into the top 5, only into the top 50. That is a far easier job, and the reranker handles the precise ordering the retriever was never good at.
Choosing the depth is the real decision. Too shallow and the right chunk may never reach the reranker — if it's the 60th result and you only retrieve 50, no reranker can save it. Too deep and you pay to rerank candidates that were never plausible. The sweet spot is wide enough that the right answer is almost always somewhere in the candidate set, which you verify by measuring recall at your retrieval depth: if recall@50 is high but recall@5 is mediocre, reranking is exactly your missing piece — the answer is being found but not ranked, and reranking fixes ranking.
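The recall check described above can be sketched in a few lines. The eval format here (one known-correct chunk id per query) is an assumption for illustration, and the toy data uses a depth of 4 and a final k of 2 in place of 50 and 5 so the numbers are easy to follow.

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries whose relevant chunk appears in the top k results.

    results:  {query: [chunk_id, ...]} ranked best-first by the retriever
    relevant: {query: chunk_id}, one known-correct chunk per query
              (an assumed eval format; adapt it to your own labels)
    """
    hits = sum(1 for q, ranked in results.items() if relevant[q] in ranked[:k])
    return hits / len(results)

# Toy retriever output for four eval queries.
results = {
    "q1": ["c7", "c2", "c9", "c4"],
    "q2": ["c1", "c8", "c3", "c5"],
    "q3": ["c6", "c0", "c2", "c3"],
    "q4": ["c5", "c9", "c1", "c7"],
}
relevant = {"q1": "c9", "q2": "c1", "q3": "c3", "q4": "c2"}

print(recall_at_k(results, relevant, k=4))  # 0.75: mostly found at full depth
print(recall_at_k(results, relevant, k=2))  # 0.25: but rarely ranked near the top
```

In this toy data, three of four answers are in the candidate set but only one is ranked into the top 2, which is the "found but not ranked" pattern. The fourth query (`q4`) never retrieves its answer at all, so no reranker could recover it.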
Here is reranking bolted onto the hybrid retriever from Chapter 06. The retriever returns many candidates; the cross-encoder scores each against the query; we keep the best.
```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# a small, fast, well-regarded cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, final_k=5):
    """Re-order retriever candidates by true query-chunk relevance.
    `candidates` is a list of chunk strings from the fast retriever."""
    # the cross-encoder scores each (query, chunk) PAIR
    pairs = [(query, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)  # one score per pair
    # sort candidates by score, highest first, keep final_k
    ranked = sorted(zip(candidates, scores),
                    key=lambda x: x[1], reverse=True)
    return ranked[:final_k]

# retrieve_depth=50 from the hybrid retriever, then rerank down to 3
query = "what happens to my data if I close my account"
candidates = hybrid_search(query, top_n=50)  # from Chapter 06
for chunk, score in rerank(query, candidates, final_k=3):
    print(f"{score:+.2f} {chunk[:70]}")
```

```
+8.41 When you close your account, all personal data is deleted within 30 ...
+2.07 Account closure is permanent and cannot be reversed once confirmed ...
-1.55 To close your account, go to Settings then Account then Close.
```
Notice that the reranker's scores aren't similarities between 0 and 1. They're raw relevance scores on the cross-encoder's own scale, which can be negative, and that is fine because we only use them to sort. And notice the ordering: the chunk that directly answers "what happens to my data" scores far above the chunk that merely explains how to close the account, even though both are obviously about account closure and a bi-encoder might have rated them similarly. That discrimination, "about the right topic" versus "actually answers the question," is what the cross-encoder buys you.
| Type | How it works | Trade-off |
|---|---|---|
| Cross-encoder | A dedicated small model scores each query-chunk pair. | Fast for a reranker, cheap, strong. The default. Self-hostable. |
| LLM-as-reranker | Ask a general LLM to score or order the candidates. | Flexible and can follow instructions ("prefer recent"), but slower and pricier per query. Reach for it when relevance is nuanced. |
| Listwise reranker | A model that orders the whole candidate list at once, not pair by pair. | Can reason about candidates relative to each other; more complex to run. Useful at the high end. |
For most systems the dedicated cross-encoder is the right starting point: it's the cheapest, it self-hosts, and the quality is excellent. Consider an LLM-based reranker only when relevance depends on instructions a fixed model can't take — "weight newer documents higher," "prefer official sources." Listwise rerankers are a refinement for when you've exhausted the simpler options and measured a reason to go further.
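For comparison, here is a rough sketch of the LLM-as-reranker idea with an instruction attached. Everything in it is an assumption for illustration: `ask_llm` is a hypothetical callable you would wire to your own model client, the prompt wording and the 0 to 10 scale are arbitrary choices, and `fake_llm` exists only so the sketch runs offline.

```python
def llm_rerank(query, candidates, ask_llm, instruction="", final_k=5):
    """Score each candidate with a general LLM and keep the best final_k.

    ask_llm is a hypothetical callable (prompt: str) -> str that you supply;
    it is expected to reply with a number from 0 to 10.
    """
    scored = []
    for chunk in candidates:
        prompt = (
            "Rate from 0 to 10 how well the passage answers the question. "
            f"{instruction}\nReply with the number only.\n"
            f"Question: {query}\nPassage: {chunk}"
        )
        reply = ask_llm(prompt)
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0  # unparseable reply: treat the chunk as irrelevant
        scored.append((chunk, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:final_k]

# Stand-in for a real model call, used only so the sketch runs offline.
def fake_llm(prompt):
    return "9" if "deleted" in prompt else "not sure"

candidates = [
    "To close your account, go to Settings then Account then Close.",
    "When you close your account, all personal data is deleted within 30 days.",
]
ranked = llm_rerank(
    "what happens to my data if I close my account",
    candidates,
    ask_llm=fake_llm,
    instruction="Prefer official policy text.",
    final_k=1,
)
print(ranked)
```

The `instruction` argument is the part a fixed cross-encoder cannot take. The price is one model call per candidate in this pairwise form, which is where the extra latency and per-query cost come from.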
Recall the four-stack table from the last chapter. Reranking was the jump from hybrid's 0.87 to 0.92 — five points of recall@5 on top of an already-good system. Here's the fuller picture, including the cost side that the marketing leaves out:
| Setup | Recall@5 | Added latency per query |
|---|---|---|
| Hybrid, no rerank | 0.87 | — |
| Hybrid + cross-encoder (depth 50 → 5) | 0.92 | tens of milliseconds (self-hosted) |
| Hybrid + LLM reranker (depth 50 → 5) | ~0.92 | hundreds of ms + API cost |
The honest reading: a self-hosted cross-encoder gives you most of the available lift for a modest, predictable latency cost, which is why it's the default. The LLM reranker reaches similar quality but adds real latency and per-query expense, so it is worth it only when its instruction-following flexibility solves a problem the cross-encoder can't. And the latency you add is partly tunable: reranking at depth 30 instead of 50 is faster and usually nearly as good. Measure on your traffic.
My take. Reranking is the second thing to add, right after hybrid search, and the order matters. Hybrid fixes recall — getting the right chunk into the candidate set. Reranking fixes precision — getting it to the top of that set. Adding a reranker on top of pure vector search, while skipping hybrid, leaves recall gains on the table: you're carefully re-ordering a candidate set that's still missing answers. Fix what's findable first, then fix the ordering.
On your eval set, measure recall at your retrieval depth (say recall@50) and recall at your final k (recall@5) without reranking. A large gap — answers found at 50 but not ranked into 5 — is a direct prediction of how much reranking will help. A small gap means your retriever already ranks well and reranking will add little.
Take the code above, feed it the candidates from your hybrid retriever, and measure recall@5 before and after reranking on your eval set. Time it too. You'll have a concrete quality-versus-latency number for your own system, which is the only number that should drive the decision.
Rerank with depth 20, 50, and 100, holding final k at 5. Plot recall@5 and added latency against depth. You'll see recall rise and then flatten while latency climbs roughly linearly, because each extra candidate is one more query-chunk pair to score. Your ideal depth is where the recall curve flattens. This is the same "find the plateau" discipline from Chapter 03, applied to a different knob.
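A small harness for this sweep might look like the sketch below. It assumes an eval set of `(query, relevant_chunk)` pairs and takes the retriever and reranker as arguments with the same call shapes as `hybrid_search(query, top_n=...)` and `rerank(query, candidates, final_k)` used earlier. The `fake_retrieve` and `fake_rerank` functions are stand-ins so the example runs without a model.

```python
import time

def sweep_depths(eval_set, retrieve, rerank, depths=(20, 50, 100), final_k=5):
    """Return (depth, recall@final_k, mean rerank latency in ms) per depth.

    eval_set: list of (query, relevant_chunk) pairs (an assumed format)
    retrieve(query, top_n=...) -> list of chunk strings
    rerank(query, candidates, final_k) -> list of (chunk, score), best first
    """
    rows = []
    for depth in depths:
        hits, rerank_seconds = 0, 0.0
        for query, relevant_chunk in eval_set:
            candidates = retrieve(query, top_n=depth)
            start = time.perf_counter()
            top = rerank(query, candidates, final_k)
            rerank_seconds += time.perf_counter() - start  # time the reranker only
            hits += any(chunk == relevant_chunk for chunk, _ in top)
        rows.append((depth, hits / len(eval_set),
                     1000 * rerank_seconds / len(eval_set)))
    return rows

# Stand-ins so the sketch runs offline; swap in your real retriever and reranker.
corpus = [f"doc {i}" for i in range(100)]
corpus[10] += " alpha"
corpus[40] += " beta"
corpus[80] += " gamma"

def fake_retrieve(query, top_n=50):
    return corpus[:top_n]

def fake_rerank(query, candidates, final_k=5):
    scored = [(c, 1.0 if query in c else 0.0) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:final_k]

eval_set = [("alpha", corpus[10]), ("beta", corpus[40]), ("gamma", corpus[80])]
rows = sweep_depths(eval_set, fake_retrieve, fake_rerank)
for depth, recall, latency_ms in rows:
    print(f"depth={depth:>3}  recall@5={recall:.2f}  rerank latency={latency_ms:.3f} ms")
```

With the real functions plugged in, the rows give you the two curves to plot. The latency column covers only the rerank call, so it reflects the added cost rather than the retriever's own time.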
Next chapter: Query understanding — rewrite, decompose, route. So far we've improved how we search; now we improve what we search for. Real user queries are messy, ambiguous, and sometimes several questions at once. We'll fix the query before it ever hits the index.