Semantic search feels like magic when you first see it — "how do I stop being billed" finding the chunk about cancelling a subscription, no shared words required. So it's tempting to conclude that vector search is simply better than the old keyword approach and be done with it. That conclusion is wrong, and the way it's wrong will cost you. Vector search has a blind spot exactly where keywords are strongest: exact terms, product codes, names, error messages, rare jargon. The strongest retrieval in 2026 is not vector or keyword. It's both, fused.
This chapter explains why each method fails where the other succeeds, how to combine them so the combination beats either alone, and — with numbers on the same set of queries — how much that combination actually buys you.
Vector search matches on meaning, which is exactly what you want for "how do I get my money back" → the refund chunk. But meaning-matching is a weakness when the user's intent is tied to an exact string that carries little semantic content of its own. Consider queries built around an error code such as "E-4021", a product code, a person's name, or a piece of rare jargon.
The pattern: when the right answer hinges on a rare or exact token rather than a concept, semantic similarity dilutes it. Keyword search has the opposite profile — it nails exact tokens and is blind to meaning. Each method's strength is the other's weakness, which is the whole argument for combining them.
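The workhorse of keyword search is an algorithm called BM25. You don't need its formula, but its three intuitions are worth holding because they explain its behaviour: a rare term counts for more than a common one, repeated occurrences of a term help but with diminishing returns, and long documents are normalised so they don't win on length alone.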
BM25 is decades old, runs anywhere, needs no model and no GPU, and is extremely fast. It is not a legacy curiosity you tolerate; it is a genuinely strong retriever for the exact-token queries vector search fumbles. Treat it as a peer, not a fallback.
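To make those intuitions concrete, here is a from-scratch sketch of a textbook Okapi BM25 scorer. It is an illustration, not the code inside any particular library: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.5` and `b=0.75` (typical defaults) are assumptions, and library implementations differ in details such as the exact IDF formula.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms carry more weight (inverse document frequency).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Documents longer than average are penalised (b controls how much).
            length_norm = 1 - b + b * len(doc) / avg_len
            # Repeated matches help, but with diminishing returns (k1 caps the gain).
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy, already-tokenised documents (made up for this sketch).
docs = [
    "error e-4021 means the payment gateway timed out".split(),
    "the refund is issued within 30 days of the purchase".split(),
    "the the the the plan".split(),
]
rare = bm25_scores(["e-4021"], docs)    # term found in one document only
common = bm25_scores(["the"], docs)     # term found in every document
print([round(s, 3) for s in rare])
print([round(s, 3) for s in common])
```

A single occurrence of the rare token "e-4021" outscores four occurrences of "the", which is the behaviour that makes BM25 strong on codes and names.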
So you run both retrievers and get two ranked lists. How do you merge them into one? The scores aren't comparable — a cosine similarity of 0.8 and a BM25 score of 14.2 live on different scales, and normalising them is fiddly and fragile. The robust answer is to ignore the scores entirely and fuse on rank instead, using Reciprocal Rank Fusion (RRF).
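To see why normalising scores is fragile, consider min-max scaling, picked here purely as an illustration of one common approach. The score values below are made up. A single outlier changes the normalised score of every other chunk, even though their ranks barely move:

```python
def min_max(scores):
    """Scale scores linearly so the lowest becomes 0.0 and the highest 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Illustrative BM25 scores for three chunks (made-up values).
bm25_scores = [14.2, 12.0, 3.0]
before = min_max(bm25_scores)

# The same three chunks, plus one outlier that matched unusually well.
after = min_max([40.0] + bm25_scores)

print([round(x, 2) for x in before])  # [1.0, 0.8, 0.0]
print([round(x, 2) for x in after])   # [1.0, 0.3, 0.24, 0.0]
```

The chunk that scored 14.2 falls from 1.0 to about 0.30 while moving only from first to second place. Fusing on rank is immune to that kind of swing.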
RRF is beautifully simple: a chunk's fused score is the sum, across both lists, of 1 / (k + rank), where rank is its 1-based position in that list and k is a smoothing constant (60 is the conventional default). A chunk ranked first in both lists gets the highest possible score. A chunk ranked first in one list and absent from the other gets exactly half of that, which is still respectable. Because it uses position, not raw score, it sidesteps the scale-mismatch problem completely.
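A few lines of arithmetic make the formula tangible. This is a minimal sketch with 1-based ranks and k = 60; the helper name `rrf_score` is made up for the illustration.

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over the lists a chunk appears in (ranks are 1-based)."""
    return sum(1 / (k + r) for r in ranks)

first_in_both = rrf_score([1, 1])     # 2/61
first_in_one = rrf_score([1])         # 1/61, absent from the other list
tenth_in_both = rrf_score([10, 10])   # 2/70

print(round(first_in_both, 4))  # 0.0328
print(round(first_in_one, 4))   # 0.0164
print(round(tenth_in_both, 4))  # 0.0286
```

With k = 60, a chunk that both retrievers place tenth outscores a chunk that only one retriever places first. RRF rewards agreement between the lists more than a single strong vote.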
Here is the whole hybrid retriever: BM25 over the chunks, vector search over the same chunks, then RRF to fuse the two rankings into one. It's less code than people expect.
```python
# pip install rank-bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "To cancel your subscription, open Account then Billing.",
    "Refunds are issued within 30 days of purchase.",
    "Error E-4021 means the payment gateway timed out; retry.",
    "Upgrade or downgrade your plan at any time from Settings.",
]

# --- keyword side: BM25 over tokenised chunks ---
tokenised = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenised)

# --- vector side: embed all chunks once ---
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def rrf_fuse(ranked_lists, k=60, top_n=5):
    """Fuse multiple ranked lists of chunk-indices via Reciprocal Rank Fusion.
    Scores by position, so the two methods' incomparable scores never meet."""
    scores = {}
    for ranking in ranked_lists:
        for rank, idx in enumerate(ranking):  # rank starts at 0
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:top_n]

def hybrid_search(query, top_n=3):
    # keyword ranking: BM25 scores → indices sorted high to low
    bm25_scores = bm25.get_scores(query.lower().split())
    kw_ranking = list(np.argsort(-bm25_scores))
    # vector ranking: cosine sim → indices sorted high to low
    q = model.encode(query, normalize_embeddings=True)
    vec_scores = chunk_vecs @ q
    vec_ranking = list(np.argsort(-vec_scores))
    fused = rrf_fuse([kw_ranking, vec_ranking], top_n=top_n)
    return [chunks[i] for i in fused]

print("Q: how do I stop being billed")
for c in hybrid_search("how do I stop being billed"): print(" ", c)
print("\nQ: error E-4021")
for c in hybrid_search("error E-4021"): print(" ", c)
```
```
Q: how do I stop being billed
  To cancel your subscription, open Account then Billing.
  Refunds are issued within 30 days of purchase.
  Upgrade or downgrade your plan at any time from Settings.

Q: error E-4021
  Error E-4021 means the payment gateway timed out; retry.
  To cancel your subscription, open Account then Billing.
  Refunds are issued within 30 days of purchase.
```
Look at what hybrid bought you across those two queries. The first — a paraphrase with no shared keywords — was carried by the vector side, which understood "stop being billed" means cancellation. The second — an exact error code — was carried by the BM25 side, which matched "E-4021" precisely where vector search would have drifted toward other error chunks. One retriever, both strengths. Neither query type is sacrificed.
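One detail in the listing above is worth checking on your own data. When no query token appears in any chunk, as with "how do I stop being billed", every BM25 score is zero, yet sorting those ties still produces a full ranking in an arbitrary order, and RRF treats it as evidence. A possible refinement, sketched here in plain Python with made-up scores and a hypothetical helper `ranking_from_scores`, is to let the keyword side rank only the chunks it actually matched:

```python
def ranking_from_scores(scores, min_score=0.0):
    """Indices sorted by descending score, keeping only scores above min_score."""
    hits = [i for i, s in enumerate(scores) if s > min_score]
    return sorted(hits, key=lambda i: scores[i], reverse=True)

def rrf_fuse(ranked_lists, k=60, top_n=5):
    """Reciprocal Rank Fusion as in the listing above; lists may differ in length."""
    scores = {}
    for ranking in ranked_lists:
        for rank, idx in enumerate(ranking):  # rank starts at 0
            scores[idx] = scores.get(idx, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Made-up scores for four chunks: the keyword side matched nothing.
bm25_scores = [0.0, 0.0, 0.0, 0.0]
vec_scores = [0.31, 0.62, 0.18, 0.44]

vec_ranking = ranking_from_scores(vec_scores, min_score=float("-inf"))
kw_filtered = ranking_from_scores(bm25_scores)   # [] : no keyword evidence
kw_unfiltered = [0, 1, 2, 3]                     # tie order from sorting zeros

print(rrf_fuse([kw_filtered, vec_ranking], top_n=3))    # [1, 3, 0]
print(rrf_fuse([kw_unfiltered, vec_ranking], top_n=3))  # [1, 0, 3]
```

In the unfiltered case chunk 0 overtakes chunk 3 only because it happens to come first among the ties. Whether that matters for your corpus is something to measure rather than assume.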
Here is the shape of results you'll see comparing four retrieval stacks on the same query set — a mix of paraphrase queries and exact-token queries, which is what real traffic looks like. The numbers are illustrative of the pattern; your corpus will shift them, but the ordering is consistent and the lesson is durable.
| Stack | Recall@5 | Added latency | Added complexity |
|---|---|---|---|
| Vector only | 0.79 | baseline | baseline |
| BM25 only | 0.74 | very low | low |
| Hybrid (vector + BM25 + RRF) | 0.87 | low | moderate |
| Hybrid + reranking | 0.92 | moderate | higher |
Read this honestly. Vector alone and BM25 alone are close (0.79 and 0.74), each winning on different query types and averaging out to similar overall recall. Hybrid jumps well above either: the eight-point gain over vector-only (0.79 to 0.87) is the complementary-strengths effect made real. Reranking (the subject of the next chapter) adds another five points on top. The progression vector → hybrid → hybrid+rerank is the standard path from a naive system to a strong one, and each step's cost is visible in the table so you can decide how far to walk it.
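If you want to produce the Recall@5 column for your own stacks, a minimal sketch follows. It assumes each eval question is labelled with the ids of its relevant chunks, and it uses the common definition of recall@k as the share of those chunks found in the top k. The function names and the toy results are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Share of the relevant chunk ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each question needs at least one relevant chunk")
    found = relevant & set(retrieved_ids[:k])
    return len(found) / len(relevant)

def mean_recall_at_k(results, k=5):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per question."""
    return sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)

# Hypothetical results for four questions (chunk ids are made up).
results = [
    ([3, 7, 1, 9, 4, 2], [7]),     # relevant chunk inside the top 5: hit
    ([5, 8, 0, 6, 2, 7], [7]),     # relevant chunk ranked 6th: miss
    ([1, 2, 3, 4, 5], [2, 9]),     # one of two relevant chunks found
    ([4, 0, 6], [0]),              # hit
]
print(mean_recall_at_k(results))  # 0.625
```

Run the same function over the output of each stack on the same questions, and the differences between the resulting numbers are the gains the table describes.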
My take. Hybrid search is the highest return-on-effort upgrade in the whole retrieval stack. It's a clear quality jump for moderate added complexity, and unlike many improvements it helps a broad range of queries rather than a narrow slice. If your naive vector system is underperforming and you can only do one thing this week, add BM25 and fuse with RRF. Reranking is the next step, not the first.
Go through your fifty-question eval set from Chapter 04 and mark which questions hinge on an exact token — a code, a name, a precise reference. Those are the questions pure vector search will tend to miss and where hybrid will help most. The proportion tells you how much hybrid is worth for your traffic.
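A rough first pass at that marking can be automated. The regex below is a hypothetical heuristic, not a rule from this chapter: it flags questions containing a token with a digit (E-4021, v2.3, 30) or a double-quoted phrase. Names and rare jargon will slip through, so treat the result as a starting point for a manual review.

```python
import re

# Hypothetical heuristic: a code-like token containing a digit, or a quoted phrase.
EXACT_TOKEN = re.compile(r'\b[A-Za-z]*-?\d[\w.-]*|"[^"]+"')

def hinges_on_exact_token(question):
    """True if the question contains something that looks like an exact token."""
    return EXACT_TOKEN.search(question) is not None

def exact_token_share(questions):
    """Return the flagged proportion and the flagged questions themselves."""
    flagged = [q for q in questions if hinges_on_exact_token(q)]
    return len(flagged) / len(questions), flagged

# Toy questions; replace with your own eval set.
questions = [
    "how do I stop being billed",
    "error E-4021",
    "how do I get my money back",
    'what does "gateway timed out" mean',
]
share, flagged = exact_token_share(questions)
print(share)    # 0.5
print(flagged)
```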
Take the code above, drop in fifty of your real chunks, and run both a paraphrase query and an exact-token query. Watch hybrid handle both where each single method would handle only one. Then compare hybrid's results against vector-only on your full eval set and measure the recall gain.
Change k in the RRF function from 60 to 1, then to 1000, and observe how the fused ranking shifts. Small k sharply favours top-ranked items; large k flattens the contribution of rank. Understanding this knob turns RRF from a magic incantation into a tool you control.
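Here is a small, self-contained version of that experiment on toy data. The item labels are made up: "P" is first in one list and tenth in the other, while "Q" is third in both.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over lists of item labels, using 1-based positions."""
    scores = {}
    for ranking in ranked_lists:
        for position, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)

list_a = ["P", "a2", "Q", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
list_b = ["b1", "b2", "Q", "b4", "b5", "b6", "b7", "b8", "b9", "P"]

# Which item ends up on top for each value of k?
winners = {k: rrf_fuse([list_a, list_b], k=k)[0] for k in (1, 60, 1000)}
print(winners)  # {1: 'P', 60: 'Q', 1000: 'Q'}
```

At k = 1 a single first place dominates, so "P" wins. At k = 60 and above, the consistently placed "Q" wins, because large k makes the fused score behave almost like a plain sum of ranks.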
Next chapter: Reranking — the second-stage detail. Retrieval gets you a good candidate set fast; reranking re-orders that set with a slower, sharper model to put the truly best chunks on top. We'll see exactly how much it adds, and what it costs in latency.