RAG: A Field Manual for Building LLM Systems That Use Your Data

Retrieval algorithms — vector, lexical, hybrid

Semantic search feels like magic when you first see it — "how do I stop being billed" finding the chunk about cancelling a subscription, no shared words required. So it's tempting to conclude that vector search is simply better than the old keyword approach and be done with it. That conclusion is wrong, and the way it's wrong will cost you. Vector search has a blind spot exactly where keywords are strongest: exact terms, product codes, names, error messages, rare jargon. The strongest retrieval in 2026 is not vector or keyword. It's both, fused.

This chapter explains why each method fails where the other succeeds, how to combine them so the combination beats either alone, and — with numbers on the same set of queries — how much that combination actually buys you.

What you'll take away from this chapter

The specific queries where pure vector search quietly fails
How keyword search (BM25) works, and why it's far from obsolete
How to fuse two rankings into one with Reciprocal Rank Fusion — the simple, robust default
A measured comparison of four retrieval stacks on the same queries
When hybrid is worth the extra moving parts, and when pure vector is enough

Where vector search quietly fails

Vector search matches on meaning, which is exactly what you want for "how do I get my money back" → the refund chunk. But meaning-matching is a weakness when the user's intent is tied to an exact string that carries little semantic content of its own. Consider these queries:

"error E-4021": the string "E-4021" carries no meaning of its own; the exact match is everything. A vector model may place it near other error codes and miss the one that matters.
"the Helvetica Neue licence": a specific product name. Vector search might helpfully return chunks about "fonts" or "typography licensing" and bury the one exact match.
"section 7.3.2": a precise reference. Semantic similarity alone is unlikely to separate it from 7.3.1.
"Dr. Aanya Krishnamurthy": a rare name. Embedding models tend to represent rare tokens weakly, so the exact-match chunk can sink.

The pattern: when the right answer hinges on a rare or exact token rather than a concept, semantic similarity dilutes it. Keyword search has the opposite profile — it nails exact tokens and is blind to meaning. Each method's strength is the other's weakness, which is the whole argument for combining them.

The two methods are near-mirror images. Each excels exactly where the other struggles. Combining them isn't greedy — it's the natural response to complementary strengths.

How keyword search works — BM25, briefly

The workhorse of keyword search is an algorithm called BM25. You don't need its formula, but its three intuitions are worth holding because they explain its behaviour:

Rare words count more. A query word that appears in few documents (like "E-4021") is a strong signal; a word in almost every document (like "the") is nearly worthless. BM25 weights matches by rarity.
Repetition has diminishing returns. A document mentioning "refund" ten times is more relevant than one mentioning it once — but not ten times more. BM25 saturates, so keyword stuffing doesn't dominate.
Shorter matching documents get a slight edge. BM25 normalises by document length relative to the corpus average, because a match in a focused short chunk usually means more than the same match buried in a long one.
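The three intuitions above map onto three terms in the scoring function. The following is a minimal pure-Python sketch, assuming the common Okapi form with a smoothed IDF that stays positive and the frequently used defaults k1=1.5 and b=0.75. The function name `bm25_score` and the toy corpus are illustrative assumptions, and libraries differ in the exact IDF variant they use.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a tokenised query.
    corpus is the list of all tokenised documents (for rarity and average length)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        df = sum(1 for d in corpus if term in d)   # documents containing the term
        if df == 0 or tf[term] == 0:
            continue
        # 1. Rare words count more: IDF grows as df shrinks.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # 3. Length normalisation: longer-than-average documents are penalised.
        norm = 1 - b + b * len(doc_tokens) / avg_len
        # 2. Saturation: the frequency factor approaches (k1 + 1) as tf grows.
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
    return score

corpus = [
    ["refund"] + ["filler"] * 9,   # mentions "refund" once
    ["refund"] * 10,               # mentions it ten times, same length
    ["shipping"] * 10,
]
once = bm25_score(["refund"], corpus[0], corpus)
ten_times = bm25_score(["refund"], corpus[1], corpus)
print(once < ten_times < 10 * once)   # True: more relevant, but not ten times more
```

Raising k1 delays saturation, and setting b to 0 switches length normalisation off. Both are worth knowing about, but the defaults are a reasonable place to start.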

BM25 is decades old, runs anywhere, needs no model and no GPU, and is extremely fast. It is not a legacy curiosity you tolerate; it is a genuinely strong retriever for the exact-token queries vector search fumbles. Treat it as a peer, not a fallback.

Fusing two rankings — Reciprocal Rank Fusion

So you run both retrievers and get two ranked lists. How do you merge them into one? The scores aren't comparable — a cosine similarity of 0.8 and a BM25 score of 14.2 live on different scales, and normalising them is fiddly and fragile. The robust answer is to ignore the scores entirely and fuse on rank instead, using Reciprocal Rank Fusion (RRF).

RRF is beautifully simple: a chunk's fused score is the sum, across both lists, of 1 / (k + rank), where rank is the chunk's position in that list, counting from 1, and k is a smoothing constant (60 is the commonly used default). A chunk ranked first in both lists scores highest. A chunk ranked first in one list and absent from the other collects only one term, yet still scores respectably. Because RRF uses position, not raw score, it sidesteps the scale-mismatch problem completely.

RRF rewards chunks that multiple methods consider relevant. A chunk that neither list ranks first can still come out on top of the fused ranking when both lists place it near the top.
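To make the arithmetic concrete, here is a small worked example. The chunk labels c1 to c5 and both rankings are invented for illustration; only the 1 / (k + rank) formula comes from the text.

```python
def rrf_scores(ranked_lists, k=60):
    """Return {item: fused score}; each list contributes 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return scores

keyword_list = ["c1", "c2", "c3"]   # hypothetical BM25 ranking
vector_list = ["c4", "c2", "c5"]    # hypothetical vector ranking

scores = rrf_scores([keyword_list, vector_list])
fused = sorted(scores, key=scores.get, reverse=True)
for item in fused:
    print(item, round(scores[item], 5))
```

Here c2 is second in both lists and first in neither, yet it tops the fused ranking with 2/62 ≈ 0.0323, roughly double the 1/61 ≈ 0.0164 that c1 and c4 each earn from their single first place.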

Hybrid retrieval in code

Here is the whole hybrid retriever: BM25 over the chunks, vector search over the same chunks, then RRF to fuse the two rankings into one. It's less code than people expect.

# pip install rank-bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "To cancel your subscription, open Account then Billing.",
    "Refunds are issued within 30 days of purchase.",
    "Error E-4021 means the payment gateway timed out; retry.",
    "Upgrade or downgrade your plan at any time from Settings.",
]

# --- keyword side: BM25 over tokenised chunks ---
tokenised = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenised)

# --- vector side: embed all chunks once ---
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def rrf_fuse(ranked_lists, k=60, top_n=5):
    """Fuse multiple ranked lists of chunk-indices via Reciprocal Rank Fusion.
    Scores by position, so the two methods' incomparable scores never meet."""
    scores = {}
    for ranking in ranked_lists:
        for rank, idx in enumerate(ranking):       # rank starts at 0
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:top_n]

def hybrid_search(query, top_n=3):
    # keyword ranking: BM25 scores → indices sorted high to low
    bm25_scores = bm25.get_scores(query.lower().split())
    kw_ranking = list(np.argsort(-bm25_scores))

    # vector ranking: cosine sim → indices sorted high to low
    q = model.encode(query, normalize_embeddings=True)
    vec_scores = chunk_vecs @ q
    vec_ranking = list(np.argsort(-vec_scores))

    fused = rrf_fuse([kw_ranking, vec_ranking], top_n=top_n)
    return [chunks[i] for i in fused]

print("Q: how do I stop being billed")
for c in hybrid_search("how do I stop being billed"): print("  ", c)
print("\nQ: error E-4021")
for c in hybrid_search("error E-4021"): print("  ", c)

Q: how do I stop being billed
   To cancel your subscription, open Account then Billing.
   Refunds are issued within 30 days of purchase.
   Upgrade or downgrade your plan at any time from Settings.

Q: error E-4021
   Error E-4021 means the payment gateway timed out; retry.
   To cancel your subscription, open Account then Billing.
   Refunds are issued within 30 days of purchase.

Look at what hybrid bought you across those two queries. The first — a paraphrase with no shared keywords — was carried by the vector side, which understood "stop being billed" means cancellation. The second — an exact error code — was carried by the BM25 side, which matched "E-4021" precisely where vector search would have drifted toward other error chunks. One retriever, both strengths. Neither query type is sacrificed.
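One caveat about the code above is worth checking in your own runs. When a query shares no tokens with any chunk, every BM25 score is zero, yet `np.argsort` still returns all the chunk indices in some order, so RRF hands out keyword-side credit based on nothing more than tie-breaking. A possible refinement, offered as an assumption rather than part of the original retriever, is to keep only chunks with a positive BM25 score in the keyword ranking. The helper name `positive_ranking` is hypothetical.

```python
import numpy as np

def positive_ranking(scores):
    """Indices sorted by score, high to low, keeping only scores above zero.
    Chunks the keyword side has no evidence for are left out of its ranking."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if scores[i] > 0]

print(positive_ranking([0.0, 0.0, 1.7, 0.0]))   # only index 2 has keyword evidence
print(positive_ranking([0.0, 0.0, 0.0, 0.0]))   # no overlap: the keyword side abstains
```

In `hybrid_search` this would take the place of the `np.argsort(-bm25_scores)` line. `rrf_fuse` already copes with lists of different lengths, since a chunk missing from a list simply receives no contribution from it.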

Four stacks, measured

Here is the shape of results you'll see comparing four retrieval stacks on the same query set — a mix of paraphrase queries and exact-token queries, which is what real traffic looks like. The numbers are illustrative of the pattern; your corpus will shift them, but the ordering is consistent and the lesson is durable.

Stack	Recall@5	Added latency	Added complexity
Vector only	0.79	baseline	baseline
BM25 only	0.74	very low	low
Hybrid (vector + BM25 + RRF)	0.87	low	moderate
Hybrid + reranking	0.92	moderate	higher

Read this honestly. Vector alone and BM25 alone are close: each wins on different query types, and they average out to similar recall. Hybrid jumps well above either; the eight-point gain over vector-only (0.79 to 0.87) is the complementary-strengths effect made real. Reranking (the subject of the next chapter) adds another five points on top (0.87 to 0.92). The progression vector → hybrid → hybrid+rerank is the standard path from a naive system to a strong one, and each step's cost is visible in the table so you can decide how far to walk it.
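Recall@5 in the table above is straightforward to compute on your own eval set. The sketch below assumes each eval question is paired with the set of chunk ids that answer it, and that recall is averaged per query. The data layout and the names `recall_at_k` and `mean_recall_at_k` are assumptions for illustration; some teams report a hit rate instead, counting a query as a success if any relevant chunk appears.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant chunk id")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mean_recall_at_k(results, k=5):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per eval query."""
    return sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)

# Hypothetical eval results for two queries.
results = [
    ([3, 7, 1, 9, 4, 2], {7, 2}),   # 7 is in the top 5, 2 is not -> 0.5
    ([5, 0, 8], {5}),               # the single relevant chunk is found -> 1.0
]
print(mean_recall_at_k(results, k=5))   # 0.75
```

Running the same function over the output of each stack, on the same queries, gives you your own version of the table.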

My take. Hybrid search is the highest return-on-effort upgrade in the whole retrieval stack. It's a clear quality jump for moderate added complexity, and unlike many improvements it helps a broad range of queries rather than a narrow slice. If your naive vector system is underperforming and you can only do one thing this week, add BM25 and fuse with RRF. Reranking is the next step, not the first.

When this fails

Normalising scores instead of fusing ranks. Trying to put cosine similarity and BM25 scores on the same scale is fragile and breaks when score distributions shift. RRF avoids the problem entirely by using rank. Reach for score-normalisation only if you have measured a reason RRF isn't enough.
Assuming hybrid always wins. On a corpus of pure natural-language prose with no codes, names, or jargon, vector-only may match hybrid and save you the complexity. Measure on your queries before adding the keyword side — sometimes you genuinely don't need it.
Tokenising BM25 carelessly. BM25 depends on how you split text into tokens. Splitting "E-4021" into "e" and "4021", or lowercasing away a meaningful case distinction, throws away the exact-match power that justified adding BM25. Mind the tokeniser.
Forgetting BM25 needs its own index. Hybrid means maintaining two indexes over the same chunks — vector and keyword — kept in sync. When you add or delete a chunk, both must update. A drifted keyword index silently degrades half your retrieval.
Over-weighting one side. Some implementations let you weight vector vs keyword contributions. Cranking it to mostly-vector quietly recreates the blind spot you added BM25 to fix. Start balanced; only re-weight with evidence from your eval set.
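For the tokenising point above, a small regex tokeniser illustrates one way to keep codes and section references intact. The pattern is an assumption tuned to the examples in this chapter, not a general-purpose tokeniser; adapt it to the identifiers in your own corpus.

```python
import re

# Runs of letters/digits, optionally joined by internal hyphens or dots,
# so "E-4021" and "7.3.2" survive as single tokens while trailing punctuation is dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")

def tokenise(text, lowercase=True):
    """Split text into tokens; pass lowercase=False if case carries meaning."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenise("Error E-4021 means the payment gateway timed out; retry."))
print(tokenise("See section 7.3.2, not 7.3.1."))
```

Unlike the plain `.split()` used in the hybrid retriever code, this also strips trailing punctuation, so "Billing." in a chunk and "billing" in a query become the same token. Whatever tokeniser you choose, apply the same one to chunks and queries.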

Practice — before you read the next chapter

Find your vector blind spots

Go through your fifty-question eval set from Chapter 04 and mark which questions hinge on an exact token — a code, a name, a precise reference. Those are the questions pure vector search will tend to miss and where hybrid will help most. The proportion tells you how much hybrid is worth for your traffic.
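A rough first pass at this marking can be automated. The heuristic below flags questions containing something that looks like a code or a numbered reference. The patterns and example questions are assumptions for illustration and will miss names and jargon, so treat the output as a starting point for manual review rather than a classifier.

```python
import re

# Hypothetical patterns: letter-prefixed codes like "E-4021"
# and dotted references like "7.3.2" (this will also catch version numbers and prices).
EXACT_TOKEN_PATTERNS = [
    re.compile(r"\b[A-Za-z]{1,5}-\d{2,}\b"),
    re.compile(r"\b\d+(?:\.\d+)+\b"),
]

def looks_exact(question):
    """True if the question contains a code-like or reference-like token."""
    return any(p.search(question) for p in EXACT_TOKEN_PATTERNS)

questions = [
    "how do I stop being billed",
    "what does error E-4021 mean",
    "what does section 7.3.2 require",
]
flagged = [q for q in questions if looks_exact(q)]
print(f"{len(flagged)} of {len(questions)} questions look exact-token")
```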

Run the hybrid retriever

Take the code above, drop in fifty of your real chunks, and run both a paraphrase query and an exact-token query. Watch hybrid handle both where each single method would handle only one. Then compare hybrid's results against vector-only on your full eval set and measure the recall gain.

Break RRF on purpose

Change k in the RRF function from 60 to 1, then to 1000, and observe how the fused ranking shifts. Small k sharply favours top-ranked items; large k flattens the contribution of rank. Understanding this knob turns RRF from a magic incantation into a tool you control.
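Before touching the retriever, the effect of k can be seen from the formula alone. The sketch below compares two hypothetical chunks: one ranked first in a single list and absent from the other, and one ranked fourth in both. The function name `rrf_contribution` and the chosen ranks are assumptions for illustration.

```python
def rrf_contribution(rank, k):
    """Contribution of a single list to a chunk's fused score (rank starts at 1)."""
    return 1.0 / (k + rank)

for k in (1, 60, 1000):
    # How much more a rank-1 hit is worth than a rank-10 hit in the same list.
    ratio = rrf_contribution(1, k) / rrf_contribution(10, k)
    # A chunk ranked 1st in one list only vs a chunk ranked 4th in both lists.
    solo_top = rrf_contribution(1, k)
    agreed_4th = 2 * rrf_contribution(4, k)
    winner = "solo top hit" if solo_top > agreed_4th else "agreed 4th"
    print(f"k={k:<5} rank1/rank10 ratio={ratio:.2f}  winner: {winner}")
```

With k=1 a first place is worth 5.5 times a tenth place and the solo top hit wins. With k=60 the ratio drops to about 1.15 and the chunk both lists agree on wins, and by k=1000 rank barely matters at all.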

Takeaways

Vector search matches meaning and fails on exact tokens — codes, names, references, rare jargon. Keyword search is the mirror image.
BM25 is a strong, fast, model-free retriever. Treat it as a peer to vector search, not a legacy fallback.
Fuse the two rankings with Reciprocal Rank Fusion. It scores by position, sidestepping the incompatible-scores problem cleanly.
Hybrid reliably beats either method alone — typically a substantial recall gain for moderate added complexity. It's the highest return-on-effort retrieval upgrade.
Maintain both indexes in sync, mind your BM25 tokeniser, and measure before assuming hybrid is needed — pure-prose corpora sometimes don't need it.

Next chapter: Reranking — the second-stage detail. Retrieval gets you a good candidate set fast; reranking re-orders that set with a slower, sharper model to put the truly best chunks on top. We'll see exactly how much it adds, and what it costs in latency.
