Your chunks are clean and well-cut. Now they have to become vectors, and the model that does that conversion — the embedding model — is the lens through which your entire system sees meaning. Pick a good lens and "cancel my subscription" finds the chunk about ending your plan even though they share no words. Pick a poor one and the same query drifts to whatever happens to use the word "cancel." The embedding model quietly sets the ceiling on retrieval quality, and most teams choose it the worst possible way: by copying whatever was at the top of a leaderboard the week they started.
This chapter is about choosing deliberately, proving the choice on your own data, and — the part nobody writes about — surviving the day you have to switch models without taking the system down. That last part will matter to you eventually, and you will be very glad to have read it before it does.
Embedding models vary along a handful of axes that matter far more than a single benchmark number. Knowing them turns "which model is best?" — an unanswerable question — into "which model fits my constraints?", which you can actually decide.
Here is the order I'd actually weigh these in for a typical project, and the reasoning behind it.
| Priority | Criterion | Why it ranks here |
|---|---|---|
| 1 | Domain & language fit | A model that understands your domain's words beats a higher-ranked general model every time. This is the axis that moves retrieval quality most. |
| 2 | Hosting & privacy | If your data can't leave your network, half the options vanish regardless of score. Decide this early; it eliminates whole branches. |
| 3 | Cost at your volume | Embedding fifty million chunks, then re-embedding on every change, adds up. API per-token pricing vs the fixed cost of self-hosting flips depending on scale. |
| 4 | Dimension | Affects storage and search speed. Higher isn't better past a point; it's just more expensive. Pick the smallest that holds your quality. |
| 5 | Leaderboard score | A useful tiebreaker and a sanity check — not a decision. Benchmarks measure performance on their data, not yours. |
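The cost row is easier to reason about with the arithmetic written down. The sketch below is only an illustration of how the comparison flips with scale: the function name `break_even_tokens`, the 300-tokens-per-chunk figure, and every price and throughput number are hypothetical placeholders, not quotes from any provider. Substitute your own.

```python
# Break-even sketch: per-token API pricing vs. a self-hosted machine billed by the hour.
# Every number used here is a hypothetical placeholder, not a real price or throughput.
from typing import Optional


def break_even_tokens(api_price_per_million: float,
                      host_price_per_hour: float,
                      host_tokens_per_hour: float,
                      host_setup_cost: float) -> Optional[float]:
    """Tokens you must embed before self-hosting becomes cheaper than the API.

    Returns None when self-hosting costs at least as much per token as the API,
    because then the fixed setup cost is never recovered.
    """
    api_per_token = api_price_per_million / 1_000_000
    host_per_token = host_price_per_hour / host_tokens_per_hour
    if host_per_token >= api_per_token:
        return None
    return host_setup_cost / (api_per_token - host_per_token)


# Fifty million chunks at an assumed 300 tokens each = one full embedding pass.
tokens_per_pass = 50_000_000 * 300

threshold = break_even_tokens(
    api_price_per_million=0.10,        # placeholder
    host_price_per_hour=2.00,          # placeholder
    host_tokens_per_hour=100_000_000,  # placeholder
    host_setup_cost=2_000.0,           # placeholder: engineering time, tooling
)
print(f"self-hosting pays off after about {threshold / tokens_per_pass:.1f} full embedding passes")
```

The point of the exercise is the shape, not the numbers: a fixed setup cost is recovered only if you embed enough tokens, and every re-embedding pass moves you closer to that threshold.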
My take. The MTEB leaderboard is genuinely useful for one thing: ruling out models that are simply weak. A model in the bottom third is probably not worth your time. But among the top, say, fifteen, the leaderboard ranking tells you almost nothing about how a model will do on your contracts, your tickets, your code. The rank differences up there are smaller than the noise introduced by your own domain. Use the leaderboard to make a shortlist, then ignore it and measure.
You do not need a research-grade benchmark to choose a model. You need fifty real questions from your domain, each paired with the chunk that actually answers it, and a loop that measures how often each candidate model retrieves that chunk. Here is the whole thing.
```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Your domain eval set: (question, id of the chunk that answers it)
# Fifty of these, hand-labelled, is enough to choose a model.
eval_set = [
    ("how do I cancel my subscription", "chunk_042"),
    ("what is the refund window", "chunk_017"),
    ("can I change my plan mid-cycle", "chunk_088"),
    # ... 47 more
]

# Your chunks: {id: text}
chunks = {
    "chunk_042": "To end your subscription, open Account ...",
    "chunk_017": "Refunds are available within 30 days ...",
    "chunk_088": "You may upgrade or downgrade at any time ...",
    # ... the rest
}


def recall_at_k(model_name, k=5):
    """What fraction of questions retrieve their answer chunk in the top k?"""
    model = SentenceTransformer(model_name)
    ids = list(chunks.keys())
    # embed every chunk once, then normalise so dot product == cosine
    chunk_vecs = model.encode([chunks[i] for i in ids],
                              normalize_embeddings=True)
    hits = 0
    for question, answer_id in eval_set:
        q = model.encode(question, normalize_embeddings=True)
        sims = chunk_vecs @ q  # cosine similarity to every chunk
        top = [ids[i] for i in np.argsort(-sims)[:k]]
        if answer_id in top:
            hits += 1
    return hits / len(eval_set)


for name in ["all-MiniLM-L6-v2",         # tiny, 384-dim, fast
             "BAAI/bge-base-en-v1.5",    # mid, 768-dim
             "BAAI/bge-large-en-v1.5"]:  # large, 1024-dim
    print(f"{name:32s} recall@5 = {recall_at_k(name):.2f}")
```

```text
all-MiniLM-L6-v2                 recall@5 = 0.74
BAAI/bge-base-en-v1.5            recall@5 = 0.86
BAAI/bge-large-en-v1.5           recall@5 = 0.88
```
Read that result the way an engineer should, not the way a leaderboard wants you to. The large model wins, but only by two points over the base model, and on a fifty-question set two points is a single question. For that one question you pay for roughly three times the parameters (about 335M versus about 110M), which makes embedding slower, and for 1024-dimension vectors that take a third more storage than 768-dimension ones. The base model is almost certainly the right production choice: nearly the same quality at three-quarters of the vector storage and a fraction of the embedding cost. The tiny model trails the base model by twelve points, which on your domain is the difference between "usually finds it" and "often doesn't." That decision (base, not large, not tiny) came from your fifty questions, not from anyone's benchmark. This is the single most valuable afternoon you can spend in a RAG project.
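One way to check how much weight a small gap deserves is to put an interval around each recall number. The sketch below is an illustration under assumptions, not part of the harness above: it expects a per-question list of hits (1 if the answer chunk was in the top k, else 0), which you would get by having `recall_at_k` return the individual hits rather than their mean. The function name `bootstrap_interval`, the 95% level, and the resample count are my choices.

```python
import random


def bootstrap_interval(hits, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 hits."""
    rng = random.Random(seed)
    n = len(hits)
    means = []
    for _ in range(n_resamples):
        # resample the questions with replacement and recompute recall
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# 43 hits out of 50 questions corresponds to recall@5 = 0.86
base_hits = [1] * 43 + [0] * 7
low, high = bootstrap_interval(base_hits)
print(f"recall@5 = {sum(base_hits) / len(base_hits):.2f}, "
      f"95% interval roughly [{low:.2f}, {high:.2f}]")
```

On fifty questions the interval spans several points on either side of the estimate, so a two-point gap between two models sits well inside it, while a twelve-point gap is much harder to explain away. If you need to separate two close candidates, the remedy is more labelled questions, not more decimal places.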
Here is the scenario nobody prepares you for. Your system has been live for six months on one embedding model. A clearly better model comes out. You want to switch. But your entire index — fifty million vectors — was produced by the old model, and vectors from two different models are not comparable. A query embedded with the new model cannot be meaningfully compared against chunks embedded with the old one; the geometry is different. You cannot mix them. You must re-embed everything, and you must do it without taking the system down or serving wrong results during the switch.
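The playbook in words:
1. Build beside. Create a second index next to the live one and re-embed every chunk into it with the new model. The live index keeps serving, untouched.
2. Shadow-test. Run real queries and your eval set against both indexes and compare the results, while users still see only the old index.
3. Cut over with a flag. Switch reads to the new index with a single flag, so queries and chunks move to the new model together and the switch can be reversed just as quickly.
4. Keep the old index. Hold on to it for an agreed period as your way back, and delete it only once the new index has proven itself.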
The whole reason this matters: a team that re-embeds in place discovers, halfway through, that the system is serving a mix of old and new vectors and every result is subtly wrong, with no way back. The build-beside pattern makes the migration boring, which is exactly what you want a migration to be. Capture model name and version in your chunk metadata (per Chapter 02) so you always know which vectors came from which model.
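The build-beside pattern can be sketched in a few dozen lines. This is a toy, in-memory illustration under stated assumptions, not a real vector database client: `Index`, `Router`, `embed_old`, and `embed_new` are hypothetical names, and the two "models" are letter-counting functions chosen only so the example runs without downloads. The property worth noticing is that each index carries its own model, so a query is always embedded by the same model that produced the chunks it is compared against.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Index:
    """Toy stand-in for a vector index, tagged with the model that built it."""
    model_name: str
    embed: Callable[[str], List[float]]
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, text: str) -> None:
        self.vectors[chunk_id] = self.embed(text)

    def search(self, question: str, k: int = 5) -> List[str]:
        q = self.embed(question)  # query and chunks always share one model

        def score(item):
            return sum(a * b for a, b in zip(item[1], q))

        ranked = sorted(self.vectors.items(), key=score, reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]


class Router:
    """Serves from exactly one index; the new one is built beside it."""

    def __init__(self, live: Index):
        self.live = live
        self.candidate: Optional[Index] = None
        self.previous: Optional[Index] = None

    def search(self, question: str, k: int = 5) -> List[str]:
        return self.live.search(question, k)  # users only ever see the live index

    def shadow_overlap(self, question: str, k: int = 5) -> float:
        """Shadow-test: query both indexes, report the share of live results also in the candidate."""
        live_ids = set(self.live.search(question, k))
        cand_ids = set(self.candidate.search(question, k))
        return len(live_ids & cand_ids) / max(len(live_ids), 1)

    def cut_over(self) -> None:
        """The cutover flag: one atomic swap, old index kept for rollback."""
        self.previous, self.live, self.candidate = self.live, self.candidate, None

    def roll_back(self) -> None:
        self.live, self.previous = self.previous, None


# Two hypothetical "models" with different dimensions, so mixing them is visibly wrong.
def embed_old(text: str) -> List[float]:
    return [float(text.count("a")), float(text.count("e"))]


def embed_new(text: str) -> List[float]:
    return [float(text.count("a")), float(text.count("e")), float(len(text))]


docs = {"c1": "cancel a plan", "c2": "refund window"}

old = Index("old-model-v1", embed_old)
new = Index("new-model-v2", embed_new)
for chunk_id, text in docs.items():
    old.add(chunk_id, text)  # the live index, already in production
    new.add(chunk_id, text)  # built beside it; nothing is overwritten

router = Router(old)
router.candidate = new
overlap = router.shadow_overlap("cancel plan", k=1)
router.cut_over()
print(router.live.model_name, overlap)
```

In a real stack the swap would more likely be an alias or a config flag in your vector database than a Python attribute, and overlap with the old index is only one shadow-test signal; recall on your eval set is the one that decides.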
A full-precision embedding stores each dimension as a 32-bit floating-point number. Quantization stores each dimension in fewer bits, either 8 bits or even 1 bit (binary quantization), shrinking your index dramatically. A 1024-dimension float vector is 4 KB; the same vector quantized to 8 bits is 1 KB; binary is 128 bytes. Across fifty million vectors that is the difference between an index that fits in memory and one that doesn't.
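The arithmetic scales up in a straightforward way. The helper below is a small sketch (the name `index_size_gb` is mine) that counts raw vector bytes only; a real index adds overhead for graph links, ids, and metadata, so treat the results as lower bounds.

```python
def index_size_gb(n_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Raw vector storage in GB (10^9 bytes), ignoring any index overhead."""
    return n_vectors * dims * bits_per_dim / 8 / 1e9


for label, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    size = index_size_gb(50_000_000, 1024, bits)
    print(f"{label:8s} {size:7.1f} GB")
```

For fifty million 1024-dimension vectors this comes to about 205 GB at full precision, about 51 GB at 8 bits, and about 6.4 GB in binary.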
The trade-off is some accuracy loss, but it is usually smaller than people fear — often a point or two of recall for an 8-bit version, recoverable by re-ranking the top results with full-precision vectors. The honest guidance: at small scale, don't bother — full precision is simpler and the storage is trivial. At large scale, quantization is one of the highest-leverage cost savings available, and the accuracy cost is modest if you re-rank. We see the search-speed side of this in Chapter 05.
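Quantize-then-re-rank can be sketched in plain NumPy. This is a minimal illustration under assumptions: it uses random unit vectors as stand-ins for real embeddings, a single symmetric int8 scale for the whole matrix, and an oversampling factor of 4, all of which are my choices. It also dequantizes on the fly for readability; production systems typically do the first pass in integer arithmetic inside the vector database.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: random unit vectors instead of real embeddings (demo assumption).
vecs = rng.standard_normal((2000, 64)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# A query that is a slightly noisy copy of vector 123, so we know the right answer.
query = vecs[123] + 0.05 * rng.standard_normal(64).astype(np.float32)
query /= np.linalg.norm(query)


def quantize_int8(x):
    """Symmetric scalar quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale


def search_with_rerank(query, q_vecs, scale, full_vecs, k=5, oversample=4):
    # Pass 1: approximate scores from the 8-bit vectors, keep k * oversample candidates.
    approx = (q_vecs.astype(np.float32) * scale) @ query
    candidates = np.argsort(-approx)[: k * oversample]
    # Pass 2: exact scores for those candidates only, using full-precision vectors.
    exact = full_vecs[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]


q_vecs, scale = quantize_int8(vecs)
top = search_with_rerank(query, q_vecs, scale, vecs)
exact_top = np.argsort(-(vecs @ query))[:5]

print("quantized + re-rank:", top.tolist())
print("full precision:     ", exact_top.tolist())
```

The design choice is the oversampling: the cheap pass only has to get the right answers into a slightly larger candidate set, and the exact pass restores the ordering. The full-precision vectors can then live on disk, since only a handful are read per query.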
This is the most valuable artifact in your whole project, and you can start it today. Collect fifty real questions from your domain — from support logs, from colleagues, from your own head — and for each, find the chunk that actually answers it. This set will choose your embedding model, tune your chunking, and prove your improvements for the life of the system. Build it once; use it forever.
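Because this set is meant to outlive several rounds of re-chunking, it is worth checking it mechanically before each run. The sketch below is one possible check, not a required step: the function name `validate_eval_set` is hypothetical, and it assumes the `(question, chunk_id)` tuple format and `{id: text}` chunk dictionary used earlier in the chapter. If your chunk ids change when you re-chunk, a check like this is what tells you the labels have gone stale.

```python
def validate_eval_set(eval_set, chunks):
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    seen = set()
    for i, (question, answer_id) in enumerate(eval_set):
        if not question.strip():
            problems.append(f"row {i}: empty question")
        if answer_id not in chunks:
            problems.append(f"row {i}: unknown chunk id {answer_id!r}")
        if question in seen:
            problems.append(f"row {i}: duplicate question")
        seen.add(question)
    return problems


sample_chunks = {"chunk_042": "To end your subscription, open Account ..."}
sample_eval = [
    ("how do I cancel my subscription", "chunk_042"),
    ("what is the refund window", "chunk_999"),  # deliberately stale label
]
problems = validate_eval_set(sample_eval, sample_chunks)
print(problems)
```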
Take the code above and run three candidate models against your eval set. Look at the recall numbers and the model sizes together. Which model gives you the most quality per unit of cost? Notice whether the leaderboard ranking predicted your result — it often won't, and that surprise is the lesson.
Before you ever need it, write down the four migration steps for your specific stack: where the second index lives, how you shadow-test, what the cutover flag is, how long you keep the old index. A migration runbook written calmly today is worth ten times one written in a panic when a better model ships.
Next chapter: The vector index — databases and the geometry. Your vectors exist; now they need a home that can find nearest neighbours among millions in milliseconds. We'll tour the vector databases honestly and look at the geometry — HNSW and IVF — beneath them, without hand-waving.