RAG: A Field Manual for Building LLM Systems That Use Your Data

Embeddings — picking, evaluating, migrating

Your chunks are clean and well-cut. Now they have to become vectors, and the model that does that conversion — the embedding model — is the lens through which your entire system sees meaning. Pick a good lens and "cancel my subscription" finds the chunk about ending your plan even though they share no words. Pick a poor one and the same query drifts to whatever happens to use the word "cancel." The embedding model quietly sets the ceiling on retrieval quality, and most teams choose it the worst possible way: by copying whatever was at the top of a leaderboard the week they started.

This chapter is about choosing deliberately, proving the choice on your own data, and — the part nobody writes about — surviving the day you have to switch models without taking the system down. That last part will matter to you eventually, and you will be very glad to have read it before it does.

What you'll take away from this chapter

What actually differs between embedding models, beyond the leaderboard rank
The criteria that should drive your choice — and why "highest MTEB score" isn't one of them
How to evaluate a model on your domain in an afternoon, instead of trusting a benchmark built on someone else's data
The migration playbook: changing embedding models in a live system without downtime or silent corruption
When quantization saves you real money, and what it costs in accuracy

What actually differs between models

Embedding models vary along a handful of axes that matter far more than a single benchmark number. Knowing them turns "which model is best?" — an unanswerable question — into "which model fits my constraints?", which you can actually decide.

In the original figure, the blue axes are engineering constraints and the green axes decide quality. Most teams obsess over dimension (blue) and ignore domain fit (green), which is backwards. Domain fit is usually the axis that moves your numbers.

The criteria that should drive the choice

Here is the order I'd actually weigh these in for a typical project, and the reasoning behind it.

Priority	Criterion	Why it ranks here
1	Domain & language fit	A model that understands your domain's vocabulary usually beats a higher-ranked general model. This is the axis that moves retrieval quality most.
2	Hosting & privacy	If your data can't leave your network, half the options vanish regardless of score. Decide this early; it eliminates whole branches.
3	Cost at your volume	Embedding fifty million chunks, then re-embedding on every change, adds up. API per-token pricing vs the fixed cost of self-hosting flips depending on scale.
4	Dimension	Affects storage and search speed. Higher isn't better past a point; it's just more expensive. Pick the smallest that holds your quality.
5	Leaderboard score	A useful tiebreaker and a sanity check — not a decision. Benchmarks measure performance on their data, not yours.

My take. The MTEB leaderboard is genuinely useful for one thing: ruling out models that are simply weak. A model in the bottom third is probably not worth your time. But among the top, say, fifteen, the leaderboard ranking tells you almost nothing about how a model will do on your contracts, your tickets, your code. The rank differences up there are smaller than the noise introduced by your own domain. Use the leaderboard to make a shortlist, then ignore it and measure.

Evaluate on your domain in an afternoon

You do not need a research-grade benchmark to choose a model. You need fifty real questions from your domain, each paired with the chunk that actually answers it, and a loop that measures how often each candidate model retrieves that chunk. Here is the whole thing.

# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Your domain eval set: (question, id of the chunk that answers it)
# Fifty of these, hand-labelled, is enough to choose a model.
eval_set = [
    ("how do I cancel my subscription", "chunk_042"),
    ("what is the refund window", "chunk_017"),
    ("can I change my plan mid-cycle", "chunk_088"),
    # ... 47 more
]

# Your chunks: {id: text}
chunks = {"chunk_042": "To end your subscription, open Account ...",
          "chunk_017": "Refunds are available within 30 days ...",
          "chunk_088": "You may upgrade or downgrade at any time ..."}
          # ... the rest

def recall_at_k(model_name, k=5):
    """What fraction of questions retrieve their answer chunk in the top k?"""
    model = SentenceTransformer(model_name)
    ids = list(chunks.keys())
    # embed every chunk once, then normalise so dot product == cosine
    chunk_vecs = model.encode([chunks[i] for i in ids],
                              normalize_embeddings=True)
    hits = 0
    for question, answer_id in eval_set:
        q = model.encode(question, normalize_embeddings=True)
        sims = chunk_vecs @ q              # cosine similarity to every chunk
        top = [ids[i] for i in np.argsort(-sims)[:k]]
        if answer_id in top:
            hits += 1
    return hits / len(eval_set)

for name in ["all-MiniLM-L6-v2",        # tiny, 384-dim, fast
             "BAAI/bge-base-en-v1.5",    # mid, 768-dim
             "BAAI/bge-large-en-v1.5"]:  # large, 1024-dim
    print(f"{name:32s} recall@5 = {recall_at_k(name):.2f}")

all-MiniLM-L6-v2                 recall@5 = 0.74
BAAI/bge-base-en-v1.5            recall@5 = 0.86
BAAI/bge-large-en-v1.5           recall@5 = 0.88

Read that result the way an engineer should, not the way a leaderboard wants you to. The large model wins, but only by two points over the base model, and on fifty questions two points is a single question. For that margin it asks you to store 1024-dimension vectors instead of 768-dimension ones (a third more storage and search work per vector) and to run an encoder roughly three times the size. The base model is almost certainly the right production choice: nearly the same quality at three-quarters of the storage and a fraction of the embedding cost. The tiny model trails the base model by twelve points, which on your domain is the difference between "usually finds it" and "often doesn't." That decision (base, not large, not tiny) came from your fifty questions, not from anyone's benchmark. This is the single most valuable afternoon you can spend in a RAG project.
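
Fifty questions is a small sample, so it helps to see how wide the uncertainty on each recall figure is before reading much into a small gap. The sketch below is one way to do that, assuming you keep a 0/1 hit flag per question from the evaluation loop. The function name `bootstrap_recall_interval` and the choice of a percentile bootstrap are illustrative, not part of the chapter's code, and the hit lists are reconstructed from the recall figures above purely for demonstration.

```python
import random


def bootstrap_recall_interval(hits, n_resamples=2000, seed=0):
    """Return (point, low, high): recall plus a 95% percentile-bootstrap interval.

    `hits` holds one 0/1 flag per eval question (1 = answer chunk was in the top k).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(hits)
    point = sum(hits) / n
    resampled = []
    for _ in range(n_resamples):
        # Draw n questions with replacement and recompute recall on that sample.
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return point, low, high


# Hypothetical per-question outcomes consistent with the figures above:
# 43 of 50 hits (0.86) for the base model, 44 of 50 (0.88) for the large one.
base_hits = [1] * 43 + [0] * 7
large_hits = [1] * 44 + [0] * 6

for name, hits in [("base", base_hits), ("large", large_hits)]:
    point, low, high = bootstrap_recall_interval(hits)
    print(f"{name:6s} recall@5 = {point:.2f}  (95% interval {low:.2f} to {high:.2f})")
```

If the intervals for two models overlap heavily, treat them as tied on quality and let cost, hosting, and dimension decide.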

The migration playbook

Here is the scenario nobody prepares you for. Your system has been live for six months on one embedding model. A clearly better model comes out. You want to switch. But your entire index — fifty million vectors — was produced by the old model, and vectors from two different models are not comparable. A query embedded with the new model cannot be meaningfully compared against chunks embedded with the old one; the geometry is different. You cannot mix them. You must re-embed everything, and you must do it without taking the system down or serving wrong results during the switch.

Never re-embed in place. Build the new index alongside the old one, prove it's better on your golden set while users still hit the old one, then cut over — and keep the old index until you're sure, so rollback is one switch away.

The playbook in words:

Build beside, never in place. Stand up a second index with the new model. Re-embed your whole corpus into it as a background job. The old index keeps serving the entire time, so users notice nothing.
Shadow-test on your golden set. Before any user touches the new index, run the same evaluation questions through both. Confirm the new model actually wins on your data — sometimes the shiny new model is worse for your domain, and you find that out here, safely.
Cut over behind a flag. Flip queries to the new index with a config switch, ideally to a small percentage of traffic first. Watch your metrics.
Keep the old index for rollback. Don't delete it the moment you switch. Keep it for a week or two so that "undo" is a single flag flip, not a multi-day re-embedding job under pressure.
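
The cut-over and rollback steps above can be sketched as a small router. This is a minimal illustration under stated assumptions: `IndexRouter`, `ready_to_cut_over`, and the idea of bucketing by a hashed user id are hypothetical names and choices, and the two "indexes" are stand-in strings where a real system would hold index clients.

```python
import hashlib


def ready_to_cut_over(old_recall, new_recall, min_gain=0.0):
    """Shadow-test gate: only proceed if the new index is at least as good."""
    return new_recall - old_recall >= min_gain


class IndexRouter:
    """Routes each query to the old or new index based on a rollout percentage."""

    def __init__(self, old_index, new_index, new_traffic_percent=0):
        self.old_index = old_index
        self.new_index = new_index
        # This is the flag: 0 sends everyone to the old index, 100 to the new one.
        self.new_traffic_percent = new_traffic_percent

    def _bucket(self, user_id):
        # Stable hash so a given user always lands in the same bucket (0..99).
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100

    def pick(self, user_id):
        use_new = self._bucket(user_id) < self.new_traffic_percent
        return self.new_index if use_new else self.old_index

    def rollback(self):
        # Undo is one assignment because the old index was never deleted.
        self.new_traffic_percent = 0


router = IndexRouter(old_index="index_old", new_index="index_new")
if ready_to_cut_over(old_recall=0.86, new_recall=0.88):
    router.new_traffic_percent = 5  # start with a small slice of traffic
print(router.pick("user-123"))
```

Hashing the user id rather than choosing at random keeps each user on one index for the whole rollout, which makes any change in their results easier to attribute.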

The whole reason this matters: a team that re-embeds in place discovers, halfway through, that the system is serving a mix of old and new vectors and every result is subtly wrong, with no way back. The build-beside pattern makes the migration boring, which is exactly what you want a migration to be. Capture model name and version in your chunk metadata (per Chapter 02) so you always know which vectors came from which model.
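
One way to make mixed vectors impossible rather than merely discouraged is to stamp each index with a fingerprint and refuse any write or query whose fingerprint differs. The sketch below assumes three metadata fields (`model_name`, `model_version`, `chunking_version`); these field names, `EmbeddingFingerprint`, and `MixedVectorsError` are hypothetical and should be adapted to whatever your chunk metadata actually records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingFingerprint:
    """Everything that determines what a stored vector means."""
    model_name: str
    model_version: str
    chunking_version: str


class MixedVectorsError(ValueError):
    """Raised when vectors from different models or chunkers would be mixed."""


def check_compatible(index_fp, incoming_fp):
    """Refuse to write to, or query, an index built with a different fingerprint."""
    if index_fp != incoming_fp:
        raise MixedVectorsError(
            f"index was built with {index_fp}, but got {incoming_fp}"
        )


index_fp = EmbeddingFingerprint("model-a", "1.5", "chunker-v2")

# Same model and same chunking: allowed.
check_compatible(index_fp, EmbeddingFingerprint("model-a", "1.5", "chunker-v2"))

# A new model must go into a new index, never into this one.
try:
    check_compatible(index_fp, EmbeddingFingerprint("model-b", "1.0", "chunker-v2"))
except MixedVectorsError as err:
    print("refused:", err)
```

Including the chunking version in the fingerprint means a re-chunk trips the same guard as a model change.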

Quantization — cheaper vectors, mostly free

A full embedding stores each dimension as a 32-bit floating point number. Quantization stores them in less — 8 bits, or even 1 bit per dimension (binary quantization) — shrinking your index dramatically. A 1024-dimension float vector is 4 KB; the same vector quantized to 8 bits is 1 KB; binary is 128 bytes. Across fifty million vectors that is the difference between an index that fits in memory and one that doesn't.
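
The arithmetic is worth running for your own numbers. The helper below is a small sketch (the name `index_size_gb` is made up for illustration) that counts raw vector payload only, in decimal gigabytes; real indexes add overhead for graph links, ids, and metadata, so treat the results as lower bounds.

```python
def index_size_gb(n_vectors, dims, bits_per_dim):
    """Raw vector storage in decimal gigabytes, excluding any index overhead."""
    bytes_per_vector = dims * bits_per_dim // 8
    return n_vectors * bytes_per_vector / 1e9


n_vectors = 50_000_000
dims = 1024
for label, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    size = index_size_gb(n_vectors, dims, bits)
    print(f"{label:8s} {size:6.1f} GB")
```

For fifty million 1024-dimension vectors this works out to about 204.8 GB at full precision, 51.2 GB at 8 bits, and 6.4 GB in binary.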

The trade-off is some accuracy loss, but it is usually smaller than people fear — often a point or two of recall for an 8-bit version, recoverable by re-ranking the top results with full-precision vectors. The honest guidance: at small scale, don't bother — full precision is simpler and the storage is trivial. At large scale, quantization is one of the highest-leverage cost savings available, and the accuracy cost is modest if you re-rank. We see the search-speed side of this in Chapter 05.
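
The quantize-then-re-rank idea can be shown on a toy example. The sketch below uses binary quantization (one sign bit per dimension) with Hamming distance for the coarse pass, then re-scores a shortlist with the full-precision vectors. It is a pure-Python illustration with made-up four-dimensional vectors; the function names are hypothetical, and a production system would use its vector database's built-in quantization rather than code like this.

```python
def binarize(vec):
    """Pack the sign of each dimension into one integer (1 bit per dimension)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two codes differ."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, full_vectors, codes, shortlist=3, k=1):
    q_code = binarize(query)
    # Stage 1: cheap coarse pass over the 1-bit codes.
    candidates = sorted(codes, key=lambda cid: hamming(q_code, codes[cid]))[:shortlist]
    # Stage 2: re-rank only the shortlist with full-precision vectors.
    return sorted(candidates, key=lambda cid: -dot(query, full_vectors[cid]))[:k]


# Toy data: a, b and d share the same sign pattern, so their binary codes are identical.
full_vectors = {
    "a": [0.9, 0.1, -0.2, 0.3],
    "b": [0.1, 0.9, -0.2, 0.3],
    "c": [-0.5, -0.4, 0.6, -0.1],
    "d": [0.2, 0.3, -0.9, 0.1],
}
codes = {cid: binarize(vec) for cid, vec in full_vectors.items()}

query = [0.8, 0.2, -0.1, 0.2]
print(search(query, full_vectors, codes))
```

The binary pass cannot tell "a", "b", and "d" apart, which is the accuracy loss in miniature; the full-precision re-rank of the shortlist recovers "a" as the best match.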

When this fails

Choosing by leaderboard alone. The top model on a public benchmark can underperform a lower-ranked one on your specific domain. Always run your own fifty-question eval before committing.
Mismatched query and document embedding. Some models expect an instruction or prefix on queries (like "query: ...") that documents do not get, and some expect a different prefix on documents as well. Get this wrong and every similarity score is quietly degraded. Read the model card; embed queries and documents exactly as the model expects.
Re-embedding in place. The migration killer. Mixing old and new vectors in one index produces silently wrong results with no rollback. Always build beside.
Forgetting to re-embed after a chunking change. If you change your chunking strategy, every vector is now stale — they describe the old chunks. A chunking change is an embedding migration. Treat it as one.
Over-quantizing at small scale. Binary quantization on a 10,000-chunk index saves a few megabytes and costs you accuracy for no real benefit. Match the optimisation to the scale.
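
The prefix failure above is easiest to avoid by routing every string through one function that knows its role. The sketch below assumes a hand-maintained prefix table; the model names and prefix strings in it are placeholders, and the real values must be copied from each model's card.

```python
# Hypothetical prefix table. The strings here are placeholders: copy the real
# ones from the model card of whichever model you actually deploy.
PREFIXES = {
    "example-prefixed-model": {"query": "query: ", "document": "passage: "},
    "example-plain-model": {"query": "", "document": ""},
}


def prepare(texts, role, model_name, prefixes=PREFIXES):
    """Attach the prefix a model expects for the given role before embedding."""
    if role not in ("query", "document"):
        raise ValueError(f"role must be 'query' or 'document', got {role!r}")
    prefix = prefixes[model_name][role]
    return [prefix + text for text in texts]


queries = prepare(["how do I cancel my subscription"], "query", "example-prefixed-model")
docs = prepare(["To end your subscription, open Account ..."], "document", "example-prefixed-model")
print(queries[0])
print(docs[0])
```

The prepared strings are what you would pass to the embedding call, so the eval loop and the production path cannot drift apart on prefixes.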

Practice — before you read the next chapter

Build your fifty-question eval set

This is the most valuable artifact in your whole project, and you can start it today. Collect fifty real questions from your domain — from support logs, from colleagues, from your own head — and for each, find the chunk that actually answers it. This set will choose your embedding model, tune your chunking, and prove your improvements for the life of the system. Build it once; use it forever.

Run the bake-off

Take the code above and run three candidate models against your eval set. Look at the recall numbers and the model sizes together. Which model gives you the most quality per unit of cost? Notice whether the leaderboard ranking predicted your result — it often won't, and that surprise is the lesson.
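
One simple way to read recall and size together is to pick the smallest model whose recall is within a tolerance of the best. The sketch below uses the recall figures and dimensions from the bake-off above, with float32 bytes per vector as the cost proxy; the function name and the 0.03 tolerance are assumptions to tune for your own project.

```python
def smallest_good_enough(candidates, tolerance=0.03):
    """Return the cheapest model whose recall is within `tolerance` of the best.

    Each candidate is (name, recall, bytes_per_vector).
    """
    best = max(recall for _, recall, _ in candidates)
    good = [c for c in candidates if c[1] >= best - tolerance]
    return min(good, key=lambda c: c[2])[0]


candidates = [
    ("all-MiniLM-L6-v2", 0.74, 384 * 4),         # 384 dims, float32
    ("BAAI/bge-base-en-v1.5", 0.86, 768 * 4),    # 768 dims, float32
    ("BAAI/bge-large-en-v1.5", 0.88, 1024 * 4),  # 1024 dims, float32
]
print(smallest_good_enough(candidates))
```

With these numbers the rule selects the base model; tightening the tolerance to 0.01 would select the large one instead, which makes the trade-off explicit rather than implied.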

Write your migration runbook

Before you ever need it, write down the four migration steps for your specific stack: where the second index lives, how you shadow-test, what the cutover flag is, how long you keep the old index. A migration runbook written calmly today is worth ten times one written in a panic when a better model ships.
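
If it helps to keep the runbook next to the code, the four answers can live in a small structure that flags anything still blank. This is only a sketch: `MigrationRunbook`, its field names, and the example values are hypothetical, not a prescribed format.

```python
from dataclasses import dataclass, fields


@dataclass
class MigrationRunbook:
    second_index_location: str     # where the new index is built
    shadow_test_procedure: str     # how old and new are compared before cutover
    cutover_flag: str              # the config switch that moves traffic
    old_index_retention_days: int  # how long rollback stays one flip away

    def missing(self):
        """Names of steps that are still blank or zero."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example values are placeholders for illustration only.
runbook = MigrationRunbook(
    second_index_location="separate collection in the same cluster",
    shadow_test_procedure="run the fifty-question eval against both indexes",
    cutover_flag="",
    old_index_retention_days=14,
)
print("still to decide:", runbook.missing())
```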

Takeaways

The embedding model sets the ceiling on retrieval quality. Choose it deliberately, not by leaderboard reflex.
Weigh domain and language fit first, then hosting and privacy, then cost, then dimension. Leaderboard score is a tiebreaker, not a decision.
Evaluate candidates on fifty real questions from your own domain. It takes an afternoon and beats any public benchmark for your decision.
Migrate by building a new index beside the old one, shadow-testing on your golden set, cutting over behind a flag, and keeping the old index for rollback. Never re-embed in place.
Quantization is a large-scale cost saver with modest accuracy cost; skip it at small scale.
A chunking change is an embedding migration — re-embed when you re-chunk.

Next chapter: The vector index — databases and the geometry. Your vectors exist; now they need a home that can find nearest neighbours among millions in milliseconds. We'll tour the vector databases honestly and look at the geometry — HNSW and IVF — beneath them, without hand-waving.
