Every craft has a private language, and you cannot move quickly in it until the words stop slowing you down. RAG's vocabulary is small — maybe fifteen terms carry ninety percent of the conversation — but the terms are slippery. People say "embedding" when they mean "vector," "index" when they mean "database," and "relevance" when they mean three different measurable things. This chapter pins each term to one clear meaning and shows how they connect, so that every chapter after this one reads at full speed.
Read it once now, end to end, even if some terms are familiar. Then keep it open as a reference for the rest of the series. The goal is not to memorise definitions; it is to make the words invisible, so you can think about the ideas behind them.
The cleanest way to hold the vocabulary is to notice that the terms cluster around the four stages of work you already saw on the series overview: preparing data, retrieving from it, generating an answer, and measuring quality. Learn the words in those four groups and they stop being a list and start being a map.
Document. A single source item before any processing: one PDF, one web page, one support article, one wiki entry. Documents are what you have; they are rarely what you search. Almost always you break them into smaller pieces first.
Chunk. A piece of a document, sized to be retrieved and read on its own. A chunk might be a paragraph, a section, or a few sentences. Chunking is the act of splitting documents into chunks, and it is the single most consequential decision in the whole pipeline, important enough that it gets its own chapter. Hold this thought: the chunk, not the document, is the unit of retrieval.
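As a rough illustration of what chunking does, the sketch below splits a document on blank lines and packs paragraphs into chunks up to a size limit. The name `chunk_by_paragraph` and the `max_chars` limit are assumptions made for this example, not a recommended strategy; real chunkers also weigh structure, overlap, and token counts.

```python
def chunk_by_paragraph(document: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate      # still fits (or is the first paragraph of a chunk)
        else:
            chunks.append(current)   # close the current chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each returned string is one unit of retrieval; the original document is never searched as a whole.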
Embedding. The process, and the result, of converting a piece of text into a list of numbers that captures its meaning. "Embedding the chunk" is the act; "the chunk's embedding" is the resulting list of numbers. A model called an embedding model does this conversion. Text that means similar things produces number-lists that are close together; text that means different things produces number-lists that are far apart. That closeness is the entire mechanism behind semantic search.
Vector. The list of numbers itself, considered as a point in space. "Embedding" and "vector" are often used interchangeably, and in casual conversation that's fine, but the precise distinction is worth keeping: embedding is the meaning-preserving representation; vector is the mathematical object, an ordered list of numbers, that holds it. Every embedding is stored as a vector; not every vector is an embedding.
Dimension. How many numbers are in the vector. A 1,024-dimension embedding is a list of 1,024 numbers. More dimensions can capture more nuance but cost more to store and compare. Common sizes in 2026 run from 384 to 3,072. You will choose a dimension when you choose an embedding model in Chapter 04.
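To make embedding, vector, and dimension concrete, here is a deliberately crude stand-in for an embedding model. `toy_embed` is hypothetical: it hashes words into `dim` buckets, so it captures word overlap rather than meaning, which a real learned model does far better. What it does show is the shape of the thing: text in, a fixed-length list of numbers out.

```python
import zlib

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for an embedding model (NOT semantic): count words per hash bucket."""
    vector = [0.0] * dim                      # the vector: an ordered list of `dim` numbers
    for word in text.lower().split():
        bucket = zlib.crc32(word.encode("utf-8")) % dim   # deterministic across runs
        vector[bucket] += 1.0
    return vector

embedding = toy_embed("cancel my subscription", dim=8)
print(len(embedding))  # the dimension: 8
```

Swapping in a real embedding model changes how the numbers are produced, not the contract: every text maps to a vector of the same dimension.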
Index. A data structure built over all your vectors so that, given a new vector, you can find its nearest neighbours quickly. Without an index you would compare the query against every stored vector one by one, which is fine for a thousand chunks and hopeless for fifty million. The index is the difference between a search that takes a millisecond and one that takes a minute. Note the precise usage: the index is the structure; the vector database is the system that stores and serves it. People blur the two; Chapter 05 keeps them apart.
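The one-by-one comparison that an index replaces can be sketched in a few lines. `brute_force_nearest` is a hypothetical name, and Euclidean distance via `math.dist` is just one notion of nearness chosen for the example.

```python
import heapq
import math

def brute_force_nearest(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Return positions of the k stored vectors closest to the query.

    This is a linear scan: every stored vector is compared with the query,
    so the cost grows in step with the number of vectors.
    """
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),
    )

stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.1]]
print(brute_force_nearest([1.0, 1.0], stored, k=2))  # [1, 3]
```

An index exists to return roughly the same neighbours without touching every stored vector.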
Retriever. The component that takes a query and returns the most relevant chunks. In the simplest system, the retriever embeds the query, asks the index for nearest neighbours, and returns them. Retrievers can be far more elaborate (combining several search methods, rewriting the query first), but the job is always the same: question in, relevant chunks out.
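The simplest retriever described above is mostly wiring. In this sketch the class name `SimpleRetriever` and the two injected callables are assumptions: `embed` stands in for whatever embedding model you use, and `search` for whatever index you query.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]

@dataclass
class SimpleRetriever:
    embed: Callable[[str], Vector]               # text -> vector (assumed embedding model)
    search: Callable[[Vector, int], list[int]]   # (query vector, k) -> chunk positions (assumed index)
    chunks: list[str]                            # chunk texts, aligned with the index positions

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        query_vector = self.embed(query)          # 1. embed the query
        positions = self.search(query_vector, k)  # 2. ask the index for nearest neighbours
        return [self.chunks[i] for i in positions]  # 3. return the matching chunks
```

Keeping the embedding model and the index behind plain callables means a more elaborate retriever can replace either piece without changing the question-in, chunks-out contract.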
Similarity. A number that says how close two vectors are. Higher similarity means more alike in meaning. The dominant measure is cosine similarity, which we picture in a moment. When a retriever ranks chunks, it ranks them by similarity to the query.
Top-k. How many chunks the retriever returns. "Top-k = 5" means "give me the five most similar chunks." Choosing k is a real trade-off: too few and you may miss the answer; too many and you drown the model in noise and pay for tokens you didn't need. We tune k repeatedly through the retrieval chapters.
Reranker. A second-stage component that takes the chunks the retriever found and re-orders them more carefully. The retriever is fast and approximate; the reranker is slower and precise. Run the retriever to get the top 50 candidates cheaply, then the reranker to pick the best 5 accurately. It is a powerful pattern with its own chapter.
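The retrieve-then-rerank pattern can be sketched as a two-stage function. Both callables are placeholders: `retrieve` stands for any fast first-stage retriever and `score` for any slower, more careful relevance scorer; neither refers to a specific library.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],   # fast, approximate first stage (assumed)
    score: Callable[[str, str], float],          # slow, careful (query, chunk) scorer (assumed)
    n_candidates: int = 50,
    k: int = 5,
) -> list[str]:
    """Fetch many candidates cheaply, then keep the k that the reranker scores highest."""
    candidates = retrieve(query, n_candidates)
    reranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return reranked[:k]
```

The expensive scorer runs on at most `n_candidates` chunks per query, never on the whole collection, which is what keeps the second stage affordable.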
Context. The retrieved chunks, assembled and placed into the model's prompt as evidence. "The context" in RAG specifically means the fetched material the model is meant to answer from, distinct from the broader "context window," which is the model's total token capacity. When someone says "put it in the context," they mean: include this chunk as evidence.
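Assembling the context is, at its simplest, string formatting. The sketch below is one possible layout, not a recommended prompt: the function name `build_prompt`, the instruction wording, and the bracketed source ids are all assumptions for illustration.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Place retrieved chunks into the prompt as labelled evidence.

    `chunks` is a list of (source_id, text) pairs; the ids are hypothetical
    labels that let an answer point back to where a claim came from.
    """
    evidence = "\n\n".join(f"[{source_id}] {text}" for source_id, text in chunks)
    return (
        "Answer the question using only the evidence below. "
        "If the evidence does not contain the answer, say so.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("How long do refunds take?", [("kb-12", "Refunds are issued within 14 days.")]))
```

Everything between "Evidence:" and "Question:" is the context in the RAG sense; the context window is the larger budget this whole string has to fit inside.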
Grounding. The degree to which an answer is supported by the provided context. A well-grounded answer says only what the evidence supports. An ungrounded answer wanders off into the model's general knowledge, which may be outdated or simply wrong for your domain. Maximising grounding is most of the work in Chapter 09.
Citation. A pointer in the answer back to the specific chunk or document the claim came from. Citations are what make a RAG answer trustworthy and checkable: the user can click through and verify. A system that answers without citations is asking to be believed on faith, which in most serious settings is unacceptable.
Hallucination. When the model states something not supported by the evidence (and often not true at all) with full confidence. RAG reduces hallucination by giving the model real evidence to lean on, but it does not eliminate it; a model can still ignore the context or over-extend beyond it. Anyone who tells you RAG "solves hallucination" has not measured carefully.
Recall. Of all the chunks that should have been retrieved for a query, what fraction did the retriever actually find? Recall measures whether the right evidence made it into the context at all. If recall is low, nothing downstream can save you: the model cannot answer from evidence it never received.
Precision. Of all the chunks the retriever returned, what fraction were actually relevant? Precision measures noise. Low precision means you are padding the context with junk, which costs tokens and can distract the model. Recall and precision pull against each other (retrieve more and recall rises while precision usually falls), and balancing them is a recurring theme.
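Both retrieval metrics reduce to counting overlaps between two sets of chunk ids. A minimal sketch, assuming you already know which chunk ids are relevant for the query (the function name and the ids are made up for the example):

```python
def recall_and_precision(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Compute (recall, precision) for one query from chunk ids."""
    returned = set(retrieved)
    hits = len(returned & relevant)                         # relevant chunks that were actually returned
    recall = hits / len(relevant) if relevant else 0.0      # share of the relevant chunks found
    precision = hits / len(returned) if returned else 0.0   # share of the returned chunks that matter
    return recall, precision

# Four chunks returned; three chunks are relevant, two of which were found.
recall, precision = recall_and_precision(["c1", "c2", "c3", "c4"], {"c1", "c4", "c9"})
print(recall, precision)  # recall is 2/3, precision is 0.5
```

Returning more chunks can only add hits, so recall never drops as k grows, while every extra irrelevant chunk lowers precision.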
Faithfulness. Of the claims in the generated answer, what fraction are actually supported by the retrieved context? Faithfulness is grounding, made measurable. It is the single most important generation metric, because an unfaithful answer is worse than no answer: it is a confident lie with citations stapled on. We make this measurable in Chapter 11.
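As arithmetic, faithfulness is a simple ratio; the hard part, which this sketch does not attempt, is splitting an answer into claims and deciding whether each one is supported. Here those per-claim verdicts are assumed to come from a human reviewer or some other judge.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of an answer's claims that the retrieved context supports.

    `claim_supported` holds one verdict per claim; how the verdicts are
    produced is outside this sketch (an assumption, not a method).
    """
    if not claim_supported:
        raise ValueError("an answer with no claims has no faithfulness score")
    return sum(claim_supported) / len(claim_supported)

# Four claims in the answer, one of them unsupported by the context.
print(faithfulness([True, True, False, True]))  # 0.75
```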
Eval and golden set. An eval is a repeatable test of your system's quality on a fixed set of questions with known-good answers. The golden set is that fixed set: questions paired with the answers (or the chunks) a careful human certified as correct. Without an eval and a golden set, "we made it better" is a feeling, not a fact. Building them is unglamorous and is the difference between engineering and guessing.
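One possible shape for a retrieval eval is sketched below: a golden set of questions paired with certified chunk ids, and a loop that averages recall over it. The names `GoldenSet` and `mean_recall_at_k` and the `retrieve` signature are assumptions for illustration.

```python
from typing import Callable

# Each entry: (question, ids of the chunks a human certified as correct).
GoldenSet = list[tuple[str, set[str]]]

def mean_recall_at_k(
    golden: GoldenSet,
    retrieve: Callable[[str, int], list[str]],   # (question, k) -> chunk ids (assumed interface)
    k: int = 5,
) -> float:
    """Average recall over the golden set. Assumes a non-empty set with non-empty answers."""
    scores = []
    for question, expected_ids in golden:
        found = set(retrieve(question, k))
        scores.append(len(found & expected_ids) / len(expected_ids))
    return sum(scores) / len(scores)
```

Because the golden set is fixed, running this before and after a change turns "we made it better" into a comparison of two numbers.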
Cosine similarity is the one piece of geometry worth genuinely understanding, because it is the heartbeat of semantic search. Forget the formula for a second and picture two arrows starting from the same origin point. Each arrow is a vector: one for the query, one for a chunk. Cosine similarity measures the angle between them, not their length.
Two arrows pointing in almost the same direction have a small angle between them, and the cosine of a small angle is close to 1: high similarity. Two arrows at right angles have a cosine of 0: unrelated. Arrows pointing opposite ways have a cosine of −1, the lowest possible score, which in this picture stands for opposite meaning (real text embeddings rarely land that far apart). That is the whole trick. The embedding model's job is to place text so that meaning becomes direction; cosine similarity then reads the meaning back out as an angle.
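The three cases can be checked with a few lines of plain Python. This is a minimal sketch using the standard definition (dot product divided by the product of the lengths); the two-dimensional vectors are made up so the arrows are easy to picture.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors. Undefined for an all-zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

query = [1.0, 0.0]
print(cosine_similarity(query, [2.0, 0.1]))   # nearly the same direction: just under 1
print(cosine_similarity(query, [0.0, 3.0]))   # right angle: 0.0
print(cosine_similarity(query, [-4.0, 0.0]))  # opposite direction: -1.0
```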
Why angle and not distance? Because length, in embedding space, often encodes things you don't care about for relevance — like how long or emphatic the text is. By measuring angle alone, cosine similarity asks "do these point the same way?" rather than "are these the same size?" That is usually the right question for "do these mean the same thing?" The geometry returns in earnest in Chapter 05.
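A small numeric check makes the angle-versus-distance point concrete. The two vectors below are invented: they point the same way, but one is ten times longer.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors. Undefined for an all-zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

quiet = [1.0, 2.0]
loud = [10.0, 20.0]   # same direction as `quiet`, ten times the length

print(cosine_similarity(quiet, loud))  # approximately 1: the angle between them is zero
print(math.dist(quiet, loud))          # roughly 20: straight-line distance sees them as far apart
```

Cosine similarity treats the pair as identical in direction, while Euclidean distance is dominated by the difference in length.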
People talk about "naive," "advanced," and "modular" RAG as if they were rival camps. They are better understood as three points on a single road of increasing sophistication. Knowing where a system sits tells you what it can and can't do.
| Level | Shape | What it adds | Build this when |
|---|---|---|---|
| Naive | embed → retrieve top-k → generate | Nothing extra. The straight line. | Always, first. It is your baseline and your honest reference point. |
| Advanced | + query rewriting, hybrid search, reranking | Better retrieval before and after the index. | The naive baseline's failures are clearly about retrieval quality. |
| Modular | + routing, multiple retrievers, agentic loops | Different strategies for different query types. | You have measured, distinct query classes that need distinct handling. |
My take. Build naive first, always, and measure it before you reach for anything fancier. The most common waste in this field is teams building "modular" systems for problems a naive pipeline would have solved — and never knowing it, because they never built the baseline to compare against. Sophistication you can't justify with a measurement is just cost.
The failure mode for a vocabulary chapter is using the words loosely, and loose words cause real engineering mistakes. Three that recur:
Find any RAG product or blog post. Re-describe what it does using only this chapter's vocabulary: what are its chunks, what does its retriever do, where does reranking happen if at all, which level (naive, advanced, modular) is it? Notice which details the writeup leaves vague — vagueness usually hides either a weakness or a thing the author didn't understand either.
Without any code, predict the relative cosine similarity of three pairs: ("cancel my subscription", "how do I unsubscribe"), ("cancel my subscription", "upgrade my plan"), ("cancel my subscription", "the weather in Mumbai"). Rank them from most to least similar and write one sentence on why. This builds the intuition that makes retrieval results legible later.
Write the single sentence you would use to claim your future system "works." Force yourself to use recall, precision, or faithfulness with a number in it. If you can't write the sentence, you have found the eval you need to build — which is exactly the point of Chapter 11.
Next chapter: Data prep — parsing the messy real world. Before you can chunk or embed anything, you have to get clean text out of PDFs, HTML, tables, and scans. It is the unglamorous 40% of real RAG work, and the chapter almost every other tutorial skips.