RAG: A Field Manual for Building LLM Systems That Use Your Data

Foundations — the RAG vocabulary

Every craft has a private language, and you cannot move quickly in it until the words stop slowing you down. RAG's vocabulary is small — maybe fifteen terms carry ninety percent of the conversation — but the terms are slippery. People say "embedding" when they mean "vector," "index" when they mean "database," and "relevance" when they mean three different measurable things. This chapter pins each term to one clear meaning and shows how they connect, so that every chapter after this one reads at full speed.

Read it once now, end to end, even if some terms are familiar. Then keep it open as a reference for the rest of the series. The goal is not to memorise definitions; it is to make the words invisible, so you can think about the ideas behind them.

What you'll take away from this chapter

The fifteen or so terms that recur through every later chapter, grouped into four families that mirror the pipeline
The difference between an embedding, a vector, and an index — three words people use interchangeably and shouldn't
What cosine similarity actually measures, explained as geometry you can picture
The three levels of RAG — naive, advanced, modular — and which one you should build first
The handful of metric terms (recall, precision, faithfulness) that let you say "it got better" and mean it

The four families, mapped to the pipeline

The cleanest way to hold the vocabulary is to notice that the terms cluster around the four stages of work you already saw in the series overview: preparing data, retrieving from it, generating an answer, and measuring quality. Learn the words in those four groups and they stop being a list and start being a map.

Figure: the four families mapped to the pipeline. Blue families are the machinery (data and retrieval); green families are the outcome (generation and quality). Almost every sentence in this series uses a word from at least two of these boxes.

Family 1 — Data

Document

A single source item before any processing — one PDF, one web page, one support article, one wiki entry. Documents are what you have; they are rarely what you search. Almost always you break them into smaller pieces first.

Chunk

A piece of a document, sized to be retrieved and read on its own. A chunk might be a paragraph, a section, or a few sentences. Chunking is the act of splitting documents into chunks, and it is the single most consequential decision in the whole pipeline — important enough that it gets its own chapter. Hold this thought: the chunk, not the document, is the unit of retrieval.
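To make "splitting documents into chunks" concrete, here is a minimal sketch of one possible strategy: pack whole paragraphs into chunks up to a character budget. The function name `chunk_paragraphs` and the 500-character default are assumptions for illustration, not a recommendation; the chunking chapter covers the real trade-offs.

```python
def chunk_paragraphs(document: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Hypothetical helper for illustration: paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Keep packing. A single oversized paragraph becomes its own chunk.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


doc = "A" * 300 + "\n\n" + "B" * 300 + "\n\n" + "C" * 100
chunks = chunk_paragraphs(doc, max_chars=500)
print([len(c) for c in chunks])  # [300, 402]
```

The first paragraph fills a chunk on its own; the second and third fit together, so the document of three paragraphs becomes two retrievable units.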

Embedding

The process — and the result — of converting a piece of text into a list of numbers that captures its meaning. "Embedding the chunk" is the act; "the chunk's embedding" is the resulting list of numbers. A model called an embedding model does this conversion. Text that means similar things produces number-lists that are close together; text that means different things produces number-lists that are far apart. That closeness is the entire mechanism behind semantic search.

Vector

The list of numbers itself, considered as a point in space. "Embedding" and "vector" are often used interchangeably, and in casual conversation that's fine, but the precise distinction is worth keeping: embedding is the meaning-preserving representation; vector is the mathematical object — an ordered list of numbers — that holds it. Every embedding is stored as a vector; not every vector is an embedding.

Dimension

How many numbers are in the vector. A 1,024-dimension embedding is a list of 1,024 numbers. More dimensions can capture more nuance but cost more to store and compare. Common sizes in 2026 run from 384 to 3,072. You will choose a dimension when you choose an embedding model in Chapter 04.
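The storage side of that trade-off is simple arithmetic. The sketch below assumes each number is stored as a 32-bit float (4 bytes) and ignores index overhead and metadata, both of which add to the real figure.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def vector_storage_bytes(num_chunks: int, dimension: int) -> int:
    """Raw bytes needed to store one vector per chunk, with no index overhead."""
    return num_chunks * dimension * BYTES_PER_FLOAT32


for dim in (384, 1024, 3072):
    gb = vector_storage_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>5} dims x 1M chunks = {gb:.2f} GB")
```

For a million chunks this works out to roughly 1.54 GB at 384 dimensions, 4.10 GB at 1,024, and 12.29 GB at 3,072, so going from the smallest to the largest common size costs eight times the raw storage.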

Family 2 — Retrieval

Index

A data structure built over all your vectors so that, given a new vector, you can find its nearest neighbours quickly. Without an index you would compare the query against every stored vector one by one: fine for a thousand chunks, hopeless for fifty million. At that scale the index is the difference between a search that takes milliseconds and one that takes seconds or minutes. Most production indexes buy that speed by being approximate, returning almost all of the true nearest neighbours rather than a guaranteed exact set. Note the precise usage: the index is the structure; the vector database is the system that stores and serves it. People blur the two; Chapter 05 keeps them apart.

Retriever

The component that takes a query and returns the most relevant chunks. In the simplest system, the retriever embeds the query, asks the index for nearest neighbours, and returns them. Retrievers can be far more elaborate — combining several search methods, rewriting the query first — but the job is always the same: question in, relevant chunks out.
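The "simplest system" described above can be sketched in a few lines. Everything here is a stand-in: `toy_embed` is a hypothetical word-count function that captures word overlap rather than meaning, and the "index" is a brute-force scan over every chunk, which is exactly what a real index exists to avoid.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word counts, not meaning.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Question in, the k most similar chunks out (brute-force scan)."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


chunks = [
    "refunds are issued within five days",
    "reset your password from the login page",
    "contact support to cancel a subscription",
]
print(retrieve("how do i reset my password", chunks, k=1))
# ['reset your password from the login page']
```

Swap `toy_embed` for a real embedding model and the scan for an index, and the shape of the retriever stays the same.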

Similarity

A number that says how close two vectors are. Higher similarity means more alike in meaning. The dominant measure is cosine similarity, which we picture in a moment. When a retriever ranks chunks, it ranks them by similarity to the query.

Top-k

How many chunks the retriever returns. "Top-k = 5" means "give me the five most similar chunks." Choosing k is a real trade-off: too few and you may miss the answer; too many and you drown the model in noise and pay for tokens you didn't need. We tune k repeatedly through the retrieval chapters.

Reranker

A second-stage component that takes the chunks the retriever found and re-orders them more carefully. The retriever is fast and approximate; the reranker is slower and precise. Run the retriever to get the top 50 candidates cheaply, then the reranker to pick the best 5 accurately. It is a powerful pattern with its own chapter.
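The two-stage pattern can be sketched with toy scorers. Both scoring functions below are assumptions chosen to keep the example self-contained: `word_overlap` plays the fast, approximate retriever and `bigram_overlap` plays the slower, more careful reranker. A real reranker would be a model, not a word-pair counter.

```python
def word_overlap(query: str, chunk: str) -> float:
    # Cheap first-stage score: distinct words shared with the query.
    return float(len(set(query.split()) & set(chunk.split())))


def bigrams(text: str) -> set[tuple[str, str]]:
    words = text.split()
    return set(zip(words, words[1:]))


def bigram_overlap(query: str, chunk: str) -> float:
    # Stand-in for a precise second-stage scorer: shared adjacent word pairs.
    return float(len(bigrams(query) & bigrams(chunk)))


def retrieve_then_rerank(query, chunks, first_stage, second_stage,
                         candidates=50, final_k=5):
    # Stage 1: score every chunk cheaply and keep a wide candidate pool.
    pool = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)[:candidates]
    # Stage 2: score only the pool with the expensive scorer.
    return sorted(pool, key=lambda c: second_stage(query, c), reverse=True)[:final_k]


chunks = [
    "my plan to cancel cable",
    "cancel my plan today",
    "upgrade my plan",
    "weather report for mumbai",
]
best = retrieve_then_rerank("cancel my plan", chunks, word_overlap, bigram_overlap,
                            candidates=3, final_k=1)
print(best)  # ['cancel my plan today']
```

The first stage cannot tell the top two chunks apart, since both share three words with the query. The second stage looks at word order and promotes the one that actually matches the request.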

Family 3 — Generation

Context

The retrieved chunks, assembled and placed into the model's prompt as evidence. "The context" in RAG specifically means the fetched material the model is meant to answer from — distinct from the broader "context window," which is the model's total token capacity. When someone says "put it in the context," they mean: include this chunk as evidence.

Grounding

The degree to which an answer is supported by the provided context. A well-grounded answer says only what the evidence supports. An ungrounded answer wanders off into the model's general knowledge — which may be outdated or simply wrong for your domain. Maximising grounding is most of the work in Chapter 09.

Citation

A pointer in the answer back to the specific chunk or document the claim came from. Citations are what make a RAG answer trustworthy and checkable — the user can click through and verify. A system that answers without citations is asking to be believed on faith, which in most serious settings is unacceptable.
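One common way to wire context and citations together is to number the chunks in the prompt and ask the model to cite those numbers. The sketch below assumes a `[n]` citation convention and an instruction wording of my own choosing; neither is prescribed by this series, and the model call itself is left out.

```python
import re


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the context: numbered chunks placed in the prompt as evidence."""
    evidence = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the evidence below. "
        "Cite sources as [n] after each claim.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )


def cited_chunks(answer: str, chunks: list[str]) -> list[str]:
    """Resolve [n] markers in an answer back to the chunks they point at."""
    ids = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    # Ignore citations that point outside the provided evidence.
    return [chunks[i - 1] for i in ids if 1 <= i <= len(chunks)]


chunks = ["Refunds take five days.", "Plans renew monthly."]
print(build_prompt("How long do refunds take?", chunks))
print(cited_chunks("Refunds take five days [1].", chunks))
```

A citation that resolves to nothing, such as `[7]` when only two chunks were provided, is itself a useful signal that the answer has drifted from the evidence.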

Hallucination

When the model states something not supported by the evidence — and often not true at all — with full confidence. RAG reduces hallucination by giving the model real evidence to lean on, but it does not eliminate it; a model can still ignore the context or over-extend beyond it. Anyone who tells you RAG "solves hallucination" has not measured carefully.

Family 4 — Quality

Recall

Of all the chunks that should have been retrieved for a query, what fraction did the retriever actually find? Recall measures whether the right evidence made it into the context at all. If recall is low, nothing downstream can save you — the model cannot answer from evidence it never received.

Precision

Of all the chunks the retriever returned, what fraction were actually relevant? Precision measures noise. Low precision means you are padding the context with junk, which costs tokens and can distract the model. Recall and precision pull against each other: raise k and recall can only hold or rise, while precision usually falls because the extra chunks are mostly irrelevant. Balancing them is a recurring theme.
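Both definitions reduce to counting overlaps between two sets. The sketch below uses made-up chunk ids and a made-up ranking to show the two numbers moving in opposite directions as k grows; it assumes each chunk id appears at most once in the retrieved list.

```python
def recall_and_precision(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for one query, given chunk ids."""
    hits = len(set(retrieved) & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical ranking from a retriever, best first, and the chunks a human marked relevant.
ranking = ["c1", "c7", "c2", "c9", "c4", "c8", "c3", "c5", "c6", "c0"]
relevant = {"c1", "c2", "c3"}

for k in (1, 3, 10):
    r, p = recall_and_precision(ranking[:k], relevant)
    print(f"k={k:<2} recall={r:.2f} precision={p:.2f}")
# k=1  recall=0.33 precision=1.00
# k=3  recall=0.67 precision=0.67
# k=10 recall=1.00 precision=0.30
```

At k=10 every relevant chunk has been found, but seven of the ten chunks in the context are noise.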

Faithfulness

Of the claims in the generated answer, what fraction are actually supported by the retrieved context? Faithfulness is grounding, made measurable. It is the single most important generation metric, because an unfaithful answer is worse than no answer — it is a confident lie with citations stapled on. We make this measurable in Chapter 11.
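As arithmetic, faithfulness is a ratio over per-claim judgments. The sketch below assumes the hard part is already done: something, whether a human reviewer or a judge model, has split the answer into claims and marked each one as supported or not. The function name and the zero score for an empty answer are illustrative choices.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no checkable claims scores zero
    return sum(claim_supported) / len(claim_supported)


# Hypothetical judgments for a four-claim answer: the third claim has no support.
judgments = [True, True, False, True]
print(faithfulness(judgments))  # 0.75
```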

Eval and golden set

An eval is a repeatable test of your system's quality on a fixed set of questions with known-good answers. The golden set is that fixed set — questions paired with the answers (or the chunks) a careful human certified as correct. Without an eval and a golden set, "we made it better" is a feeling, not a fact. Building them is unglamorous and is the difference between engineering and guessing.
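A minimal eval is a loop over the golden set that produces one number. The sketch below assumes a golden set of questions paired with the chunk ids a human certified as correct, and scores a retriever by mean recall at k. The retriever here is a hypothetical lookup table standing in for a real one.

```python
from typing import Callable

GoldenSet = list[tuple[str, set[str]]]  # (question, chunk ids certified as correct)


def mean_recall_at_k(golden_set: GoldenSet,
                     retrieve: Callable[[str, int], list[str]],
                     k: int = 5) -> float:
    """Average, over all golden questions, of the fraction of expected chunks found."""
    scores = []
    for question, expected_ids in golden_set:
        found = set(retrieve(question, k))
        scores.append(len(found & expected_ids) / len(expected_ids))
    return sum(scores) / len(scores)


golden_set: GoldenSet = [
    ("how do I cancel?", {"c3"}),
    ("how long do refunds take?", {"c1", "c4"}),
]

# Hypothetical canned results standing in for a real retriever.
canned = {
    "how do I cancel?": ["c3", "c2"],
    "how long do refunds take?": ["c1", "c9"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return canned[question][:k]


print(mean_recall_at_k(golden_set, fake_retrieve, k=2))  # 0.75
```

Because the questions and expected chunks are fixed, rerunning this after a change tells you whether retrieval actually improved.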

What cosine similarity actually measures

This is the one piece of geometry worth genuinely understanding, because it is the heartbeat of semantic search. Forget the formula for a second and picture two arrows starting from the same origin point. Each arrow is a vector — one for the query, one for a chunk. Cosine similarity measures the angle between them, not their length.

Figure: Left, two phrases that mean nearly the same thing point in nearly the same direction (small angle, high similarity). Right, unrelated phrases point in very different directions (large angle, low similarity). Real embeddings live in hundreds or thousands of dimensions, but the intuition is exactly this.

Two arrows pointing in almost the same direction have a small angle between them, and the cosine of a small angle is close to 1: high similarity. Two arrows at right angles have a cosine of 0: unrelated. Arrows pointing opposite ways have a cosine of −1. That is the whole trick. The embedding model's job is to place text so that meaning becomes direction; cosine similarity then reads the meaning back out as an angle. One caveat: scores between real text embeddings rarely get near −1, and antonyms often score fairly high because they appear in similar contexts, so read a low score as "unrelated" rather than "opposite in meaning."
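The formula behind the picture is the dot product of the two vectors divided by the product of their lengths: cos(θ) = (a · b) / (|a| |b|). A minimal sketch in plain Python, using two-dimensional arrows so the three landmark cases are easy to check by eye:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # right angle: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite: -1.0
```

The same function works unchanged on a 1,024-number embedding; only the length of the lists differs.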

Why angle and not distance? Because length, in embedding space, often encodes things you don't care about for relevance — like how long or emphatic the text is. By measuring angle alone, cosine similarity asks "do these point the same way?" rather than "are these the same size?" That is usually the right question for "do these mean the same thing?" The geometry returns in earnest in Chapter 05.
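A small numeric sketch makes the difference visible. The three vectors below are invented for illustration: `longer` points the same way as `query` but is ten times its length, while `tilted` has the same length as `query` but points in a slightly different direction.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [3.0, 4.0]
longer = [30.0, 40.0]  # same direction, ten times the length
tilted = [4.0, 3.0]    # same length, different direction

print(cosine_similarity(query, longer), math.dist(query, longer))
print(cosine_similarity(query, tilted), math.dist(query, tilted))
```

Cosine similarity scores `longer` as a perfect match (1.0) and `tilted` as slightly worse (about 0.96). Straight-line distance says the opposite: `tilted` is about 1.41 away while `longer` is 45 away. Which ranking you want depends on whether length carries meaning, and for relevance it usually does not.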

The three levels of RAG

People talk about "naive," "advanced," and "modular" RAG as if they were rival camps. They are better understood as three points on a single road of increasing sophistication. Knowing where a system sits tells you what it can and can't do.

Level	Shape	What it adds	Build this when
Naive	embed → retrieve top-k → generate	Nothing extra. The straight line.	Always, first. It is your baseline and your honest reference point.
Advanced	+ query rewriting, hybrid search, reranking	Better retrieval before and after the index.	The naive baseline's failures are clearly about retrieval quality.
Modular	+ routing, multiple retrievers, agentic loops	Different strategies for different query types.	You have measured, distinct query classes that need distinct handling.

My take. Build naive first, always, and measure it before you reach for anything fancier. The most common waste in this field is teams building "modular" systems for problems a naive pipeline would have solved — and never knowing it, because they never built the baseline to compare against. Sophistication you can't justify with a measurement is just cost.
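The naive shape from the table, embed, retrieve top-k, generate, is short enough to write as one function. The sketch below takes the three stages as parameters so it stays self-contained; the stub implementations are placeholders I made up, standing in for a real embedding model, index, and LLM call.

```python
from typing import Callable


def naive_rag(question: str,
              embed: Callable[[str], list[float]],
              search: Callable[[list[float], int], list[str]],
              generate: Callable[[str, list[str]], str],
              k: int = 5) -> str:
    """The straight line: embed the question, retrieve top-k chunks, generate."""
    query_vector = embed(question)
    chunks = search(query_vector, k)
    return generate(question, chunks)


# Hypothetical stubs so the sketch runs without any external service.
def stub_embed(text: str) -> list[float]:
    return [float(len(text))]


def stub_search(vector: list[float], k: int) -> list[str]:
    return ["chunk-a", "chunk-b", "chunk-c"][:k]


def stub_generate(question: str, chunks: list[str]) -> str:
    return f"{question} -> answered from {len(chunks)} chunks"


print(naive_rag("how do I cancel?", stub_embed, stub_search, stub_generate, k=2))
# how do I cancel? -> answered from 2 chunks
```

Keeping the stages as swappable parameters is one way to make the later levels incremental: reranking or query rewriting becomes a change to `search`, not a rewrite of the pipeline.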

When this fails

The failure mode for a vocabulary chapter is using the words loosely, and loose words cause real engineering mistakes. Three that recur:

Confusing recall with precision. A team reports "90% accuracy" without saying which metric that is. If recall is 90% but precision is 30%, the right evidence is arriving but the context is mostly noise and the model is struggling. That is a very different problem from the reverse, where the context is clean but the evidence is missing. Always name which number you mean.
Treating "the index" and "the database" as one thing. When performance degrades, this confusion sends people tuning the wrong layer. The index is the search structure and its parameters; the database is the service around it. They fail differently and you fix them differently.
Calling everything an "embedding." When a team says "the embedding is slow," do they mean the embedding model (generating vectors), the index (searching them), or the database (serving them)? Three different bottlenecks, three different fixes. Precise words point at the real problem.

Practice — before you read the next chapter

Translate a system you've read about

Find any RAG product or blog post. Re-describe what it does using only this chapter's vocabulary: what are its chunks, what does its retriever do, where does reranking happen if at all, which level (naive, advanced, modular) is it? Notice which details the writeup leaves vague — vagueness usually hides either a weakness or a thing the author didn't understand either.

Reason about angle

Without any code, predict the relative cosine similarity of three pairs: ("cancel my subscription", "how do I unsubscribe"), ("cancel my subscription", "upgrade my plan"), ("cancel my subscription", "the weather in Mumbai"). Rank them from most to least similar and write one sentence on why. This builds the intuition that makes retrieval results legible later.

Draft your metric sentence

Write the single sentence you would use to claim your future system "works." Force yourself to use recall, precision, or faithfulness with a number in it. If you can't write the sentence, you have found the eval you need to build — which is exactly the point of Chapter 11.

Takeaways

The vocabulary clusters into four families that mirror the pipeline: data, retrieval, generation, quality. Learn it as a map, not a list.
The chunk — not the document — is the unit of retrieval. Embedding is the act and the meaning; vector is the mathematical object; index is the structure for searching vectors; the database is the service around it. Keep these distinct.
Cosine similarity measures the angle between two vectors. Same direction means same meaning. That single idea powers semantic search.
Three levels of RAG — naive, advanced, modular — are points on one road. Build naive first and measure it before adding sophistication.
Recall, precision, and faithfulness are the words that turn "it got better" from a feeling into a fact. Say which one you mean, with a number.

Next chapter: Data prep — parsing the messy real world. Before you can chunk or embed anything, you have to get clean text out of PDFs, HTML, tables, and scans. It is the unglamorous 40% of real RAG work, and the chapter almost every other tutorial skips.
