RAG: A Field Manual for Building LLM Systems That Use Your Data

Chunking — the hardest problem

If you remember one decision from this entire series, make it this one. The chunk is the unit of retrieval — the smallest piece your system can fetch and hand to the model. How you cut your documents into chunks decides, before any embedding model or vector database gets involved, what is even findable. Cut badly and the answer to a user's question gets split across two chunks so that neither one, alone, can answer it. No reranker saves you from that. No bigger model saves you from that. The information was destroyed at the knife.

This is also the topic most tutorials wave at ("split into 1000 characters with 200 overlap") before moving on. That advice is not wrong so much as unexamined. This chapter examines it: what the strategies actually are, how they compare when measured, and how to choose for your documents rather than copying a number off a blog.

What you'll take away from this chapter

Why chunking is the highest-leverage decision in the pipeline, and the core tension it must balance
The four chunking strategies — fixed, recursive, semantic, structural — and what each is good and bad at
What overlap is really for, and how much you actually need
Runnable code for a recursive splitter you can adapt today
A measured comparison of chunk sizes, and a taxonomy of the ways chunking quietly fails

The tension at the heart of chunking

Every chunking decision is a single trade-off wearing different costumes. Small chunks are precise but starved of context: a two-sentence chunk embeds cleanly and matches a query sharply, but it may not contain enough surrounding information to actually answer anything. Large chunks are rich but blurry: a full-page chunk holds plenty of context, but its embedding is an average of many ideas, so it matches everything weakly and nothing strongly — and it burns tokens when you send it to the model.

Neither end is right. Too small and the chunk matches sharply but can't answer; too large and it answers but matches weakly. The craft of chunking is finding the middle for your documents — and the middle differs by document type.

There is no universal best chunk size, and anyone who gives you one without asking what your documents look like is guessing. Dense legal text wants different cuts than chatty support articles, which want different cuts than API reference docs. What's universal is the shape of the trade-off, and the obligation to measure where your documents sit on it.

The four strategies

1 · Fixed-size

Cut every N characters (or tokens), regardless of what the text is saying. Simple, fast, and brutal — it will happily slice through the middle of a sentence, a word, or a number. Fixed-size chunking is the strawman the other strategies improve on. Its one virtue is predictability; its vice is that it respects nothing about meaning.
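To see the brutality concretely, here is a minimal sketch of fixed-size splitting. The function name `fixed_split`, the sample passage, and the deliberately tiny size of 20 characters are illustrative assumptions chosen to make the damage visible.

```python
def fixed_split(text, size=20):
    """Cut every `size` characters, ignoring words and sentences."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical sample passage, not taken from a real policy document.
passage = "Returns are accepted within 30 days of purchase for a full refund."
for chunk in fixed_split(passage, size=20):
    print(repr(chunk))
```

With these inputs the second chunk ends partway through "purchase" and the third begins with the rest of the word, which is exactly the kind of fragment that embeds poorly.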

2 · Recursive

Split on a priority list of separators — paragraphs first, then sentences, then words — only descending to a finer separator when a chunk is still too big. This keeps natural units intact whenever it can, falling back to brute force only when a single paragraph exceeds your size limit. It is the sensible default for most prose, and it's what you should build first.

3 · Semantic

Split where the meaning shifts. Walk through the text sentence by sentence, embed each one, and start a new chunk when consecutive sentences become dissimilar enough — a topic change. More expensive (you embed during chunking) and more sophisticated. It shines on documents that wander between topics without clean headings.
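A rough sketch of that walk, under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the sentence splitter is a naive regex, and the `threshold` of 0.1 is arbitrary. In a real pipeline you would pass your own embedding function as `embed` and tune the threshold on your corpus.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in embedding (assumption): a bag-of-words count vector, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_split(text, embed=toy_embed, threshold=0.1):
    """Start a new chunk when a sentence's similarity to the previous
    sentence drops below `threshold`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))   # topic shift: close the chunk
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sample text with one topic change (refunds, then shipping).
sample = ("Refunds are issued within 30 days. Refunds require the original receipt. "
          "Shipping takes five business days. Shipping is free over fifty dollars.")
for chunk in semantic_split(sample):
    print(repr(chunk))
```

Comparing only consecutive sentences is the simplest variant; it has no size cap, so a long single-topic stretch becomes one long chunk unless you add a limit.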

4 · Structural / document-aware

Use the document's own structure as the cut lines: Markdown headings, HTML sections, slide boundaries, code function definitions. This is why Chapter 02 insisted on preserving structure during prep — a heading is a chunk boundary the author drew for you. When your documents have good structure, this often beats everything else, because the author already grouped related ideas.
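For Markdown, a minimal sketch of heading-based cuts might look like the following. It assumes ATX-style headings (`#` through `######`) and does not guard against `#` lines inside code fences; `split_on_headings` is a name introduced here for illustration.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")

def split_on_headings(markdown_text):
    """Return (heading, body) sections; each heading stays attached to its body.
    Text before the first heading is returned with heading=None."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        if HEADING.match(line):
            # Close the section collected so far, if it has any content.
            if heading is not None or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading is not None or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections

# Hypothetical sample document.
md = ("# Returns\nItems can be returned within 30 days.\n\n"
      "## Exchanges\nExchanges are accepted within 60 days.\n")
for heading, body in split_on_headings(md):
    print(heading, "->", body)
```

Returning the heading alongside the body makes it easy to prepend the heading to the chunk text before embedding, and a section that is still too long can be handed to a size-based splitter afterwards.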

Figure: the same passage chunked four ways. Notice fixed-size is the only strategy that produces a chunk no human would ever write: "ys of purch" is not a unit of meaning, and it will embed as nonsense.

A recursive splitter you can use

Here is a compact recursive splitter with overlap. It tries to break on the biggest natural separator that keeps chunks under the size limit, and carries a little text across each boundary so context isn't lost at the seams.

def recursive_split(text, chunk_size=800, overlap=120,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text, preferring the largest natural boundary that
    keeps chunks under chunk_size. Overlap carries context across cuts.

    Sizes are in characters. Overlap is prepended after splitting, so a
    returned chunk can be up to overlap + 1 characters over chunk_size.
    The separator at a cut is dropped (a ". " cut loses its period).
    """
    def split(text, seps):
        # Base case: the text fits, or no finer separator is left
        # (an oversized piece is then returned as-is).
        if len(text) <= chunk_size or not seps:
            return [text]
        sep, *rest = seps
        chunks, buf = [], ""
        for part in text.split(sep):
            candidate = (buf + sep + part) if buf else part
            if len(candidate) <= chunk_size:
                buf = candidate          # keep packing parts into this chunk
                continue
            if buf:
                chunks.append(buf)       # flush the chunk built so far
            if len(part) <= chunk_size:
                buf = part               # part starts the next chunk
            else:
                # part itself is still too big: recurse with a finer separator
                chunks.extend(split(part, rest))
                buf = ""
        if buf:
            chunks.append(buf)
        return chunks

    # drop whitespace-only pieces so they never become empty chunks
    raw = [c for c in split(text, list(separators)) if c.strip()]
    # add overlap: prepend the tail of the previous chunk to each chunk
    out = []
    for i, c in enumerate(raw):
        if i > 0 and overlap:
            tail = raw[i - 1][-overlap:]
            c = tail + " " + c
        out.append(c.strip())
    return out

# read the document, closing the file when done
with open("policy.txt", encoding="utf-8") as f:
    doc = f.read()
chunks = recursive_split(doc, chunk_size=800, overlap=120)
print(f"{len(chunks)} chunks")
for i, c in enumerate(chunks[:2]):
    print(f"\n[chunk {i}] {len(c)} chars")
    print(c[:160], "...")

14 chunks

[chunk 0] 812 chars
Our refund policy allows returns within 30 days of purchase for a
full refund, provided the item is unused and in original packaging ...

[chunk 1] 798 chars
in original packaging. Exchanges are accepted within 60 days. To
start a return, visit your account page and select the order ...

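Notice chunk 1 begins with "in original packaging", the tail of chunk 0. Because the overlap is prepended after the split, a chunk can run up to overlap + 1 characters past chunk_size. That overlap is deliberate, and it's what the next section is about.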

What overlap is really for

Overlap exists to protect against the answer landing exactly on a cut line. Without it, a sentence like "To qualify, the item must be unused" could end one chunk while "and returned within 30 days" begins the next, and a query about the time limit might retrieve only the second half, missing the qualifying condition. Overlap repeats the text just before each cut at the start of the next chunk, so that information straddling a boundary survives intact in at least one chunk.

How much? The common reflex is 10–20% of chunk size, and that's a reasonable starting point. But more overlap is not free: it inflates your index (you store the overlapping text twice), and it can cause near-duplicate chunks to crowd your retrieval results. The honest guidance is to start at 10–15% and only increase it if you observe answers being cut off at boundaries. Overlap is a patch for the bluntness of cutting, not a feature to maximise.
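The index cost is simple arithmetic. A back-of-the-envelope sketch, assuming the prepend-style overlap used by the splitter above (each chunk after the first carries `overlap` repeated characters) and chunks that are close to `chunk_size`; `overlap_overhead` is a name introduced here for illustration.

```python
def overlap_overhead(chunk_size, overlap):
    """Approximate fraction of extra text stored and embedded when `overlap`
    characters are repeated at the start of every chunk after the first."""
    return overlap / chunk_size

for overlap in (80, 120, 200, 400):
    extra = overlap_overhead(800, overlap)
    print(f"chunk_size=800 overlap={overlap:>3}: about {extra:.0%} more text")
```

At 800-character chunks, moving from 10% to 50% overlap takes the extra text you embed and store from roughly a tenth of the corpus to roughly half of it.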

My take. Most teams over-overlap out of anxiety. If your chunks break on natural boundaries — paragraphs, sentences, headings — you need far less overlap than the "200 characters" advice suggests, because you're rarely cutting through the middle of an idea in the first place. Good boundaries reduce the need for overlap. Fix the boundaries before you crank the overlap.

Chunk size, measured

Rather than assert a magic number, here is the shape of what you'll see when you actually measure retrieval quality against chunk size on a typical prose corpus. The numbers below are illustrative of the pattern, not a universal result — your corpus will shift the peak — but the curve's shape is remarkably consistent: quality rises, plateaus, then falls.

Chunk size	Recall@5	Answer quality	What's happening
~200 chars	0.61	Low	Chunks too thin; context for the answer is split apart.
~500 chars	0.78	Good	Approaching the useful middle.
~800 chars	0.83	Best	The plateau — enough context, still sharp.
~1500 chars	0.80	Good	Still fine; embeddings starting to blur.
~3000 chars	0.67	Lower	One embedding averaging too many ideas; weak matches.

The lesson is not "use 800 characters." It is "there is a plateau, it is broad, and you find it by measuring — but you can stop fiddling once you're on it." Teams waste days optimising chunk size from 800 to 820. The curve is flat there. Get onto the plateau and spend your remaining effort on retrieval and reranking, where the gains are larger. We build the eval that produces a table like this in Chapter 11.
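Ahead of the full eval, here is a minimal sketch of how a Recall@5 figure could be computed. It rests on assumptions this chapter does not spell out: a query counts as a hit when any of its top-k chunks contains a hand-written answer snippet, and the data shapes and the name `recall_at_k` are illustrative. Matching on answer text rather than chunk IDs keeps the labels valid when you re-chunk at a different size.

```python
def recall_at_k(retrieved, answers, k=5):
    """Fraction of queries whose top-k chunks contain the labelled answer snippet.

    retrieved: {query: [chunk_text, ...]} ranked best-first
    answers:   {query: answer_snippet} written by hand
    """
    if not answers:
        return 0.0
    hits = sum(
        1 for query, snippet in answers.items()
        if any(snippet.lower() in chunk.lower()
               for chunk in retrieved.get(query, [])[:k])
    )
    return hits / len(answers)

# Hypothetical labelled queries and retrieval results.
retrieved = {
    "refund window?": ["Returns are accepted within 30 days.", "Shipping is free."],
    "exchange window?": ["Shipping is free.", "Contact support."],
}
answers = {"refund window?": "within 30 days", "exchange window?": "within 60 days"}
print(recall_at_k(retrieved, answers, k=5))
```

To sweep chunk size, rebuild the index at each size, collect `retrieved` for the same labelled queries, and record one number per size.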

Parent-child chunking: the best of both ends

There's a technique that sidesteps the small-versus-large tension instead of compromising on it. Embed small chunks for sharp matching, but when one is retrieved, hand the model the larger parent section it came from. You match precisely and answer richly. The small child is what the index searches; the big parent is what the model reads.

This requires storing the parent–child relationship as metadata (another reason Chapter 02 harped on metadata). It is one of the highest-return refinements in the whole field, and it costs little once your structure is preserved. Build naive first, but keep this in your pocket for when retrieval quality plateaus.
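The mechanics can be sketched in a few lines. Everything here is an illustrative assumption: children are groups of two sentences, each child record carries a `parent_id`, and ranking uses a toy word-overlap score where a real system would use vector search.

```python
import re

def words(text):
    """Lowercased word set, used only for the toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(parents, child_size=2):
    """parents: {parent_id: section_text}. Returns small child records,
    each remembering the parent section it was cut from."""
    children = []
    for parent_id, text in parents.items():
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for i in range(0, len(sentences), child_size):
            children.append({"parent_id": parent_id,
                             "text": " ".join(sentences[i:i + child_size])})
    return children

def retrieve_parents(query, children, parents, top_k=3):
    """Rank the small children, then return their parents, deduplicated."""
    q = words(query)
    ranked = sorted(children, key=lambda c: len(q & words(c["text"])), reverse=True)
    seen, out = set(), []
    for child in ranked[:top_k]:
        pid = child["parent_id"]
        if pid not in seen:          # several children may share one parent
            seen.add(pid)
            out.append(parents[pid])
    return out

# Hypothetical parent sections.
parents = {
    "returns": ("Items can be returned within 30 days. The item must be unused. "
                "Refunds go to the original payment method. "
                "Processing takes five business days."),
    "shipping": ("Orders ship within two business days. "
                 "Shipping is free over fifty dollars."),
}
children = build_index(parents)
print(retrieve_parents("how long does refund processing take", children, parents, top_k=1))
```

The deduplication step matters: without it, two children from the same section would send the model the same parent twice.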

When this fails

Cutting through tables and code. A table or a function split across two chunks is ruined — half a table is misleading, half a function won't run. Detect these regions during prep and keep them whole, even if they exceed your size limit. Structure beats size for these.
Orphaned references. A chunk that says "as described above, this voids the warranty" is useless when "above" is in a different chunk. Watch for chunks that lean on context outside themselves; parent-child or larger boundaries fix this.
The lonely heading. Structural chunking can produce a chunk that is just a heading with no body, or a body with its heading severed. Always attach the heading to the section it titles — the heading is often the most retrievable part.
Uniform size for non-uniform documents. Forcing 800-character chunks onto a corpus that mixes one-line FAQ answers with twenty-page contracts serves neither. Route by document type; let short documents be one chunk.
Over-overlapping into duplicates. Too much overlap fills your top-k with near-identical chunks, so the model sees the same paragraph five times and the actually-relevant second chunk gets pushed out. More overlap is not more safety past a point.
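As one example of guarding against the first failure, here is a sketch that keeps tables whole while packing ordinary paragraphs by size. It assumes pipe-style tables (every line starts with `|`) that are separated from prose by blank lines; `split_keeping_tables` is a hypothetical helper, and oversized prose paragraphs are left for a finer splitter to handle.

```python
def split_keeping_tables(text, chunk_size=800):
    """Pack paragraphs into chunks of up to chunk_size characters, but never
    cut inside a pipe table: a table becomes its own chunk even if oversized."""
    def is_table(block):
        lines = block.splitlines()
        return bool(lines) and all(line.lstrip().startswith("|") for line in lines)

    chunks, buf = [], ""
    for block in (b.strip() for b in text.split("\n\n")):
        if not block:
            continue
        if is_table(block):
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.append(block)            # kept whole, whatever its length
        elif buf and len(buf) + 2 + len(block) > chunk_size:
            chunks.append(buf)              # current chunk is full
            buf = block
        else:
            buf = f"{buf}\n\n{block}" if buf else block
    if buf:
        chunks.append(buf)
    return chunks

# Hypothetical sample: the table is longer than the size limit but stays intact.
sample_doc = ("Intro paragraph.\n\n"
              "| plan | price |\n| --- | --- |\n| basic | 10 |\n\n"
              "Closing paragraph.")
for chunk in split_keeping_tables(sample_doc, chunk_size=30):
    print(repr(chunk))
```

The same pattern extends to code: detect the protected region first, emit it as a single unit, and apply size limits only to what is left.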

Practice — before you read the next chapter

Chunk one real document three ways

Take a document from your domain and split it with fixed-size, recursive, and (if it has headings) structural strategies. Read the resulting chunks. Which strategy produced chunks you could hand to a colleague and have them understand in isolation? That "understandable alone" test is the practical heart of good chunking.

Find your plateau

Once you've read Chapter 11 and can measure recall, sweep chunk size across 300, 600, 900, 1500 on your own corpus and plot recall@5. Find your plateau and stop. Resist the urge to optimise within the flat region — that effort is better spent downstream.

Spot the orphans

Scan a sample of your chunks for words like "above," "below," "this," "the previous," "as mentioned." Each is a chunk leaning on context it may no longer have. Count them. A high count means your boundaries are cutting through connected ideas, and parent-child retrieval is probably worth building.
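A quick way to run that scan, sketched with a regular expression over the cue words listed above. The name `orphan_report` is illustrative, and a word like "this" will produce plenty of false positives, so treat the count as a triage signal and not a verdict.

```python
import re

# Cue phrases taken from the exercise text.
ORPHAN_CUES = re.compile(
    r"\b(above|below|this|the previous|as mentioned)\b",
    re.IGNORECASE,
)

def orphan_report(chunks):
    """Return (chunk_index, cues_found) for every chunk containing a cue phrase."""
    report = []
    for i, chunk in enumerate(chunks):
        cues = [m.group(0).lower() for m in ORPHAN_CUES.finditer(chunk)]
        if cues:
            report.append((i, cues))
    return report

# Hypothetical sample chunks.
sample_chunks = [
    "Returns are accepted within 30 days.",
    "As described above, this voids the warranty.",
]
report = orphan_report(sample_chunks)
print(f"{len(report)} of {len(sample_chunks)} chunks lean on outside context")
for index, cues in report:
    print(index, cues)
```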

Takeaways

The chunk is the unit of retrieval. How you cut decides what is findable — before any model or database is involved.
The core tension: small chunks match sharply but lack context; large chunks hold context but match weakly. Aim for the broad middle.
Four strategies: fixed (avoid), recursive (sensible default), semantic (for wandering text), structural (best when documents have good structure).
Overlap protects against answers landing on a cut line. Start at 10–15%; good boundaries reduce how much you need. Don't over-overlap into duplicates.
There's a broad plateau in chunk size — find it by measuring, then stop. Parent-child chunking (embed small, return large) sidesteps the tension entirely.
Keep tables and code whole; attach headings to their sections; route by document type. Uniform chunking on non-uniform documents serves no one.

Next chapter: Embeddings — picking, evaluating, migrating. Your clean, well-cut chunks now need to become vectors. How do you choose an embedding model, prove it's good for your domain, and survive the day you have to switch to a better one?
