If you remember one decision from this entire series, make it this one. The chunk is the unit of retrieval — the smallest piece your system can fetch and hand to the model. How you cut your documents into chunks decides, before any embedding model or vector database gets involved, what is even findable. Cut badly and the answer to a user's question gets split across two chunks so that neither one, alone, can answer it. No reranker saves you from that. No bigger model saves you from that. The information was destroyed at the knife.
This is also the step most tutorials wave at ("split into 1000 characters with 200 overlap") before moving on. That advice is not wrong so much as unexamined. This chapter examines it: what the strategies actually are, how they compare when measured, and how to choose for your documents rather than copying a number off a blog.
Every chunking decision is a single trade-off wearing different costumes. Small chunks are precise but starved of context: a two-sentence chunk embeds cleanly and matches a query sharply, but it may not contain enough surrounding information to actually answer anything. Large chunks are rich but blurry: a full-page chunk holds plenty of context, but its embedding is an average of many ideas, so it matches everything weakly and nothing strongly — and it burns tokens when you send it to the model.
There is no universal best chunk size, and anyone who gives you one without asking what your documents look like is guessing. Dense legal text wants different cuts than chatty support articles, which want different cuts than API reference docs. What's universal is the shape of the trade-off, and the obligation to measure where your documents sit on it.
Fixed-size chunking cuts every N characters (or tokens), regardless of what the text is saying. It is simple, fast, and brutal: it will happily slice through the middle of a sentence, a word, or a number. It is the strawman the other strategies improve on. Its one virtue is predictability; its vice is that it respects nothing about meaning.
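A minimal sketch of fixed-size chunking, assuming character counts rather than tokens. The name `fixed_size_split` is illustrative and not taken from any library.

```python
def fixed_size_split(text: str, chunk_size: int = 1000) -> list[str]:
    """Cut text every chunk_size characters, ignoring word and sentence boundaries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Refunds are accepted within 30 days of purchase."
for piece in fixed_size_split(sample, chunk_size=20):
    print(repr(piece))
# 'Refunds are accepted'
# ' within 30 days of p'
# 'urchase.'
```

The second cut lands inside the word "purchase", which is exactly the kind of damage the other strategies try to avoid.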
Recursive chunking splits on a priority list of separators (paragraphs first, then sentences, then words), descending to a finer separator only when a chunk is still too big. This keeps natural units intact whenever it can, falling back to brute force only when a single paragraph exceeds your size limit. It is the sensible default for most prose, and it's what you should build first.
Semantic chunking splits where the meaning shifts. Walk through the text sentence by sentence, embed each one, and start a new chunk when consecutive sentences become dissimilar enough, which signals a topic change. It is more expensive (you embed during chunking) and more sophisticated. It shines on documents that wander between topics without clean headings.
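The walk described above can be sketched as follows. This is an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the `0.1` threshold is arbitrary and only makes sense for that toy, and sentence splitting is a simple regex. With a real model you would swap in its vectors and tune the threshold on your own corpus.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(text: str, threshold: float = 0.1, embed=toy_embed) -> list[str]:
    """Start a new chunk when consecutive sentences fall below the threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:  # topic change: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks


text = ("Refunds are issued within 30 days. Refunds require the original receipt. "
        "Shipping takes five business days. Shipping is free on large orders.")
for chunk in semantic_split(text):
    print(chunk)
```

On this sample the two refund sentences stay together and the two shipping sentences form a second chunk, because the only pair of neighbours with no shared vocabulary is the one that crosses the topic change.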
Structural chunking uses the document's own structure as the cut lines: Markdown headings, HTML sections, slide boundaries, function definitions in code. This is why Chapter 02 insisted on preserving structure during prep: a heading is a chunk boundary the author drew for you. When your documents have good structure, this often beats everything else, because the author already grouped related ideas.
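For Markdown input, cutting on headings can be sketched like this. It is a simplified illustration: the function name and the `path`/`text` keys are assumptions made for the example, and the regex only recognises `#`-style headings, so a `#` line inside a fenced code block would be mistaken for a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per heading section, tagged with its heading path."""
    chunks, stack, body = [], [], []  # stack holds (level, title) pairs

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in stack)
            chunks.append({"path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that was open before this heading
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Returns
Items can be returned within 30 days.
## Exchanges
Exchanges are accepted within 60 days.
# Shipping
Orders ship in two days."""

for chunk in split_by_headings(doc):
    print(chunk["path"], "|", chunk["text"])
```

Keeping the heading path with each chunk gives you metadata to filter on or to prepend to the chunk text before embedding.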
Here is a compact recursive splitter with overlap. It tries to break on the biggest natural separator that keeps chunks under the size limit, and carries a little text across each boundary so context isn't lost at the seams.
```python
def recursive_split(text, chunk_size=800, overlap=120,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text, preferring the largest natural boundary that keeps
    each piece under chunk_size. Overlap is prepended afterwards, so a
    chunk can exceed chunk_size by up to overlap + 1 characters."""

    def split(text, seps):
        if len(text) <= chunk_size:
            return [text]
        if not seps:
            # No separators left: fall back to a hard cut every chunk_size chars.
            return [text[i:i + chunk_size]
                    for i in range(0, len(text), chunk_size)]
        sep, *rest = seps
        chunks, buf = [], ""
        for part in text.split(sep):
            candidate = (buf + sep + part) if buf else part
            if len(candidate) <= chunk_size:
                buf = candidate
                continue
            if buf:
                chunks.append(buf)
            if len(part) <= chunk_size:
                buf = part
            else:
                # part alone is still too big: recurse with a finer separator
                chunks.extend(split(part, rest))
                buf = ""
        if buf:
            chunks.append(buf)
        return chunks

    raw = split(text, list(separators))
    # Add overlap: prepend the tail of the previous chunk to each chunk.
    out = []
    for i, c in enumerate(raw):
        if i > 0 and overlap:
            c = raw[i - 1][-overlap:] + " " + c
        out.append(c.strip())
    return out


with open("policy.txt", encoding="utf-8") as f:
    doc = f.read()

chunks = recursive_split(doc, chunk_size=800, overlap=120)
print(f"{len(chunks)} chunks")
for i, c in enumerate(chunks[:2]):
    print(f"\n[chunk {i}] {len(c)} chars")
    print(c[:160], "...")
```
```
14 chunks

[chunk 0] 812 chars
Our refund policy allows returns within 30 days of purchase for a
full refund, provided the item is unused and in original packaging ...

[chunk 1] 798 chars
in original packaging. Exchanges are accepted within 60 days. To
start a return, visit your account page and select the order ...
```
Notice chunk 1 begins with "in original packaging" — the tail of chunk 0. That overlap is deliberate, and it's what the next section is about.
Overlap exists to protect against the answer landing exactly on a cut line. Without it, a sentence like "To qualify, the item must be unused" could end one chunk while "and returned within 30 days" begins the next, and a query about the time limit might retrieve only the second half, missing the qualifying condition. Overlap repeats a little text across every cut so that information straddling a boundary survives intact in at least one chunk.
How much? The common reflex is 10–20% of chunk size, and that's a reasonable starting point. But more overlap is not free: it inflates your index (you store the overlapping text twice), and it can cause near-duplicate chunks to crowd your retrieval results. The honest guidance is to start at 10–15% and only increase it if you observe answers being cut off at boundaries. Overlap is a patch for the bluntness of cutting, not a feature to maximise.
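To put a rough number on "inflates your index", here is a back-of-the-envelope sketch. It assumes every chunk is full and that overlap is prepended to every chunk except the first; the function name is illustrative, and real corpora with ragged chunk lengths will differ.

```python
def index_inflation(num_chunks: int, chunk_size: int, overlap: int) -> float:
    """Stored characters with overlap divided by stored characters without it."""
    if num_chunks <= 0 or chunk_size <= 0:
        raise ValueError("num_chunks and chunk_size must be positive")
    base = num_chunks * chunk_size
    extra = (num_chunks - 1) * overlap  # the first chunk carries no overlap
    return (base + extra) / base


for overlap in (80, 120, 200, 400):
    print(overlap, round(index_inflation(14, 800, overlap), 3))
# 80 1.093
# 120 1.139
# 200 1.232
# 400 1.464
```

Under these assumptions the inflation is close to the overlap percentage itself, so 15% overlap means storing and embedding roughly 14% more text.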
My take. Most teams over-overlap out of anxiety. If your chunks break on natural boundaries — paragraphs, sentences, headings — you need far less overlap than the "200 characters" advice suggests, because you're rarely cutting through the middle of an idea in the first place. Good boundaries reduce the need for overlap. Fix the boundaries before you crank the overlap.
Rather than assert a magic number, here is the shape of what you'll see when you actually measure retrieval quality against chunk size on a typical prose corpus. The numbers below are illustrative of the pattern, not a universal result — your corpus will shift the peak — but the curve's shape is remarkably consistent: quality rises, plateaus, then falls.
| Chunk size | Recall@5 | Answer quality | What's happening |
|---|---|---|---|
| ~200 chars | 0.61 | Low | Chunks too thin; context for the answer is split apart. |
| ~500 chars | 0.78 | Good | Approaching the useful middle. |
| ~800 chars | 0.83 | Best | The plateau — enough context, still sharp. |
| ~1500 chars | 0.80 | Good | Still fine; embeddings starting to blur. |
| ~3000 chars | 0.67 | Lower | One embedding averaging too many ideas; weak matches. |
The lesson is not "use 800 characters." It is "there is a plateau, it is broad, and you find it by measuring — but you can stop fiddling once you're on it." Teams waste days optimising chunk size from 800 to 820. The curve is flat there. Get onto the plateau and spend your remaining effort on retrieval and reranking, where the gains are larger. We build the eval that produces a table like this in Chapter 11.
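For reference, the Recall@5 column can be computed with something as small as the sketch below. It assumes relevance is labelled by chunk id, one set of relevant ids per query; the function names and the sample ids are made up for illustration, and the eval built in Chapter 11 may define relevance differently.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(results: dict[str, list[str]],
                     gold: dict[str, set[str]], k: int = 5) -> float:
    """Average recall@k over every query that has gold labels."""
    scores = [recall_at_k(results[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)


# Hypothetical retrieval results (ranked chunk ids) and gold labels.
results = {
    "q1": ["c3", "c7", "c1", "c9", "c2", "c4"],
    "q2": ["c5", "c8", "c6", "c0", "c2"],
}
gold = {"q1": {"c1", "c4"}, "q2": {"c5"}}

print(mean_recall_at_k(results, gold, k=5))  # 0.75
```

Here `q1` finds one of its two relevant chunks in the top five (`c4` is ranked sixth) and `q2` finds its only one, giving a mean of 0.75. Because chunk ids change whenever you re-chunk, labels for a chunk-size sweep are usually tied to passages in the source documents rather than to ids.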
There's a technique that sidesteps the small-versus-large tension instead of compromising on it. Embed small chunks for sharp matching, but when one is retrieved, hand the model the larger parent section it came from. You match precisely and answer richly. The small child is what the index searches; the big parent is what the model reads.
This requires storing the parent–child relationship as metadata (another reason Chapter 02 harped on metadata). It is one of the highest-return refinements in the whole field, and it costs little once your structure is preserved. Build naive first, but keep this in your pocket for when retrieval quality plateaus.
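The parent-child pattern can be sketched in a few lines. Everything here is an assumption made for illustration: children are single sentences, `toy_score` counts shared words as a stand-in for vector similarity, and the index is a plain list rather than a vector database.

```python
import re
from collections import Counter


def build_children(parents: dict[str, str]) -> list[dict]:
    """Cut each parent section into sentence-sized children that keep a parent_id."""
    children = []
    for parent_id, text in parents.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                children.append({"parent_id": parent_id, "text": sentence})
    return children


def toy_score(query: str, text: str) -> int:
    """Stand-in for vector similarity: number of shared lowercase words."""
    q = Counter(re.findall(r"[a-z0-9]+", query.lower()))
    t = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum((q & t).values())


def retrieve_parents(query: str, children: list[dict],
                     parents: dict[str, str], k: int = 3) -> list[str]:
    """Match against the small children, return their de-duplicated parents."""
    ranked = sorted(children, key=lambda c: toy_score(query, c["text"]), reverse=True)
    seen, out = set(), []
    for child in ranked[:k]:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            out.append(parents[child["parent_id"]])
    return out


parents = {
    "returns": ("Returns are accepted within 30 days. The item must be unused. "
                "Refunds go to the original payment method."),
    "shipping": "Orders ship within two business days. Shipping is free above fifty dollars.",
}
children = build_children(parents)
query = "How many days do I have to return an unused item?"
print(retrieve_parents(query, children, parents, k=1))
```

The best-matching child is "The item must be unused.", which on its own says nothing about the time limit; returning its parent section hands the model the 30-day rule as well.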
Take a document from your domain and split it with fixed-size, recursive, and (if it has headings) structural strategies. Read the resulting chunks. Which strategy produced chunks you could hand to a colleague and have them understand in isolation? That "understandable alone" test is the practical heart of good chunking.
Once you've read Chapter 11 and can measure recall, sweep chunk size across 300, 600, 900, 1500 on your own corpus and plot recall@5. Find your plateau and stop. Resist the urge to optimise within the flat region — that effort is better spent downstream.
Scan a sample of your chunks for words like "above," "below," "this," "the previous," "as mentioned." Each is a chunk leaning on context it may no longer have. Count them. A high count means your boundaries are cutting through connected ideas, and parent-child retrieval is probably worth building.
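A quick way to run this scan is a regex count over your chunks. The pattern below uses only the phrases listed above; the function name and the returned keys are illustrative. Note that "this" is very common in ordinary prose, so treat the count as a rough signal and read a few flagged chunks by hand.

```python
import re

DANGLING = re.compile(
    r"\b(above|below|this|the previous|as mentioned)\b", re.IGNORECASE
)


def count_dangling(chunks: list[str]) -> dict:
    """Count context-dependent phrases per chunk and summarise them."""
    per_chunk = [len(DANGLING.findall(c)) for c in chunks]
    flagged = sum(1 for n in per_chunk if n)
    return {
        "total": sum(per_chunk),
        "flagged_chunks": flagged,
        "share": flagged / len(chunks) if chunks else 0.0,
    }


sample_chunks = [
    "As mentioned above, the fee is waived.",
    "Returns are accepted within 30 days.",
    "This applies to the previous plan only.",
]
print(count_dangling(sample_chunks))
```

On the three sample chunks this reports four matches across two flagged chunks.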
Next chapter: Embeddings — picking, evaluating, migrating. Your clean, well-cut chunks now need to become vectors. How do you choose an embedding model, prove it's good for your domain, and survive the day you have to switch to a better one?