Code is text, so it's tempting to throw it at the Wave 1 pipeline unchanged. Don't. Chunking source code the way you chunk prose is one of the most destructive things you can do in RAG — it slices through the middle of functions, severs a function from its signature, splits a class from the method that gives it meaning, and strips away the imports that say what every symbol even refers to. The result retrieves beautifully and answers uselessly. Code has structure that prose doesn't, and retrieval over a codebase is its own discipline that respects that structure. This chapter is that discipline.
Recall from Chapter 03 that the chunk is the unit of retrieval and a chunk should be understandable on its own. A function is exactly such a unit: a self-contained piece of meaning a developer can read and reason about. Recursive character chunking is blind to that. Set it to 800 characters and it will cheerfully cut a 1,200-character function in half, producing one chunk with the function's opening and logic but no return, and another with the tail end and no signature. Neither half is runnable, neither half is understandable, and the embedding of each is a smear of half-an-idea. You've taken the one natural unit code offers and destroyed it on a character count.
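To see the cut happen, here is a minimal sketch. It assumes a naive fixed-size splitter (`fixed_size_chunks`, a hypothetical helper) as a stand-in for recursive character chunking; a real recursive splitter prefers separators such as newlines, but it still enforces the size cap. The sample function is invented for the demonstration.

```python
def fixed_size_chunks(text, size):
    """Naive stand-in for a character splitter: cut every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample function, here only so we can watch it get cut.
SAMPLE = '''def charge_card(card, amount_cents):
    """Charge a card and return a receipt id."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    token = card.tokenize()
    receipt = gateway.send(token, amount_cents)
    return receipt.id
'''

size = 120  # deliberately smaller than the function
chunks = fixed_size_chunks(SAMPLE, size)

# Which chunk does the signature start in, and which one holds the return?
sig_chunk = SAMPLE.index("def charge_card") // size
ret_chunk = SAMPLE.index("return receipt.id") // size

print(f"{len(chunks)} chunks: signature starts in chunk {sig_chunk}, "
      f"return starts in chunk {ret_chunk}")
```

The signature and the return statement land in different chunks, so no single retrieved chunk can show what the function takes and what it gives back.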
The fix is to parse the code into an Abstract Syntax Tree — the structured representation every compiler and IDE already builds — and chunk on its boundaries: one chunk per function, per method, per class. The AST knows exactly where a function starts and ends, so your chunks align with the units a developer actually thinks in. Most languages have a parser you can call; Python ships one in the standard library. Here is an AST chunker that splits a Python file into function- and class-level chunks, each carrying its full signature and the file's imports so it stands alone.
```python
import ast
from pathlib import Path


def chunk_python(source, filepath):
    """Split a Python file into one chunk per top-level function/class,
    each prefixed with the file's imports so it's self-contained."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # collect import statements once; every chunk needs them for context.
    # get_source_segment keeps multi-line imports whole (Python 3.8+).
    imports = [ast.get_source_segment(source, n) for n in ast.walk(tree)
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    header = "\n".join(imports)

    chunks = []
    for node in tree.body:  # top-level nodes only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # decorators sit above node.lineno, so start at the first one
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start = first - 1
            end = node.end_lineno  # AST gives us exact bounds
            body = "\n".join(lines[start:end])
            # prepend imports as context; tag with metadata (Ch02 habit)
            chunk_text = f"# from {filepath}\n{header}\n\n{body}"
            chunks.append({
                "text": chunk_text,
                "symbol": node.name,
                "kind": type(node).__name__,
                "filepath": filepath,
                "lines": (start + 1, end),
            })
    return chunks


src = Path("payments.py").read_text()
for c in chunk_python(src, "payments.py"):
    print(f"{c['kind']:12} {c['symbol']:20} lines {c['lines']}")
```
```
FunctionDef  charge_card          lines (8, 14)
FunctionDef  refund               lines (16, 23)
ClassDef     PaymentGateway       lines (25, 61)
```
Each chunk is now a whole function or class, prefixed with the imports that define its symbols, tagged with its name, kind, file, and line range. A query like "how do we charge a card" retrieves the entire charge_card function — signature, validation, and return — as one coherent unit, and the metadata lets the answer cite payments.py lines 8–14 exactly. This is the same metadata discipline from Chapter 02, paying off again.
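A class chunk can grow large, and the chunking rule includes one chunk per method. A minimal sketch of that finer granularity follows, assuming each method chunk should carry its class header line for context. `chunk_methods` and the sample class are hypothetical names for illustration, not part of the chunker above.

```python
import ast


def chunk_methods(source):
    """Return one chunk per method, prefixed with its class header line."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for cls in tree.body:
        if not isinstance(cls, ast.ClassDef):
            continue
        # First line of the class statement, e.g. "class PaymentGateway:"
        class_header = lines[cls.lineno - 1]
        for node in cls.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Decorators sit above node.lineno, so start at the first one.
                start = min([node.lineno] + [d.lineno for d in node.decorator_list])
                body = "\n".join(lines[start - 1:node.end_lineno])
                chunks.append({
                    "text": f"{class_header}\n    # ...\n{body}",
                    "symbol": f"{cls.name}.{node.name}",
                    "lines": (start, node.end_lineno),
                })
    return chunks


# Hypothetical sample class for the demonstration.
SAMPLE = '''class PaymentGateway:
    def __init__(self, client):
        self.client = client

    @staticmethod
    def validate(amount_cents):
        return amount_cents > 0
'''

for c in chunk_methods(SAMPLE):
    print(c["symbol"], c["lines"])
# PaymentGateway.__init__ (2, 3)
# PaymentGateway.validate (5, 7)
```

Whether to index the class chunk, the method chunks, or both is a design choice worth testing against your own queries.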
Even a whole function is often not enough on its own. charge_card calls gateway.send() — to truly understand it, you may need to know what gateway is and what send does. Code is a graph of references: definitions, calls, imports, inheritance. The richest code-RAG systems capture some of that graph, so that retrieving a function can pull in the definitions of the key symbols it depends on. At minimum, prepend imports (done above). Going further, you resolve the most important called symbols and include their signatures. This is the bridge to the next chapter — when the relationships between chunks matter as much as the chunks themselves, you're edging toward graph RAG.
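As a rough sketch of "include their signatures", the snippet below resolves only plain-name calls to functions defined in the same file. That restriction is an assumption made to keep it short: attribute calls such as gateway.send() need cross-file resolution, which this does not attempt. `called_signatures` and the sample source are hypothetical.

```python
import ast


def called_signatures(source, func_name):
    """Signatures of same-file functions that `func_name` calls directly."""
    tree = ast.parse(source)
    lines = source.splitlines()
    defs = {n.name: n for n in tree.body
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))}

    called = set()
    for node in ast.walk(defs[func_name]):
        # Only plain-name calls like validate(x); attribute calls are skipped.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in defs and node.func.id != func_name:
                called.add(node.func.id)

    signatures = []
    for name in sorted(called, key=lambda n: defs[n].lineno):
        d = defs[name]
        # The signature runs from the def line to just before the body starts.
        sig_end = max(d.lineno, d.body[0].lineno - 1)
        signatures.append("\n".join(lines[d.lineno - 1:sig_end]).rstrip())
    return signatures


# Hypothetical sample module for the demonstration.
SAMPLE = '''def validate(amount_cents):
    return amount_cents > 0

def to_token(card,
             salt=""):
    return card + salt

def charge_card(card, amount_cents):
    if not validate(amount_cents):
        raise ValueError("bad amount")
    return to_token(card)
'''

signatures = called_signatures(SAMPLE, "charge_card")
for sig in signatures:
    print(sig)
```

Appending these signatures to the charge_card chunk gives the reader, and the model, the shape of each dependency without pulling in whole function bodies.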
General-purpose embedding models from Chapter 04 understand code surprisingly well, because they were trained on plenty of it. Code-specific embedding models do better on code-to-code search and on matching natural-language questions to implementations, because they're tuned for it. The honest guidance is the same as always: run the fifty-question bake-off from Chapter 04, but with code questions and code chunks. If a general model already nails your retrieval, you've saved yourself a dependency. If your queries are "find the function that does X" and a code model measurably wins, switch. Measure; don't assume the specialised model is worth it.
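The bake-off itself needs very little machinery. Here is a minimal sketch of a recall@k harness, assuming each retriever is exposed as a `search(query, k)` function returning ranked chunk ids. The two toy retrievers below are hypothetical stand-ins for "model A" and "model B"; they rank by token overlap and say nothing about how real embedding models compare. The chunks and questions are invented too.

```python
import re


def recall_at_k(search, labelled_queries, k=5):
    """Fraction of queries whose expected chunk id appears in the top k."""
    hits = sum(1 for query, expected in labelled_queries
               if expected in search(query, k)[:k])
    return hits / len(labelled_queries)


# Hypothetical chunks and labelled questions (use your own fifty).
CHUNKS = {
    "payments.charge_card": "def charge_card(card, amount_cents): ...",
    "payments.refund": "def refund(receipt_id): ...",
    "config.parse_config": "def parse_config(path): ...",
}
QUERIES = [
    ("charge a card", "payments.charge_card"),
    ("refund a receipt", "payments.refund"),
    ("parse the config file", "config.parse_config"),
]


def make_overlap_search(tokenize):
    """Toy retriever: rank chunks by how many tokens they share with the query."""
    def search(query, k):
        q = set(tokenize(query))
        scores = {cid: len(q & set(tokenize(text))) for cid, text in CHUNKS.items()}
        ranked = sorted((c for c in scores if scores[c] > 0),
                        key=lambda c: -scores[c])
        return ranked[:k]
    return search


candidates = {
    "whitespace tokens": make_overlap_search(str.split),
    "identifier-aware tokens": make_overlap_search(
        lambda text: re.findall(r"[a-z]+", text.lower())),
}

results = {name: recall_at_k(search, QUERIES, k=1)
           for name, search in candidates.items()}
for name, score in results.items():
    print(f"{name:25} recall@1 = {score:.2f}")
```

Swap the toy retrievers for real ones backed by each candidate embedding model, keep the questions and k fixed, and the comparison is apples to apples.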
| Decision | Prose-RAG default | Code-RAG choice |
|---|---|---|
| Chunking | Recursive on text separators | AST boundaries — function/class |
| Chunk context | Overlap | Imports + key symbol signatures |
| Embeddings | General text model | Test a code model; keep general if it wins |
| Retrieval | Hybrid (semantic + BM25) | Hybrid especially — exact symbol names matter |
| Metadata | Source, page | File, symbol name, line range, language |
Note the retrieval row: hybrid search from Chapter 06 is even more valuable for code than for prose, because so many code queries are exact-symbol lookups — "where is charge_card defined" — which is precisely the BM25 strength and the vector blind spot. Code is full of the rare exact tokens that make the keyword side earn its place.
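To make the exact-symbol point concrete, here is a minimal sketch of fusing a keyword ranking with a vector ranking. Several things are assumptions: a crude exact-identifier matcher stands in for BM25, the vector ranking is hard-coded to mimic a search that drifted, and reciprocal rank fusion is one common way to combine rankings, not necessarily the method from Chapter 06. All names and data are hypothetical.

```python
import re

IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"


def keyword_rank(query, chunks):
    """Rank chunk ids by how many query tokens appear as whole identifiers."""
    q_tokens = set(re.findall(IDENTIFIER, query))
    scores = {}
    for cid, text in chunks.items():
        overlap = len(q_tokens & set(re.findall(IDENTIFIER, text)))
        if overlap:
            scores[cid] = overlap
    return sorted(scores, key=lambda c: -scores[c])


def rrf(rankings, k=60):
    """Reciprocal rank fusion: each ranking adds 1 / (k + rank) per chunk."""
    fused = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda c: -fused[c])


# Hypothetical chunks for the demonstration.
CHUNKS = {
    "config.parse_config": "def parse_config(path): ...",
    "config.load_settings": "def load_settings(path): ...",
    "cli.parse_args": "def parse_args(argv): ...",
}
query = "where is parse_config defined"

# Hard-coded stand-in for a vector search that drifted to a similar function.
vector_ranking = ["config.load_settings", "config.parse_config", "cli.parse_args"]

keyword_ranking = keyword_rank(query, CHUNKS)
fused = rrf([keyword_ranking, vector_ranking])
print(fused[0])  # config.parse_config
```

The keyword side matches only the chunk that contains the exact token parse_config, and that one extra vote is enough to lift it above the vector side's top pick.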
My take. The single highest-leverage change for code RAG is AST chunking — it's the code equivalent of the structural chunking that won in Chapter 03, and the win is even larger because code's structure is unambiguous (a parser, not a heuristic, finds the boundaries). Get that right before reaching for code-specific embedding models or symbol graphs. A system with AST chunks and a general embedder beats one with character chunks and the fanciest code model, because no embedding recovers a function that was cut in half.
A real codebase is hundreds of thousands of interconnected symbols, and the answer to "how does authentication work here" is rarely one function — it's a flow across several files. Two implications. First, retrieval depth and reranking (Chapter 07) matter more, because you often need several related chunks, not one. Second, freshness is brutal: code changes constantly, and a stale index points developers at deleted or refactored functions. Re-index on commit, not on a schedule — an index that lags the codebase by a day is worse than no index, because it confidently returns code that no longer exists.
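Re-indexing on commit does not have to mean re-embedding everything. A minimal sketch, assuming the index keeps a manifest of content digests per file: compare the manifest with the current tree and touch only what changed. `plan_reindex` and the file contents are hypothetical, and wiring this to a commit hook or CI step is left out.

```python
import hashlib


def file_digest(text):
    """Content hash used to detect whether a file changed since indexing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed, current_files):
    """Decide what to re-chunk and what to drop from the index.

    indexed:       {path: digest} saved by the previous index run
    current_files: {path: source text} as of the new commit
    """
    current = {path: file_digest(text) for path, text in current_files.items()}
    # New or modified files: their chunks must be rebuilt and re-embedded.
    to_reindex = sorted(p for p, d in current.items() if indexed.get(p) != d)
    # Files that no longer exist: their chunks must be deleted.
    to_delete = sorted(p for p in indexed if p not in current)
    return to_reindex, to_delete


# Hypothetical state: one file edited, one added, one removed.
indexed = {
    "payments.py": file_digest("def charge_card(): ..."),
    "legacy.py": file_digest("def old(): ..."),
}
current_files = {
    "payments.py": "def charge_card(card): ...",
    "auth.py": "def login(): ...",
}

to_reindex, to_delete = plan_reindex(indexed, current_files)
print("re-index:", to_reindex)  # re-index: ['auth.py', 'payments.py']
print("delete:  ", to_delete)   # delete:   ['legacy.py']
```

The delete list matters as much as the re-index list: it is what stops the system from returning code that no longer exists.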
A query for "parse_config" is an exact-token query; vector search drifts to vaguely similar functions. Use hybrid so the keyword side nails the exact name.

Take a source file from your codebase and run the AST chunker (or your language's equivalent parser). Inspect the chunks: is each a whole, understandable function or class? Compare against what recursive character chunking at 800 characters would have produced on the same file. The difference is the chapter in one experiment.
Pick an exact function name in your codebase and query for it with pure vector search, then with hybrid. Watch the vector-only search return plausible-but-wrong functions while hybrid nails the exact definition. This shows why hybrid is non-negotiable for code.
Ask your code-RAG system a flow question — "how does a user request get authenticated?" — and see whether single-function retrieval can answer it. Where it falls short is precisely the gap graph RAG fills, which sets up the next chapter.
Next chapter: Graph RAG — when graphs beat vectors. Some questions are about relationships — "who reports to whom," "what depends on this," "how is A connected to C." Vector similarity can't answer those. We'll see when a knowledge graph beats a vector index, and the honest cost of building one.