RAG: A Field Manual for Building LLM Systems That Use Your Data

RAG for code — AST-aware, symbol-aware, repo-scale

Code is text, so it's tempting to throw it at the Wave 1 pipeline unchanged. Don't. Chunking source code the way you chunk prose is one of the most destructive things you can do in RAG — it slices through the middle of functions, severs a function from its signature, splits a class from the method that gives it meaning, and strips away the imports that say what every symbol even refers to. The result retrieves beautifully and answers uselessly. Code has structure that prose doesn't, and retrieval over a codebase is its own discipline that respects that structure. This chapter is that discipline.

What you'll take away from this chapter

Why character-based chunking is uniquely catastrophic for code
AST-aware chunking — splitting on function and class boundaries instead of byte counts
Why a code chunk needs its context — imports, signatures, the symbols it references
Whether you need a code-specific embedding model, and when a general one is fine
The repo-scale problem: a function alone often can't answer; you need its neighbours

Why prose chunking destroys code

Recall from Chapter 03 that the chunk is the unit of retrieval and a chunk should be understandable on its own. A function is exactly such a unit: a self-contained piece of meaning a developer can read and reason about. Recursive character chunking is blind to that. Set it to 800 characters and it will cheerfully cut a 1,200-character function in half, producing one chunk with the function's signature and opening logic but no return, and another with the tail end and no signature. Neither half is runnable, neither half is understandable, and the embedding of each is a smear of half-an-idea. You've taken the one natural unit code offers and destroyed it on a character count.
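To make the damage concrete, here is a minimal sketch that cuts a small function at a fixed character count and then extracts the same function by its AST bounds. The function body, the 120-character chunk size, and the helper name char_chunks are illustrative assumptions, not part of the Chapter 03 pipeline.

```python
import ast

# Illustrative function; any source file would show the same effect.
SOURCE = '''def charge_card(card, amount):
    """Charge a card and return a receipt id."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    token = card.tokenize()
    receipt = {"token": token, "amount": amount}
    return receipt
'''

def char_chunks(text, size):
    """Naive fixed-size chunking: cut every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

pieces = char_chunks(SOURCE, 120)
for i, piece in enumerate(pieces):
    print(i,
          "| has signature:", "def charge_card" in piece,
          "| has return:", "return receipt" in piece)

# The AST knows where the function really starts and ends.
func = ast.parse(SOURCE).body[0]
whole = ast.get_source_segment(SOURCE, func)
print("AST chunk has both:",
      "def charge_card" in whole and "return receipt" in whole)
```

No character chunk holds both the signature and the return statement, while the AST segment holds both. That is the whole argument in a dozen lines.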

Character chunking severs the function at an arbitrary character offset; the second chunk is an orphan with no signature. AST chunking cuts at the function boundary the language itself defines, keeping each unit whole and meaningful.

AST-aware chunking

The fix is to parse the code into an Abstract Syntax Tree — the structured representation every compiler and IDE already builds — and chunk on its boundaries: one chunk per function, per method, per class. The AST knows exactly where a function starts and ends, so your chunks align with the units a developer actually thinks in. Most languages have a parser you can call; Python ships one in the standard library. Here is an AST chunker that splits a Python file into function- and class-level chunks, each carrying its full signature and the file's imports so it stands alone.

import ast
from pathlib import Path

def chunk_python(source, filepath):
    """Split a Python file into one chunk per top-level function/class,
    each prefixed with the file's imports so it's self-contained.
    Relies on end_lineno, so it needs Python 3.8 or newer."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # collect top-level imports once; slicing through end_lineno keeps
    # multi-line "from x import (a, b)" statements whole
    imports = ["\n".join(lines[n.lineno - 1:n.end_lineno])
               for n in tree.body
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    header = "\n".join(imports)

    chunks = []
    for node in tree.body:                      # top-level nodes only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # node.lineno is the def/class line; decorators sit above it,
            # so start from the first decorator when there is one
            first = min([d.lineno for d in node.decorator_list]
                        + [node.lineno])
            start = first - 1
            end = node.end_lineno               # AST gives us exact bounds
            body = "\n".join(lines[start:end])
            # prepend imports as context; tag with metadata (Ch02 habit)
            chunk_text = f"# from {filepath}\n{header}\n\n{body}"
            chunks.append({
                "text": chunk_text,
                "symbol": node.name,
                "kind": type(node).__name__,
                "filepath": filepath,
                "lines": (start + 1, end),      # 1-based, inclusive
            })
    return chunks

src = Path("payments.py").read_text(encoding="utf-8")
for c in chunk_python(src, "payments.py"):
    print(f"{c['kind']:12} {c['symbol']:20} lines {c['lines']}")

FunctionDef  charge_card          lines (8, 14)
FunctionDef  refund               lines (16, 23)
ClassDef     PaymentGateway       lines (25, 61)

Each chunk is now a whole function or class, prefixed with the imports that define its symbols, tagged with its name, kind, file, and line range. A query like "how do we charge a card" retrieves the entire charge_card function — signature, validation, and return — as one coherent unit, and the metadata lets the answer cite payments.py lines 8–14 exactly. This is the same metadata discipline from Chapter 02, paying off again.
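The chunker above stops at top-level nodes, so a long class comes back as one chunk. A possible extension, sketched here under assumptions, splits large classes into one chunk per method and keeps the class header line with each method so it still knows its owner. The function name chunk_class_methods, the max_lines threshold of 40, and the "Method" kind label are hypothetical choices. The sketch also assumes the class header fits on one line.

```python
import ast

def chunk_class_methods(source, filepath, max_lines=40):
    """Return one chunk per method for top-level classes longer than
    `max_lines`. Smaller classes are skipped (keep them as one chunk)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if not isinstance(node, ast.ClassDef):
            continue
        if node.end_lineno - node.lineno + 1 <= max_lines:
            continue
        class_header = lines[node.lineno - 1]   # assumes a one-line header
        for item in node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # include decorators, which sit above the def line
                first = min([d.lineno for d in item.decorator_list]
                            + [item.lineno])
                body = "\n".join(lines[first - 1:item.end_lineno])
                chunks.append({
                    # "..." marks the elided rest of the class body
                    "text": f"# from {filepath}\n{class_header}\n    ...\n{body}",
                    "symbol": f"{node.name}.{item.name}",
                    "kind": "Method",
                    "filepath": filepath,
                    "lines": (first, item.end_lineno),
                })
    return chunks

demo = (
    "class Gateway:\n"
    "    def send(self, x):\n"
    "        return x\n"
    "\n"
    "    @staticmethod\n"
    "    def ping():\n"
    "        return 'ok'\n"
)
# max_lines=3 only so the tiny demo class counts as "large"
for c in chunk_class_methods(demo, "gateway.py", max_lines=3):
    print(c["symbol"], c["lines"])
```

Qualifying the symbol as Class.method keeps exact-name lookups unambiguous when two classes define a method with the same name.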

A chunk needs its context

Even a whole function is often not enough on its own. charge_card calls gateway.send() — to truly understand it, you may need to know what gateway is and what send does. Code is a graph of references: definitions, calls, imports, inheritance. The richest code-RAG systems capture some of that graph, so that retrieving a function can pull in the definitions of the key symbols it depends on. At minimum, prepend imports (done above). Going further, you resolve the most important called symbols and include their signatures. This is the bridge to the next chapter — when the relationships between chunks matter as much as the chunks themselves, you're edging toward graph RAG.
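One way to go further, sketched under stated assumptions: collect the names a function calls and attach the signatures of any that are defined in the same file. This matches on bare names only, so it is a heuristic rather than real symbol resolution. It cannot tell two send methods on different classes apart, and it does not follow imports into other files. The helper names and the MODULE sample are hypothetical, and ast.unparse needs Python 3.9 or newer.

```python
import ast

def signature_of(func):
    """Render 'def name(args)' for a function node, without its body."""
    prefix = "async def" if isinstance(func, ast.AsyncFunctionDef) else "def"
    return f"{prefix} {func.name}({ast.unparse(func.args)})"

def called_names(func):
    """Names this function calls: foo() gives 'foo', obj.bar() gives 'bar'."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names

def context_signatures(source, symbol):
    """Signatures of functions defined in `source` that `symbol` calls."""
    tree = ast.parse(source)
    defs = {n.name: n for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))}
    target = defs[symbol]
    return sorted(signature_of(defs[name])
                  for name in called_names(target)
                  if name in defs and name != symbol)

MODULE = '''
def validate(amount):
    return amount > 0

class Gateway:
    def send(self, token, amount):
        return {"token": token, "amount": amount}

def charge_card(gateway, card, amount):
    if not validate(amount):
        raise ValueError("bad amount")
    return gateway.send(card, amount)
'''

for sig in context_signatures(MODULE, "charge_card"):
    print(sig)
```

Appending these signature lines to the chunk text, next to the imports, gives the model the shape of what charge_card depends on without pulling in whole function bodies.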

Do you need a code embedding model?

General-purpose embedding models from Chapter 04 understand code surprisingly well, because they were trained on plenty of it. Code-specific embedding models do better on code-to-code search and on matching natural-language questions to implementations, because they're tuned for it. The honest guidance is the same as always: run the fifty-question bake-off from Chapter 04, but with code questions and code chunks. If a general model already nails your retrieval, you've saved yourself a dependency. If your queries are "find the function that does X" and a code model measurably wins, switch. Measure; don't assume the specialised model is worth it.
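A code bake-off can be as small as a recall@k loop run once per candidate model. The sketch below is an assumption about how you might structure it, not the Chapter 04 harness itself: embed is any callable that maps a string to a list of floats, and toy_embed is a deliberately crude stand-in so the example runs without a model. Swap in your real embedding calls.

```python
import math

def recall_at_k(embed, questions, chunks, k=5):
    """Fraction of questions whose expected chunk id is in the top k.
    `questions` is a list of (question_text, expected_chunk_id);
    `chunks` maps chunk_id -> chunk text."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (math.sqrt(sum(x * x for x in a))
                * math.sqrt(sum(y * y for y in b)))
        return dot / norm if norm else 0.0

    vectors = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for question, expected in questions:
        q = embed(question)
        ranked = sorted(vectors, key=lambda cid: cosine(q, vectors[cid]),
                        reverse=True)
        hits += expected in ranked[:k]
    return hits / len(questions)

def toy_embed(text, dims=64):
    """Stand-in embedder (hashed bag of words). Replace with a model call."""
    vec = [0.0] * dims
    for word in text.lower().replace("_", " ").split():
        vec[sum(map(ord, word)) % dims] += 1.0
    return vec

chunks = {
    "charge_card": "def charge_card(card, amount): charge the card",
    "refund": "def refund(payment): refund a payment",
}
questions = [
    ("how do we charge a card", "charge_card"),
    ("how is a payment refunded", "refund"),
]

# One entry per candidate; add your general and code-specific models here.
candidates = {"toy": toy_embed}
for name, embed in candidates.items():
    print(name, recall_at_k(embed, questions, chunks, k=1))
```

Run it with the same fifty questions and the same AST chunks for every candidate, so the only variable is the embedding model.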

Decision	Prose-RAG default	Code-RAG choice
Chunking	Recursive on text separators	AST boundaries — function/class
Chunk context	Overlap	Imports + key symbol signatures
Embeddings	General text model	Test a code model; keep the general one unless the code model measurably wins
Retrieval	Hybrid (semantic + BM25)	Hybrid especially — exact symbol names matter
Metadata	Source, page	File, symbol name, line range, language

Note the retrieval row: hybrid search from Chapter 06 is even more valuable for code than for prose, because so many code queries are exact-symbol lookups — "where is charge_card defined" — which is precisely the BM25 strength and the vector blind spot. Code is full of the rare exact tokens that make the keyword side earn its place.
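Here is a small sketch of why the keyword side rescues an exact-symbol query. It fuses a vector ranking with a keyword ranking using reciprocal rank fusion, which is one common fusion method and an assumption here; Chapter 06 may combine scores differently. The keyword ranker is a token-overlap stand-in for BM25, and VECTOR_RANKING is a made-up result that puts the right function last.

```python
import re

def tokenize(text):
    """Lowercase identifier tokens; snake_case names stay whole so
    'charge_card' only matches 'charge_card'."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())

def keyword_rank(query, chunks):
    """Rank chunk ids by how many of their tokens appear in the query.
    A stand-in for BM25; chunks with no overlap are dropped."""
    q = set(tokenize(query))
    scores = {cid: sum(tok in q for tok in tokenize(text))
              for cid, text in chunks.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [cid for cid, score in ranked if score > 0]

def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) per id."""
    fused = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

CHUNKS = {
    "charge_card": "def charge_card(card, amount): return gateway.send(card, amount)",
    "refund_card": "def refund_card(payment): return gateway.refund(payment)",
    "charge_customer": "def charge_customer(customer, amount): return bill(customer, amount)",
}
QUERY = "where is charge_card defined"

# Hypothetical vector result: similar-looking functions outrank the target.
VECTOR_RANKING = ["refund_card", "charge_customer", "charge_card"]

print(rrf([VECTOR_RANKING, keyword_rank(QUERY, CHUNKS)]))
```

The target sits third in the vector list, but the exact token match lifts it to first after fusion.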

My take. The single highest-leverage change for code RAG is AST chunking — it's the code equivalent of the structural chunking that won in Chapter 03, and the win is even larger because code's structure is unambiguous (a parser, not a heuristic, finds the boundaries). Get that right before reaching for code-specific embedding models or symbol graphs. A system with AST chunks and a general embedder beats one with character chunks and the fanciest code model, because no embedding recovers a function that was cut in half.

The repo-scale problem

A real codebase can hold hundreds of thousands of interconnected symbols, and the answer to "how does authentication work here" is rarely one function; it's a flow across several files. Two implications. First, retrieval depth and reranking (Chapter 07) matter more, because you often need several related chunks, not one. Second, freshness is brutal: code changes constantly, and a stale index points developers at deleted or refactored functions. Re-index on commit, not on a schedule. An index that lags the codebase by a day can be worse than no index, because it confidently returns code that no longer exists.
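Re-indexing on commit does not have to mean re-embedding the whole repository. One possible approach, sketched under assumptions, is to keep a manifest of content hashes from the last index run and re-chunk only files whose hash changed. The function names and the manifest shape are hypothetical, and how you trigger it (a commit hook, a CI job) is left to your setup.

```python
import hashlib

def file_digest(text):
    """Stable content hash for one source file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(previous, current_files):
    """Compare the last indexed manifest with the current tree.
    `previous` maps filepath -> digest at last index time;
    `current_files` maps filepath -> current source text.
    Returns (to_reindex, to_delete, new_manifest)."""
    new_manifest = {path: file_digest(text)
                    for path, text in current_files.items()}
    # new or modified files need fresh chunks
    to_reindex = sorted(p for p, d in new_manifest.items()
                        if previous.get(p) != d)
    # files that vanished must have their chunks removed
    to_delete = sorted(p for p in previous if p not in new_manifest)
    return to_reindex, to_delete, new_manifest

previous = {
    "payments.py": file_digest("def charge_card(): ...\n"),
    "legacy.py": file_digest("def old(): ...\n"),
}
current = {
    "payments.py": "def charge_card(card, amount): ...\n",  # edited
    "refunds.py": "def refund(payment): ...\n",             # added
}
to_reindex, to_delete, manifest = plan_reindex(previous, current)
print("re-chunk:", to_reindex)
print("drop chunks for:", to_delete)
```

For every file in to_reindex, delete its existing chunks by filepath before inserting the new ones; otherwise a function that was renamed or removed inside that file lingers in the index.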

When this fails

Character chunking code. The cardinal sin, worth repeating. It severs functions and produces unrunnable, meaningless half-chunks. Parse to an AST and chunk on boundaries.
Dropping imports and signatures. A function chunk without its imports leaves every symbol it uses undefined and unsearchable. Prepend the context that makes the chunk self-contained.
Pure semantic search for symbol lookups. "Where is parse_config" is an exact-token query; vector search drifts to vaguely-similar functions. Use hybrid so the keyword side nails the exact name.
A stale index. Code churns fast. An index that isn't refreshed on commit returns functions that were renamed or deleted, sending developers on wild-goose chases. Re-index on change.
Giant files as one chunk. The opposite error: a 2,000-line file chunked whole is an unsearchable blob whose single embedding averages over dozens of functions. AST chunking fixes this too: split the file into its members.
Ignoring the call graph for flow questions. "How does a request travel through the system" can't be answered by any single function. For flow questions, you need the relationships — which is exactly what the next chapter is about.

Practice — before you read the next chapter

AST-chunk a real file

Take a source file from your codebase and run the AST chunker (or your language's equivalent parser). Inspect the chunks — is each a whole, understandable function or class? Compare against what recursive character chunking at 800 characters would have produced on the same file. The difference is the chapter in one experiment.

Test a symbol lookup

Pick an exact function name in your codebase and query for it with pure vector search, then with hybrid. Check whether the vector-only search returns plausible-but-wrong functions while hybrid lands on the exact definition. If it does, you've seen first-hand why hybrid is non-negotiable for code.

Trace a flow question

Ask your code-RAG system a flow question — "how does a user request get authenticated?" — and see whether single-function retrieval can answer it. Where it falls short is precisely the gap graph RAG fills, which sets up the next chapter.

Takeaways

Character chunking is uniquely destructive for code — it severs functions into meaningless, unrunnable halves. Parse to an AST and chunk on function and class boundaries.
Make each code chunk self-contained: prepend imports and, ideally, the signatures of key symbols it references. Tag it with file, symbol name, and line range.
General embedding models handle code well; test a code-specific model with a code bake-off and switch only if it measurably wins.
Hybrid search is especially vital for code, because so many queries are exact-symbol lookups — a BM25 strength and a vector blind spot.
At repo scale, flow questions span many functions (lean on depth and reranking) and freshness is critical (re-index on commit, not on a schedule).

Next chapter: Graph RAG — when graphs beat vectors. Some questions are about relationships — "who reports to whom," "what depends on this," "how is A connected to C." Vector similarity can't answer those. We'll see when a knowledge graph beats a vector index, and the honest cost of building one.
