RAG: A Field Manual for Building LLM Systems That Use Your Data

Data prep — parsing the messy real world

Here is the part nobody puts in the demo. Before you can chunk a document, embed it, or retrieve from it, you have to turn it into clean text — and your data does not arrive as clean text. It arrives as PDFs with two columns and a footer on every page, as HTML wrapped in three layers of navigation menus, as scanned contracts that are really just photographs of words, as spreadsheets where the meaning lives in the layout. Teams that skip this step build a beautiful retrieval pipeline on top of garbage and then wonder why the answers are wrong. The retrieval was perfect. The input was broken.

This chapter is about the unglamorous forty percent of real RAG work: getting trustworthy text out of untrustworthy formats. It is the chapter most tutorials skip, which is exactly why most tutorials produce demos that fall over on real documents.

What you'll take away from this chapter

Why data prep, not the model or the database, is the usual ceiling on RAG quality
The five document types you'll actually meet, and the specific way each one breaks
A practical extraction-and-cleaning pipeline, with runnable Python you can adapt
How to handle the three hardest cases — multi-column PDFs, tables, and scanned pages
The metadata you should capture during prep, because you cannot recover it later

Garbage in, confident garbage out

In Chapter 00 we noted that retrieval faithfully fetches whatever is in your corpus, in both directions. Data prep is where that corpus is made. If your extraction merges a table's columns into a word soup, the chunk that contains the answer is now unreadable, and no embedding model on earth will rank it correctly. If your PDF parser reads a two-column page straight across, every sentence is spliced to the sentence beside it, and the meaning is gone before you ever embed a thing.

The uncomfortable truth: improving your parser usually buys more accuracy than improving your embedding model or your reranker. It is the least glamorous lever and frequently the largest one. Spend your first week of any RAG project here, looking at the actual extracted text, before you touch anything downstream.

The single most useful habit in this whole series. After extraction, print the text and read it with your own eyes — a random sample of twenty chunks. Not the document; the extracted text. You will find merged columns, lost tables, page numbers glued into sentences, and headers repeating every 500 words. Every one of those is a retrieval bug you can fix now or debug painfully later.
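The habit is easier to keep if sampling takes one call. A minimal sketch, assuming your extracted text is already held as a list of strings (pages or chunks); `sample_chunks` and `print_for_review` are illustrative names, not part of any library:

```python
import random


def sample_chunks(chunks, n=20, seed=None):
    """Return up to n chunks chosen at random, without replacement."""
    rng = random.Random(seed)  # pass a seed to get a reproducible sample
    pool = list(chunks)
    return rng.sample(pool, k=min(n, len(pool)))


def print_for_review(chunks, n=20, seed=None, width=72):
    """Print a random sample with clear separators, for reading by eye."""
    for i, chunk in enumerate(sample_chunks(chunks, n, seed), start=1):
        print("=" * width)
        print(f"sample {i}  ({len(chunk)} chars)")
        print("-" * width)
        print(chunk)
```

Sampling at random matters: the first few pages of a document are often the cleanest, and reading only those hides the defects further in.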

The data-prep funnel

Prep is a sequence of narrowing steps, each one turning something messy into something slightly more usable. Picture it as a funnel: raw bytes go in the top, pass through extract, clean, normalise, and tag, and retrieval-ready text comes out the bottom.

Each step is narrower and more opinionated than the last. The funnel only widens at one point — chunking, which comes next chapter — when one document becomes many chunks. Everything here happens before that split.
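As one concrete stage of the funnel, here is a rough sketch of what a normalise step might do, using only the standard library. The function name and the exact rules are assumptions for illustration; which rules you want depends on your documents.

```python
import re
import unicodedata


def normalise(text):
    """Tidy extracted text: Unicode forms, line-break hyphens, whitespace."""
    # NFKC folds compatibility characters (the "fi" ligature, non-breaking
    # spaces) into plain equivalents. It also rewrites characters such as
    # superscript digits, which may not suit every corpus.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break. This is a heuristic:
    # it will also join genuine hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, but keep newlines.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces next to newlines, then cap blank lines at one.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Each rule is deliberately small so that you can switch one off when the sample you read shows it doing harm.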

The five document types, and how each one breaks

Type	How it breaks	The fix in one line
Born-digital PDF	Multi-column read straight across; headers/footers in every page's text; ligatures mangled.	Use a layout-aware parser, not a naive text dump.
Scanned PDF / image	It's a picture; there is no text to extract at all.	OCR first, then treat as text — and expect errors.
HTML	Navigation, ads, cookie banners, and footers outweigh the article.	Extract the main content node; discard the chrome.
Tables (in any format)	Linearised into a word soup; row/column relationships lost.	Detect tables; serialise each row as a self-describing sentence.
Office docs (DOCX, PPTX)	Text is recoverable but structure (headings, slides) is easily lost.	Use a structure-preserving extractor; keep heading hierarchy.

Extracting clean text — a working example

Here is a realistic extraction step for born-digital PDFs that respects layout and strips the repeating header/footer noise. It uses pdfplumber, which exposes the position of every word so you can reason about columns rather than reading blindly across the page.

# pip install pdfplumber
import re
from collections import Counter

import pdfplumber

def pattern_of(line):
    """Replace digit runs with '#' so 'Page 3 of 60' and
    'Page 4 of 60' count as the same repeating line."""
    return re.sub(r"\d+", "#", line)

def extract_pdf(path):
    """Extract text from a born-digital PDF, one string per page,
    with repeating headers/footers removed."""
    raw_pages = []
    page_counter = Counter()  # pattern -> number of pages it appears on

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # layout=True pads the text to mimic the page's spacing,
            # so the gutter between columns survives as whitespace;
            # lines are still emitted left to right across the page
            text = page.extract_text(layout=True) or ""
            lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
            raw_pages.append(lines)
            # count each pattern once per page, not once per occurrence
            for pat in {pattern_of(ln) for ln in lines}:
                page_counter[pat] += 1

    # a pattern that appears on most pages is almost certainly
    # a running header or footer, not real content
    n_pages = len(raw_pages)
    boilerplate = {pat for pat, c in page_counter.items()
                   if c > n_pages * 0.5 and n_pages > 2}

    pages_text = []
    for lines in raw_pages:
        kept = [ln for ln in lines if pattern_of(ln) not in boilerplate]
        pages_text.append("\n".join(kept))

    return pages_text, boilerplate

pages, removed = extract_pdf("annual-report.pdf")
print(f"Extracted {len(pages)} pages")
print(f"Removed {len(removed)} boilerplate patterns:")
for pat in sorted(removed)[:3]:
    print(f"  · {pat!r}")
print("\n--- page 4 preview ---")
print(pages[3][:300])

Running this on a typical 60-page report prints something like:

Extracted 60 pages
Removed 3 boilerplate patterns:
  · 'ACME Corporation — Annual Report #'
  · 'Confidential — Do Not Distribute'
  · 'Page # of #'

--- page 4 preview ---
Revenue grew 18% year over year, driven primarily by the
expansion of the enterprise segment in the EMEA region.
Operating margin improved to 23%, reflecting disciplined
cost management across the business...

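Two things earned their keep here. layout=True preserved the page's horizontal spacing, so the gutter between the two columns of the financial section survives as a run of spaces instead of the columns' words running together. Note what it does not do: lines are still emitted across the full page width, so reading each column top to bottom requires cropping the page or grouping words by column, using the positions pdfplumber exposes. And the frequency-based boilerplate detector removed the header and footer that would otherwise appear in every single chunk, polluting embeddings with "Confidential — Do Not Distribute" over and over. Neither fix is clever; both are the difference between usable and unusable text.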
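When a page still comes out read across its columns, word positions can be used to rebuild the reading order by hand. The sketch below is one simple way to do that, not the only one. It assumes equal-width columns and a column count you supply, and it works on plain dicts with "text", "x0", "x1", and "top" keys, which is the shape pdfplumber's page.extract_words() returns. The function name and the line_tol parameter are illustrative.

```python
from collections import defaultdict


def words_to_column_text(words, page_width, n_columns=2, line_tol=3.0):
    """Rebuild reading order from positioned words, column by column.

    Assumes equal-width columns; line_tol is how far apart (in page
    units) two words' tops may be while still counting as one line.
    """
    strip = page_width / n_columns
    columns = defaultdict(list)
    for w in words:
        centre = (w["x0"] + w["x1"]) / 2
        idx = max(0, min(int(centre // strip), n_columns - 1))
        columns[idx].append(w)

    def flush(line_words, out):
        ordered = sorted(line_words, key=lambda w: w["x0"])
        out.append(" ".join(w["text"] for w in ordered))

    out_lines = []
    for idx in range(n_columns):
        current, current_top = [], None
        for w in sorted(columns[idx], key=lambda w: (w["top"], w["x0"])):
            # a jump in vertical position starts a new line
            if current and abs(w["top"] - current_top) > line_tol:
                flush(current, out_lines)
                current = []
            if not current:
                current_top = w["top"]
            current.append(w)
        if current:
            flush(current, out_lines)
    return "\n".join(out_lines)
```

With pdfplumber the inputs would come from page.extract_words() and page.width. Full-width elements such as titles and running headers get split between columns by this approach, so strip or handle them separately before reordering.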

Cleaning HTML down to the article

HTML is mostly not content. A news article page might be 4% article and 96% navigation, ads, related-links, and footer. Embedding the whole thing buries the signal. The job is to find the main content node and throw away the rest.

# pip install trafilatura
import trafilatura

def extract_article(html):
    """Pull just the main article text out of a web page,
    discarding nav, ads, boilerplate."""
    # trafilatura is purpose-built for main-content extraction
    text = trafilatura.extract(
        html,
        include_comments=False,   # drop comment sections
        include_tables=True,      # keep tables (we handle them below)
        favor_precision=True,     # prefer clean over complete
    )
    return text or ""

with open("blog-post.html", encoding="utf-8") as f:
    article = extract_article(f.read())

print(f"Article length: {len(article)} chars")
print(article[:200])

Article length: 5840 chars
Hybrid search combines the precision of keyword matching with
the recall of semantic search. In this post we walk through
why neither approach alone is sufficient for production...

A general-purpose HTML-to-text converter would have returned 60,000 characters here, most of it menu labels and "subscribe to our newsletter." The purpose-built extractor returned 5,840 characters of actual article, about a tenth of the page. A ratio in that range is typical, and it is why "just strip the tags" is the wrong instinct for the web.

The table problem

Tables are where naive extraction does its quietest damage. Consider a small pricing table with the columns Plan, Price, and Seats.

A plain text dump flattens the Pro row to "Pro 20 5", every number orphaned from its meaning. The alternative serialises each row into a sentence that carries its own headers, such as "Plan: Pro; Price: $20; Seats: 5.", so a chunk containing one row is still fully interpretable.

The principle is to make each row self-describing: fold the column headers into every row so that, no matter how the table gets chunked later, each piece still says what it means.

def serialise_table(headers, rows, caption=""):
    """Turn a table into one self-describing sentence per row.
    Each row keeps its column names so it survives chunking."""
    sentences = []
    for row in rows:
        parts = [f"{h}: {v}" for h, v in zip(headers, row)]
        line = "; ".join(parts)
        if caption:
            line = f"[{caption}] {line}"
        sentences.append(line + ".")
    return "\n".join(sentences)

headers = ["Plan", "Price", "Seats"]
rows = [["Free", "$0", "1"], ["Pro", "$20", "5"], ["Team", "$50", "20"]]
print(serialise_table(headers, rows, caption="Pricing"))

[Pricing] Plan: Free; Price: $0; Seats: 1.
[Pricing] Plan: Pro; Price: $20; Seats: 5.
[Pricing] Plan: Team; Price: $50; Seats: 20.

Now a user asking "how many seats does Pro include?" retrieves a chunk that unambiguously says "Plan: Pro; Price: $20; Seats: 5." The naive version would have retrieved "Pro 20 5" and left the model to guess whether 20 was the price or the seat count. This single technique fixes a large share of "the numbers are wrong" complaints in production RAG.

Scanned documents and OCR

A scanned PDF has no text — it is an image of text. You must run optical character recognition first. The thing to internalise is that OCR is lossy: it will misread some characters, and those errors flow downstream into your embeddings and answers. Plan for imperfection.

# pip install pytesseract pdf2image  (also needs system tesseract + poppler)
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(path, min_confidence=60, review_below=80):
    """OCR a scanned PDF page by page. Flag low-confidence pages
    so a human can review them instead of trusting silently."""
    pages = convert_from_path(path, dpi=300)  # higher dpi = better OCR
    results = []
    for i, image in enumerate(pages):
        data = pytesseract.image_to_data(
            image, output_type=pytesseract.Output.DICT)
        # keep only real words: layout boxes have empty text and conf -1;
        # float() copes with conf arriving as int, float, or numeric string
        scored = [(w, float(c)) for w, c in zip(data["text"], data["conf"])
                  if w.strip() and float(c) >= 0]
        words = [w for w, c in scored if c >= min_confidence]
        # average over every recognised word, including the ones
        # dropped from the text, so bad pages are not flattered
        avg_conf = sum(c for _, c in scored) / len(scored) if scored else 0.0
        results.append({
            "page": i + 1,
            "text": " ".join(words),
            "avg_confidence": round(avg_conf, 1),
            "needs_review": avg_conf < review_below,  # surface shaky pages
        })
    return results

pages = ocr_pdf("scanned-contract.pdf")
for p in pages[:3]:
    flag = "  ⚠ REVIEW" if p["needs_review"] else ""
    print(f"Page {p['page']}: confidence {p['avg_confidence']}%{flag}")

Page 1: confidence 94.2%
Page 2: confidence 71.5%  ⚠ REVIEW
Page 3: confidence 96.1%

The detail that matters: capturing a confidence score and flagging weak pages. A silent OCR pipeline ingests a 71%-confidence page full of misreadings and serves wrong answers from it forever. A pipeline that surfaces "page 2 is shaky" lets a human glance at it. The cost is a few lines; the payoff is not grounding your system in a machine's misreading of a blurry fax.

The metadata you must capture now

During prep you have information you will never be able to recover later, once the document is shredded into chunks. Capture it as structured metadata attached to every chunk:

Source — the document title and a stable URL or identifier, so the chunk can cite itself later.
Position — page number, section heading, or slide number. This powers precise citations and helps reassemble context.
Timestamps — created and last-modified dates, so you can prefer fresh material and expire stale material.
Type and confidence — was this OCR'd? From a table? Low-confidence? Downstream stages can treat shaky sources more cautiously.
Access scope — who is allowed to see this, captured at ingestion. Retrofitting per-user access control later is painful; we return to it in the production wave.
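One way to make the list above concrete is a small record type that travels with every chunk. This is a sketch under assumptions: the field names, the "allowed_groups" representation of access scope, and the citation format are all illustrative choices, not a standard.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ChunkMetadata:
    # Source: enough to cite the chunk later
    source_id: str
    source_title: str
    # Position: whichever of these the document type provides
    page: Optional[int] = None
    section: Optional[str] = None
    # Timestamps: for preferring fresh material and expiring stale material
    created_at: Optional[datetime] = None
    modified_at: Optional[datetime] = None
    # Type and confidence: e.g. "text", "table", "ocr" (hypothetical labels)
    content_type: str = "text"
    ocr_confidence: Optional[float] = None
    # Access scope, captured at ingestion (group names are an assumption)
    allowed_groups: List[str] = field(default_factory=list)

    def citation(self):
        """Build a human-readable citation from whatever position is known."""
        parts = [self.source_title]
        if self.section:
            parts.append(self.section)
        if self.page is not None:
            parts.append(f"p. {self.page}")
        return ", ".join(parts)
```

asdict(meta) turns the record into a plain dict, which is a convenient form to hand to whatever store you use. Decide explicitly what an empty allowed_groups means in your system (visible to nobody or to everybody) before you rely on it.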

My take. Metadata feels like bureaucracy at ingestion time and turns out to be the difference between a toy and a product. The single most common regret I hear from teams six months in is "we didn't capture the source position, so now our citations just say 'somewhere in this 200-page PDF.'" Capture more than you think you need. Storage is cheap; re-ingesting fifty million documents is not.

When this fails

Trusting a single parser for everything. The parser that's perfect for clean born-digital PDFs produces nonsense on scans, and vice versa. Detect the document type first and route to the right extractor. A one-size pipeline silently mangles whatever doesn't fit.
Never looking at the output. The cardinal sin. Extraction runs, no errors are thrown, everyone assumes it worked. Meanwhile every table is word soup. No exception is ever raised for "the text came out wrong but technically parsed"; only your eyes catch that.
Linearising tables. Covered above, and worth repeating because it is so common: a table dumped as plain text is actively misleading, not merely incomplete. Serialise rows.
Silent OCR. Ingesting low-confidence OCR without flagging it grounds your answers in misreadings. Capture confidence; surface the weak pages.
Throwing away structure. Headings, sections, and slide boundaries are gifts — they are natural chunk boundaries and citation anchors. A parser that flattens everything to one blob throws that gift away, and the next chapter on chunking will have nothing to work with.
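The first failure above, one parser for everything, is avoided by a routing step in front of the extractors. Here is a rough sketch that sniffs the leading bytes of a file and, for PDFs, uses the size of the text layer to choose between text extraction and OCR. The route names, the 50-character threshold, and the pdf_text_chars argument (which you would measure yourself with a quick extraction pass) are assumptions for illustration.

```python
def sniff_type(data: bytes) -> str:
    """Guess a coarse document type from the file's leading bytes."""
    head = data[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if data[:4] == b"PK\x03\x04":
        # DOCX, PPTX and XLSX are ZIP containers; so is any ordinary .zip,
        # so a real pipeline should look inside before trusting this.
        return "office"
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"
    return "unknown"


def pick_route(data: bytes, *, pdf_text_chars: int, min_chars: int = 50) -> str:
    """Choose an extractor route. pdf_text_chars is the number of characters
    a text extraction pass found; it is ignored for non-PDF input."""
    kind = sniff_type(data)
    if kind == "pdf":
        # a PDF with almost no text layer is most likely a scan
        return "pdf_text" if pdf_text_chars >= min_chars else "pdf_ocr"
    return kind
```

Returning "unknown" instead of guessing keeps the failure visible: an unrecognised file can be logged and reviewed, not pushed through whichever parser happens to be the default.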

Practice — before you read the next chapter

Read your own extraction

Take five real documents from your domain — ideally a mix: a PDF, a web page, something with a table. Extract the text with any tool. Then read the raw output, all of it, slowly. Write down every defect you spot: merged columns, lost tables, repeating headers, broken characters. This list is your data-prep backlog, and it is almost always longer and more important than people expect.

Fix one table

Find a document with a table that matters: a pricing sheet, a spec table, a comparison. Extract it naively and see the word soup. Then serialise its rows into self-describing sentences by hand. Embed both versions (you'll learn how in Chapter 04) and notice which one a relevant query retrieves. The difference is usually stark.

Design your metadata schema

Before you ingest anything for real, write down the exact fields you'll attach to every chunk. Use the list above as a starting point and add whatever your domain needs — author, document category, language, version. Committing to this schema now saves a full re-ingestion later.

Takeaways

Data prep is usually the real ceiling on RAG quality — a bigger lever than the embedding model or the reranker. Spend your first week here.
The funnel is extract → clean → normalise → tag. Each step is more opinionated than the last; all of it happens before chunking.
Each document type breaks in a specific way. Detect the type and route to the right parser rather than trusting one tool for all.
Serialise tables into self-describing rows. A linearised table is misleading, not just incomplete.
OCR is lossy — capture confidence and flag weak pages instead of trusting silently.
Capture source, position, timestamps, type, and access scope as metadata now. You cannot recover it once documents become chunks.
Read your extracted text with your own eyes. It is the highest-value habit in the whole pipeline.

Next chapter: Chunking — the hardest problem. Now that you have clean text, how do you split it? The chunk is the unit of retrieval, and how you cut determines what can ever be found. We'll measure the strategies, not just list them.
