Tooling — the 2026 honest tour

Every week brings a new framework that promises RAG in five lines, a new vector database that's fastest by some benchmark, a new eval tool, a new managed service that hides the whole pipeline behind an endpoint. The noise is exhausting, and worse, it's optimised to make you feel behind. This chapter is the antidote: an opinionated map of the tooling landscape as it stands in 2026, organised by what each layer actually does for you, what it costs you in control, and the one question that cuts through all of it — should you adopt this, or write the thirty lines it would replace?

What you'll take away from this chapter

The layers of the RAG tooling stack, and what each genuinely does
When an orchestration framework helps and when it just hides the pipeline you need to understand
How to read eval tools and managed RAG services without the marketing
The build-versus-adopt decision, reduced to a question you can actually answer
Why the durable skill is understanding the pipeline, not memorising the tools

The layers of the stack

Tools cluster into layers that map onto the pipeline you've spent this series building. Seeing them as layers — rather than a flat list of competing brand names — is what lets you choose one per layer instead of being sold a bundle.

The stack has four functional layers (orchestration, models, storage, and evaluation), plus the managed option that bundles them. You can mix and match the four, for example your own orchestration over a hosted model over pgvector with an open eval tool, or hand the whole stack to a managed service. The trade is always the same: control versus convenience.

Orchestration frameworks

The big frameworks (the LangChain and LlamaIndex lineage) give you pre-built components for every stage — loaders, splitters, retrievers, chains — and wire them together. Their genuine value is prototyping speed: you can stand up a working pipeline in an afternoon and try ideas fast. Their genuine cost is abstraction: they hide the very decisions this series taught you to make deliberately. When chunking, retrieval, and generation are three method calls with default parameters, you lose sight of what's actually happening — and when quality is mediocre, you can't tell which hidden default is the culprit.

My take. Use a framework to prototype and learn the shape; consider graduating to a thin pipeline of your own once you understand what you need. Here's the uncomfortable observation behind that: a production RAG pipeline, written directly, is often only a few hundred lines (embed, store, hybrid-retrieve, rerank, generate), and every line is one you understand and can tune. Frameworks shine while you're exploring; once you're optimising, they can become a layer of mystery you have to debug through. This isn't anti-framework; it's pro-understanding. The framework is scaffolding, and scaffolding is meant to come down. If yours is helping you ship and you can still see through it, keep it.
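To make the shape of a directly written pipeline concrete, here is a minimal sketch of the five stages as plain functions. Everything in it is a stand-in chosen so the example runs on its own: the bag-of-words `embed`, the in-memory `Store`, the term-overlap `rerank`, and the `alpha` blend weight are illustrative assumptions, not a recommended implementation. A real pipeline would swap each one for an embedding model, a vector store, a reranker model, and an LLM call.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: a bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Store:
    # Stand-in for a vector store: parallel lists held in memory.
    chunks: list[str] = field(default_factory=list)
    vectors: list[Counter] = field(default_factory=list)

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))


def hybrid_retrieve(store: Store, query: str, k: int = 3, alpha: float = 0.5) -> list[str]:
    # Blend a vector score with a keyword-overlap score; alpha is an assumed weight.
    q_vec, q_terms = embed(query), set(tokenize(query))
    scored = []
    for chunk, vec in zip(store.chunks, store.vectors):
        vector_score = cosine(q_vec, vec)
        keyword_score = len(q_terms & set(vec)) / len(q_terms) if q_terms else 0.0
        scored.append((alpha * vector_score + (1 - alpha) * keyword_score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def rerank(query: str, chunks: list[str], top_n: int = 2) -> list[str]:
    # Stand-in for a reranker model: order by number of shared query terms.
    q_terms = set(tokenize(query))
    return sorted(chunks, key=lambda c: len(q_terms & set(tokenize(c))), reverse=True)[:top_n]


def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def answer(store: Store, query: str, llm: Optional[Callable[[str], str]] = None) -> str:
    # With no llm supplied, return the prompt so the retrieval result stays visible.
    context = rerank(query, hybrid_retrieve(store, query))
    prompt = build_prompt(query, context)
    return llm(prompt) if llm else prompt


store = Store()
for chunk in [
    "pgvector stores embeddings inside Postgres.",
    "Rerankers reorder retrieved chunks by relevance.",
    "Chunking splits documents into passages.",
]:
    store.add(chunk)

print(answer(store, "How does pgvector store embeddings?"))
```

Each stage is one function with its parameters in plain sight, which is the property the paragraph above argues for: when quality is mediocre, there is no hidden default to hunt for.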

Model providers

Embedding, reranking, and generation models come as hosted APIs or self-hosted weights — the choice you weighed in Chapter 04. The tooling point: keep this layer swappable. Wrap each model behind a thin interface of your own so that switching an embedding model (and triggering the migration from Chapter 04) or moving a generator from API to self-hosted is a config change, not a rewrite. The fastest-moving part of the whole stack is the models; build so you can move with them.
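One way to keep the model layer swappable is a small interface plus a config-driven factory. The sketch below assumes Python's `typing.Protocol`; the class names, model names, and config key are hypothetical, and the `embed` bodies return placeholder vectors where a real implementation would call a provider API or load local weights.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Embedder(Protocol):
    """The only surface the rest of the pipeline is allowed to depend on."""

    name: str

    def embed(self, texts: list[str]) -> list[list[float]]: ...


@dataclass
class HostedEmbedder:
    name: str = "hosted-embed-v1"  # hypothetical model name

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Placeholder: a real version would call the provider's API here.
        return [[float(len(t)), 0.0] for t in texts]


@dataclass
class LocalEmbedder:
    name: str = "local-embed-small"  # hypothetical model name

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Placeholder: a real version would run self-hosted weights here.
        return [[0.0, float(len(t.split()))] for t in texts]


EMBEDDERS: dict[str, Callable[[], Embedder]] = {
    "hosted": HostedEmbedder,
    "local": LocalEmbedder,
}


def make_embedder(config: dict) -> Embedder:
    kind = config["embedder"]
    if kind not in EMBEDDERS:
        raise ValueError(f"unknown embedder {kind!r}; expected one of {sorted(EMBEDDERS)}")
    return EMBEDDERS[kind]()


def index_documents(embedder: Embedder, docs: list[str]) -> dict:
    # Record which model produced the vectors, so a later model switch is
    # detectable as "stored vectors need re-embedding" rather than silent drift.
    return {"model": embedder.name, "vectors": embedder.embed(docs)}


index = index_documents(make_embedder({"embedder": "local"}), ["swap models by config"])
print(index["model"])
```

Because `index_documents` only sees the `Embedder` interface, changing `{"embedder": "local"}` to `{"embedder": "hosted"}` is the whole switch on the code side. The re-embedding migration still has to happen, which is why the model name is stored alongside the vectors.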

Evaluation tools

Eval frameworks (the RAGAS lineage and others) package the metrics from Chapter 11 — faithfulness, relevance, context quality — so you don't implement them from scratch. They're a real time-saver and a reasonable starting point. The caution is the same one from Chapter 11: an eval tool's LLM-judge metrics are only as trustworthy as their agreement with your human judgement. Adopt the tool for convenience, but validate its scores against a human-graded sample before you trust it to gate releases. A borrowed metric you haven't validated is a number, not a measurement.
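Validating a judge metric against human grades can be as simple as comparing two lists of pass/fail labels. The sketch below reports raw agreement and Cohen's kappa, which corrects agreement for what two graders would reach by chance. It assumes the tool's scores have already been turned into pass/fail labels; the `binarize` helper and its 0.5 threshold are illustrative assumptions, not part of any particular eval tool.

```python
def binarize(scores: list[float], threshold: float = 0.5) -> list[bool]:
    # Assumed convention: a score at or above the threshold counts as a pass.
    return [s >= threshold for s in scores]


def agreement_report(human: list[bool], judge: list[bool]) -> dict[str, float]:
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Fraction of examples where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each grader's pass rate.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    # Kappa is undefined when both graders gave one identical label throughout.
    kappa = float("nan") if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}


human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, True, True, True, False]
report = agreement_report(human_labels, judge_labels)
print(report)
```

Raw agreement alone can flatter a judge when most examples pass, so kappa is the more useful number to watch. What level is good enough to gate releases is a judgement call for your own data.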

Managed RAG services

At the far end, managed services swallow the entire pipeline: you send documents and queries, they handle chunking, embedding, storage, retrieval, and generation behind one endpoint. The appeal is real — fastest possible start, nothing to operate. The costs are equally real: little control over the decisions that determine quality (you can't tune a chunking strategy you can't see), potential lock-in, and the data-residency and access-control questions from Chapter 16 now depend entirely on the vendor. Managed services fit when RAG is peripheral to your product and you want it handled; they fit poorly when retrieval quality is your product and you need to tune it.

The build-versus-adopt question

Cut through every tooling debate with one question: is this capability core to your product, or peripheral? For peripheral capabilities, adopt the highest-level tool that works and move on — your effort belongs elsewhere. For core capabilities — the ones that differentiate your product and that you'll need to tune repeatedly — bias toward building, or at least toward tools transparent enough to tune. You do not want your central differentiator hidden inside an abstraction you can't see into.

Situation	Lean toward	Because
Prototyping, learning the shape	Framework	Speed of iteration beats control here.
RAG is peripheral to the product	Managed service	Effort belongs on your actual product.
Retrieval quality is the product	Build / transparent tools	You'll tune the core constantly; you must see it.
Standard need, up to a few million vectors	pgvector + thin pipeline	Already covered in Chapter 05; minimal new surface.
Measuring quality	Adopt eval tool, then validate	Don't reimplement metrics; do verify them.

When this fails

Choosing tools before understanding the pipeline. Adopting a framework to avoid learning how RAG works leaves you unable to debug it when it underperforms. Understand the stages first (you now do); then the tools are just conveniences.
Framework lock-in to defaults. Accepting a framework's hidden chunking and retrieval defaults means shipping someone else's untuned choices. If you use a framework, know and override its defaults deliberately.
Trusting an eval tool's scores unvalidated. Borrowed LLM-judge metrics can disagree with your human judgement. Validate against a graded sample before letting any tool gate releases.
Managed service for a core capability. Outsourcing the pipeline when retrieval quality is your differentiator caps your quality at the vendor's and surrenders the tuning you most need.
Chasing the new thing. The landscape churns weekly; rewriting your stack for each new framework is motion, not progress. Keep the layer interfaces thin and swap individual pieces on measured need, not on news.

Practice — before you read the next chapter

Map your stack to the layers

Write down what you use (or plan to) at each layer: orchestration, models, storage, evaluation. For each, note whether you adopted a tool or built it, and whether that capability is core or peripheral to your product. Mismatches — a built peripheral, an adopted core — are where to reconsider.
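If you prefer to do this exercise in code, the mapping fits in a small record type and one function that flags the two mismatches named above. The field names and the example stack are hypothetical; fill in your own tools and judgements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerChoice:
    layer: str   # orchestration, models, storage, or evaluation
    tool: str    # what you use at that layer
    built: bool  # True if you wrote it yourself, False if you adopted it
    core: bool   # True if this capability differentiates your product


def find_mismatches(stack: list[LayerChoice]) -> list[str]:
    notes = []
    for choice in stack:
        if choice.built and not choice.core:
            notes.append(f"{choice.layer}: built but peripheral; consider adopting a tool")
        elif not choice.built and choice.core:
            notes.append(f"{choice.layer}: adopted but core; check you can see into and tune {choice.tool}")
    return notes


# Hypothetical example stack.
my_stack = [
    LayerChoice("orchestration", "framework defaults", built=False, core=True),
    LayerChoice("models", "hosted API", built=False, core=False),
    LayerChoice("storage", "pgvector", built=False, core=False),
    LayerChoice("evaluation", "home-grown metrics", built=True, core=False),
]

for note in find_mismatches(my_stack):
    print(note)
```

The output is only a list of places to reconsider, not a verdict: an adopted core capability can be fine if the tool is transparent enough to tune.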

Count the lines

If you use a framework, try writing the bare pipeline yourself — embed, store, retrieve, rerank, generate — and count the lines. The number is usually smaller than people expect. Whether or not you switch, the exercise reveals exactly what the framework was doing for you, and that visibility is worth having.

Validate one borrowed metric

Take one metric from an eval tool you use, grade ten of the same examples by hand, and compare. The agreement (or gap) tells you how much to trust that tool's numbers — and turns a borrowed metric into one you've actually verified.

Takeaways

RAG tooling has four functional layers — orchestration, models, storage, evaluation — plus managed services that bundle them all. Choose per layer, not by brand bundle.
Frameworks excel at prototyping and can obscure the decisions that determine quality. Use them to learn the shape; keep them only while you can still see through them.
Keep the model layer swappable behind a thin interface — it's the fastest-moving part of the stack.
Adopt eval tools for convenience, but validate their judge-based scores against human judgement before trusting them.
The decisive question is core versus peripheral. Adopt high-level tools for peripheral capabilities; build or use transparent tools for the ones that are your product.

Next chapter: RAG in the wild — three case studies. Enough principles — let's watch them collide with reality. Three real-shaped systems, the specific decisions their builders made, what went wrong, and what the fixes teach. The whole series, seen through three concrete builds.
