Agentic SDLC: A Field Manual for Building Software with AI Agents

Tooling — the honest tour

The risk in writing a tooling chapter is that it ages badly. Named products change pricing, features, and ownership. New entrants displace incumbents. The "we just shipped this" of today is the "what happened to them" of next year. So this chapter takes a different approach. Rather than rank products, it categorises them, describes the shape of each category, and tells you what to look for. The category framework should still make sense in three years.

This is the chapter to come back to when you're evaluating a new tool. Read it once for the lay of the land, then use it as a checklist when you're being pitched.

What you'll take away from this chapter

The eight categories of tooling in the 2026 agentic stack and what each is actually for
The evaluation criteria that matter — and the ones that don't, despite vendor emphasis
Which categories you almost certainly need; which you might; which you probably don't yet
The questions to ask before paying for a category-defining product
The two patterns that consistently sell well and consistently disappoint

The eight categories

Eight categories, three layers: user-facing, infrastructure, and meta. Most teams need something in each of the user-facing categories, plus a model API. The rest are optional and stage-dependent.

One paragraph per category, focused on the decision you're making when you pick.

1 · IDE / editor agents. Where most engineers meet agents. The decision is fit with your team's existing editor preferences, plus the quality of the diff (small, focused, correct), context handling, and latency. What doesn't matter as much as vendors claim: which model it's "powered by." The harness matters more than the model name.

2 · CLI agents. For engineers who live in tmux. The decision is composability with shell pipelines and the ability to script the agent. Strong case if your team already scripts the dev environment; weak case otherwise.

3 · Background agents. CI bots, schedulers, librarians. The decision is permission scoping (Ch. 06), draft-by-default behavior, and audit trail. A few well-designed background agents are valuable; ten of them is usually a sign someone is automating problems that should be solved differently.
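The three properties named above (permission scoping, draft-by-default behavior, audit trail) can be sketched in a few lines. This is a minimal illustration, not any product's API: `AgentPolicy`, `AuditLog`, `authorize`, and the action names such as `open_draft_pr` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical policy for one background agent."""
    name: str
    allowed_actions: frozenset[str]
    draft_only: bool = True  # draft-by-default: non-draft PRs are refused


@dataclass
class AuditLog:
    """In-memory audit trail; a real one would be persisted somewhere durable."""
    entries: list[dict] = field(default_factory=list)

    def record(self, agent: str, action: str, allowed: bool) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "allowed": allowed,
        })


def authorize(policy: AgentPolicy, action: str, log: AuditLog) -> bool:
    """Allow only explicitly scoped actions, and record every decision."""
    allowed = action in policy.allowed_actions
    if action == "open_pr" and policy.draft_only:
        allowed = False  # the agent must use open_draft_pr instead
    log.record(policy.name, action, allowed)
    return allowed


log = AuditLog()
librarian = AgentPolicy("docs-librarian", frozenset({"read_repo", "open_draft_pr"}))
print(authorize(librarian, "open_draft_pr", log))  # True
print(authorize(librarian, "push_to_main", log))   # False
print(len(log.entries))                            # 2
```

The point of the sketch is that denied requests are logged as well as granted ones. When evaluating a background agent, it is worth checking whether its audit trail does the same.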

4 · Models & APIs. Honest take in 2026: the top three or four providers are all good enough for most agentic SDLC use cases. The differences that show up in vendor benchmarks rarely show up in your workflow. Pick on price, latency, ergonomics, and ecosystem (SDK quality, tool-use support, structured outputs). Switch if you find a real workflow difference.

5 · MCP servers. The connective tissue of the ecosystem. The decision for most teams is whether to write your own for internal tools. The answer is almost always yes — usually a few hours of work, and the result is reusable across every agent. The investment compounds.

6 · Sandboxes. Isolated execution. The decision is startup speed for interactive workflows, network controls, and cost. If you're running agents on developer laptops only, your sandbox is the developer's container. If you're running them in CI or production, you'll want something more deliberate.

7 · Eval & observability. Younger category. Many teams build their own minimal version (logging trajectories to object storage, querying with a notebook) before adopting a tool. That's reasonable. The category will mature.
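A homemade version can be very small. The sketch below assumes one JSON object per line in a local file standing in for object storage; the record fields (`run_id`, `step`, `tool`, `ok`, `tokens`) and the function names are illustrative choices, not a standard schema.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def log_step(path: Path, run_id: str, step: int, tool: str, ok: bool, tokens: int) -> None:
    """Append one trajectory step as a single JSON line."""
    record = {"run_id": run_id, "step": step, "tool": tool, "ok": ok, "tokens": tokens}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_steps(path: Path) -> list[dict]:
    """Read every logged step back; this is the part you would do in a notebook."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def failures_by_tool(steps: list[dict]) -> Counter:
    """Count failed steps per tool to see where runs go wrong."""
    return Counter(s["tool"] for s in steps if not s["ok"])


# Made-up example data, written to a temporary directory.
log_path = Path(tempfile.mkdtemp()) / "trajectories.jsonl"
log_step(log_path, "run-001", 1, "read_file", True, 420)
log_step(log_path, "run-001", 2, "run_tests", False, 1310)
log_step(log_path, "run-002", 1, "run_tests", False, 980)

steps = load_steps(log_path)
print(failures_by_tool(steps))          # Counter({'run_tests': 2})
print(sum(s["tokens"] for s in steps))  # 2710
```

An append-only file of independent lines is easy to ship to object storage later and easy to load into whatever analysis tool the team already uses.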

8 · Governance & safety. Important for regulated industries, larger organisations, and any team where agents touch production. For smaller teams, the answer is often "we'll handle it with existing tools." Be deliberate before adopting a dedicated product.

Evaluation criteria that matter

When evaluating any agentic tool, ask in this order:

Does it do what we need on our actual workflows? Run a one-week trial on real work. Vendor demos are unreliable. Your own workflows are the only ground truth.
How well does it integrate with what we already have? The best tool that doesn't fit your stack loses to a worse tool that does.
What's the failure mode? When it goes wrong, how does it go wrong? Quietly producing bad output is worse than loudly producing none.
Can we leave? If the vendor doubles prices or gets acquired, what's the migration story? Lock-in matters more in this category than people assume.
What's the actual cost? Not the sticker price — the total: tokens, infrastructure, engineering time to operate.
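The gap between sticker price and total cost is easy to see with rough arithmetic. The sketch below is illustrative only: every figure is a made-up placeholder, and `MonthlyCost` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCost:
    """Hypothetical monthly cost model for one agentic tool."""
    seats: int
    seat_price: float          # sticker price per seat
    tokens_millions: float     # model tokens consumed, in millions
    price_per_million: float   # blended price per million tokens
    infrastructure: float      # sandboxes, CI minutes, storage
    operating_hours: float     # engineering time spent operating the tool
    hourly_rate: float         # loaded cost of that engineering time

    def sticker(self) -> float:
        return self.seats * self.seat_price

    def total(self) -> float:
        return (
            self.sticker()
            + self.tokens_millions * self.price_per_million
            + self.infrastructure
            + self.operating_hours * self.hourly_rate
        )


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
example = MonthlyCost(
    seats=10, seat_price=20.0,
    tokens_millions=300.0, price_per_million=5.0,
    infrastructure=200.0,
    operating_hours=16.0, hourly_rate=100.0,
)
print(example.sticker())  # 200.0
print(example.total())    # 3500.0
```

With these placeholder inputs the sticker price is a small fraction of the total. Your own ratio will differ, which is the reason to compute it rather than assume it.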

Criteria that matter less than vendors say

Benchmark scores. Run your own. Vendor benchmarks are optimised for vendor benchmarks.
"Powered by [model]." The harness matters more than the model name.
Number of integrations. A hundred mediocre integrations beat five good ones in the brochure and lose to them in real work.
Feature breadth. Tools that do one thing well outperform tools that do five things acceptably.
"Used by [big company]." Useful as a sanity check, not as a recommendation. The big company is solving different problems.
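"Run your own" can start as a very small harness: a list of tasks drawn from your real work, each with a check you trust, and a pass rate. The sketch below assumes the agent can be wrapped as a plain function from prompt to output; `pass_rate`, `stub_agent`, and the toy tasks are hypothetical stand-ins.

```python
from typing import Callable

# A task is a prompt plus a check that decides whether the output is acceptable.
Task = tuple[str, Callable[[str], bool]]


def pass_rate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Run every task through the agent and return the fraction that pass."""
    passed = 0
    for prompt, check in tasks:
        try:
            if check(agent(prompt)):
                passed += 1
        except Exception:
            pass  # a crash counts as a failure, not as a skipped task
    return passed / len(tasks)


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call; it only upper-cases the prompt."""
    return prompt.upper()


# Toy tasks. Real ones would come from your backlog, with real checks
# such as "the test suite passes" or "the diff touches only these files".
tasks: list[Task] = [
    ("rename foo to bar", lambda out: "BAR" in out),
    ("add a null check", lambda out: "NULL" in out),
    ("write a test", lambda out: "pytest" in out),
]

print(f"{pass_rate(stub_agent, tasks):.2f}")  # 0.67
```

Keeping the task list fixed between trials is what makes two tools comparable. The checks matter more than the harness.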

What a small team almost certainly needs

For a 5–20 engineer team doing serious agentic SDLC in 2026, the minimum stack:

An IDE or CLI agent that fits your workflow (category 1 or 2)
An API provider for the underlying model (category 4)
A way to log trajectories and metrics, even if it's homemade (category 7)
A project prompt and decision log (not tools, but the artifacts the tools operate on)

That's the floor. Many strong teams run on exactly this for a year before adding anything else.

What you might need

Background agents in CI (category 3) — once existing workflows are healthy and you have specific recurring tasks to automate
Custom MCP servers (category 5) — once you have internal tools the agent should reach
Sandboxing (category 6) — once agents run anywhere other than developer laptops
Dedicated eval tooling (category 7) — once your eval needs outgrow notebooks

What you probably don't need yet

Dedicated governance platforms (category 8), unless you're in a regulated industry or large enough to need them
Multi-agent orchestration frameworks. Most teams hit the wall described in Ch. 09 — orchestration adds complexity faster than it adds value
Fine-tuned in-house models. For agentic SDLC, frontier models from major providers are nearly always better; fine-tuning small models is rarely worth the operational cost
"AI engineering platforms" that bundle everything. The all-in-one pitch sounds appealing; the all-in-one product usually does each thing worse than the best-of-breed option

The "what would we lose" test. For any tool you're using or considering, ask "what would we lose if we stopped using this tomorrow?" If the answer is "convenience and a few hours of work," the tool is replaceable. If the answer is "we'd have to rebuild our entire workflow," you have lock-in worth being aware of. The honest answers, written down, change purchasing decisions.

Questions to ask before paying

Before signing up for any paid tool in this space:

"Can we run a one-week trial on real work?" If no, walk away.
"What does the trajectory log look like?" If you can't see what the tool does, you can't debug it later.
"How is data handled?" Your codebase goes into the tool. Where does it go after? Is it retained? Used for training?
"What's the upgrade story?" When the model behind the tool changes, what changes for you?
"Who else has used this?" Talk to someone other than the vendor's reference customer.

The two patterns that consistently disappoint

The "AI engineer" platform

A pitch that the platform will be a full team member: taking tickets, writing code, reviewing PRs, deploying. The demo is impressive. In production, the failure modes are systemic: an autonomous agent making the wrong product call at step three means everything after step three is wrong, and the audit trail is hard to reconstruct. Teams that adopt these usually end up with a constrained subset of the original promise, with humans inserted at each step.

What to do instead: adopt the constrained subset directly. Use the tool as a reviewer, fixer, or investigator. Skip the "autonomous engineer" framing.

The "no prompts needed" platform

"Our AI just understands what you want." It usually doesn't. The prompts are inside the platform, hidden from you, tuned for the average case. Your team isn't average; your codebase isn't average; the hidden prompts are leaving value on the table.

What to do instead: choose tools that expose their prompts and let you customise them. The project-prompt patterns from Ch. 04 only work when the tool lets you write them.

A note on cost

The model API bill is the cost most teams notice. There are usually two costs they don't:

The cost of switching. Each time you move between IDE agents, CLI agents, or model providers, the team loses a week to relearning. Budget for it.
The cost of not having a senior engineer own this. Tools change weekly. Someone needs to track that, evaluate, and steer. If no one owns it, the team lands on whatever tool was hot on Twitter last Tuesday — which is almost never the best choice.

Practice — before you read the next chapter

If you're early

Map your current stack to the eight categories. Note which categories you're using and which you're not. Don't add anything yet — just know where you are.
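If it helps to make the mapping concrete, it fits in a dictionary. The category names below are the eight from this chapter; the entries in `current_stack` are placeholders to replace with your own tools, and the function names are hypothetical. The floor check covers only the tool categories from the minimum stack, not the project prompt and decision log.

```python
CATEGORIES = {
    1: "IDE / editor agents",
    2: "CLI agents",
    3: "Background agents",
    4: "Models & APIs",
    5: "MCP servers",
    6: "Sandboxes",
    7: "Eval & observability",
    8: "Governance & safety",
}

# Placeholder entries; list what your team actually uses under each number.
current_stack: dict[int, list[str]] = {
    1: ["editor agent (placeholder)"],
    4: ["model API (placeholder)"],
}


def unused_categories(stack: dict[int, list[str]]) -> list[str]:
    """Categories with nothing in them. Not a to-do list, just a map."""
    return [name for number, name in CATEGORIES.items() if not stack.get(number)]


def floor_gaps(stack: dict[int, list[str]]) -> list[str]:
    """Tool categories from the minimum stack that are still empty."""
    gaps = []
    if not (stack.get(1) or stack.get(2)):
        gaps.append("IDE or CLI agent (category 1 or 2)")
    if not stack.get(4):
        gaps.append("model API provider (category 4)")
    if not stack.get(7):
        gaps.append("trajectory and metrics logging (category 7)")
    return gaps


print(unused_categories(current_stack))
print(floor_gaps(current_stack))
```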

If you're mature

For each tool you currently pay for, run the "what would we lose" test. Honest answers, written down. Decide which renewals to question at the next cycle.

If you lead a team

Assign someone (might be you) to own "tooling situational awareness" — knowing what's good, what's emerging, what's worth evaluating. The role takes maybe two hours a month done well; it pays for itself many times over.

Takeaways

Eight categories, three layers: user-facing, infrastructure, meta. Most teams need user-facing tools and a model API; the rest are stage-dependent.
Evaluate on your own workflows, not vendor benchmarks. A one-week trial on real work is the only reliable signal.
The harness matters more than the model. "Powered by X" is rarely the deciding factor.
Lock-in matters. The "what would we lose" test should inform purchasing.
The "AI engineer" and "no prompts needed" pitches consistently underdeliver. Constrained, prompt-exposed tools win.
Someone needs to own situational awareness for tooling. The market moves weekly.

Next chapter: Capstone — a real-shaped feature built end-to-end with an agent in the loop. Every prior chapter's lessons composed into one workflow, with the time and cost accounting that tells you what to expect.
