Every engineering organisation that tries to ship features with agents hits the same wall. Someone writes what feels like a clear spec, hands it to the agent, and gets back something that compiles, passes tests, and isn't what was wanted. The wall isn't the agent. It's that the specs we write for humans were never meant to be executed — they were meant to be interpreted by someone who already knows the codebase, the team, and what "make it look professional" usually means around here.
This chapter is about the kind of spec that an agent can execute, the discipline it requires from product and engineering, and the organisational shift that comes with adopting it. The spec is where most of the value of agentic SDLC is captured or thrown away.
Consider a typical bullet from a product spec: "Users should be able to easily filter the dashboard by date range, with sensible defaults."
To a human engineer who knows the product, this is almost enough. They know which dashboard. They know what "easily" means in your design system. They know "sensible defaults" means last 30 days because every other view does that. They know the dropdown belongs in the filter bar, not the header. They know it should persist in URL params because all your filters do.
To an agent, the same bullet is six unmade decisions. The agent will make all six. Some will be right. Some will be wrong. You won't know which until you read the code — and at that point you've spent more time reading than you would have spent writing the spec properly in the first place.
The principle: an agent-ready spec makes the implicit explicit. Not all implicit knowledge — codebase conventions are still inherited from the code itself — but the implicit product decisions, the ones a fresh contributor would have to ask about over coffee.
What problem is being solved, for whom, and why now. Two or three sentences. The agent uses this to break ties when the behavior section is silent — and the behavior section is always silent on something. Without context, ties get broken arbitrarily.
What the system should do, concretely. Bullets are fine. Each bullet should be something you could write a test for. "The dashboard should be fast" is not a behavior bullet. "Initial render under 200ms on 10K rows" is.
The checklist that makes the task done. Phrased as observable outcomes. The agent will use these as its definition of success and as its self-check before declaring the work complete. If reviewing this list takes you longer than reading the diff, the criteria are too vague.
The single most-skipped section, and the source of more wasted agent time than any other. If you don't say "don't add a settings page for this," you might get a settings page. If you don't say "don't migrate the existing data," you might get a migration. Both are reasonable interpretations of "make this work properly." Both might be wrong.
Files to read, patterns to follow, examples of similar features already in the codebase. The agent will read your codebase regardless, but pointers shortcut the search and bias it toward the right patterns. Think of this as a curated reading list, not exhaustive references.
The deepest pitfall in spec writing is criteria that can be satisfied by code doing the wrong thing. Three examples:
| Ambiguous | Concrete |
|---|---|
| "The new endpoint returns user data." | "GET /api/users/:id returns JSON {id, email, name, createdAt} for a valid id, 404 for missing, 401 for unauthenticated." |
| "Errors are handled gracefully." | "Network errors show a toast with retry; validation errors display inline next to the field; auth errors redirect to /login." |
| "The query is performant." | "On a database with 1M users, the endpoint returns p95 under 100ms. A test asserts this on the staging dataset." |
Pattern: turn adverbs (gracefully, performant, easily) into observable thresholds. If you can't write a test for the criterion, neither can the agent.
The adverb test. Scan acceptance criteria for adverbs ending in "-ly." Gracefully, properly, sensibly, cleanly, efficiently. Each one is a smell. Replace with the observable behavior the adverb is gesturing at. Apply this discipline to the spec template your team uses and you'll find half of it disappears in the rewrite.
To make this concrete, here is the structural skeleton — not the prose — of an agent-ready spec for "add CSV export to reports":
# Add CSV Export to Reports
## Context
[2-3 sentences on who asked, why now, and what failure mode this fixes.]
## Behavior
- Where the action appears, who sees it, what it does.
- The async/email path for large exports.
- How filters and column ordering carry over from the view.
## Acceptance criteria
- For typical-size reports, observable performance threshold.
- For large reports, observable async behavior with bounded delivery time.
- For free-tier users, the feature is absent (tested).
- For failed exports, observable error UX (no 500s, no silent failures).
## Out of scope
- Excel format. CSV only.
- Recurring schedules. Separate feature.
- Custom column selection. Uses what's on screen.
- Free-tier exposure. Pro only; no upsell UX.
## Pointers
- Existing CSV generator lives at [path].
- Async pattern to follow: [existing similar feature].
- Background job system: [name], with example at [path].
That structure, filled in, takes about 25 minutes to write. It saves substantially more than 25 minutes on the back end. The numbers, observed across teams, look roughly like this:
| Path | Spec | Agent run | Review & rework | Total |
|---|---|---|---|---|
| Bad spec + agent | 5 min | 30 min | 60 min | 95 min |
| Good spec + agent | 25 min | 20 min | 15 min | 60 min |
| No agent, you build | — | 90 min | — | 90 min |
These numbers are illustrative. The pattern is robust: good spec plus agent beats bad spec plus agent, often beats no agent at all. And the spec doubles as documentation that survives the feature.
An obvious move: use the agent to help write the spec. This works, with one rule — don't let the agent write the context or out-of-scope sections. Those are product decisions only the human (with domain context) can make. But behavior bullets, acceptance criteria, and pointer hunting are excellent first-draft work for an agent.
The collaboration pattern that holds up: human writes context and constraints; agent drafts the executable parts after reading the codebase; human reviews and tightens. This is one of the most reliable patterns we'll see throughout the series and it generalises beyond specs.
Adopting agent-ready specs isn't a tooling change. It's a discipline change, and it lands harder on product than on engineering. Three shifts worth naming explicitly:
Product writes more. The implicit knowledge that used to live in slack threads and quick desk conversations now has to live in the spec. PMs and product engineers will resist; the resistance is usually rational (it's more work) and short-sighted (it saves more downstream).
Engineering review of specs becomes a real activity. A spec that's ambiguous-but-shippable for a senior engineer is unworkable for an agent. Engineering has to push back on vague specs before work starts, not during code review. This is a habit shift; some teams find it harder than the tool adoption.
"It depends" gets pushed earlier. The agent will not stop to ask "what about case X?" the way a senior engineer might. Either case X is specified, or it gets handled however the agent guesses. Hard product questions surface during spec writing, where they're cheap, instead of during review, where they're not.
Teams that already have spec discipline find this easy. Teams that don't will find that agentic SDLC exposes a weakness they had been carrying for years.
Take a feature you've shipped in the last six months. Write the spec you wish you'd had — using the five sections from this chapter. Then ask yourself: which sections did the original spec have, which did it lack, and where did the missing sections cause rework?
Pick an upcoming feature on your team's backlog. Write two versions of the spec — a "normal" one and an agent-ready one. Hand both to an agent in separate sessions and compare what comes back. Most engineers find the second one produces a usable diff on the first run; the first rarely does.
Pull the last ten product specs your team received. For each, count the adverbs in the acceptance criteria. Bring the count to your next product partnership conversation. This is not blame; it's data about a process that benefits from explicit attention.
Next chapter: Design — architecture decisions that change when an agent will maintain the code. The patterns that age well over months of agentic maintenance, and the ones that quietly cost you when no human revisits the code anymore.
Sign in to join the discussion and post comments.
Sign in