Every prior chapter has shown a piece of the practice. This chapter shows them composed. You'll walk through a full feature, from "we want this" to "it's shipped," with an agent in the loop the whole way. Specs, design, build, tests, review, deploy, observe. The point isn't the feature — it's the rhythm and the decisions made at each stage. By the end you should be able to picture yourself in the same workflow on your own work next week, and you should have a realistic sense of where the wins are and what they cost.
This is the longest chapter in the series. Take it slowly. The decisions in each stage build on the ones before.
We're building "Scheduled Exports" for a hypothetical SaaS analytics product. Customers can already run one-off CSV exports of their reports. The new feature lets them schedule exports — daily, weekly, monthly — delivered to email or an S3 bucket. The feature touches the database, the UI, background jobs, email, and external integrations. Not toy-sized; not enterprise-epic-sized. The kind of feature most teams ship in a sprint or two.
We'll follow it through ten stages. Each stage maps to one or two prior chapters; the cross-references are explicit so you can jump back if a step is unclear.
The PM brings a one-paragraph product spec. Following Ch. 02, we don't hand it to the agent as-is. We rewrite it into the five-section format, with the agent helping draft the executable parts.
The behavior section ends up listing: where the action appears in the UI, who sees it (Pro tier only), the schedule options (daily/weekly/monthly), the two destination types (email and S3), what happens with the customer's time zone, and the catch-up behavior when the system was down through a scheduled run.
The acceptance criteria are concrete and measurable: a daily schedule fires within its window for a typical user; large exports use the existing async pathway; deletion is reflected in the UI within five seconds; S3 delivery is exercised end-to-end against localstack; p95 delivery latency stays under 60 seconds for email and 5 minutes for S3; and free-tier users see no UI for this feature.
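A minimal sketch of how the latency criterion could be turned into a check. The thresholds (60 seconds for email, 300 seconds for S3) come from the criteria above; the function names, the nearest-rank percentile definition, and the idea of feeding it a list of measured latencies are assumptions made for illustration.

```typescript
type Destination = "email" | "s3";

// Thresholds from the acceptance criteria, in seconds.
const P95_LIMIT_SECONDS: Record<Destination, number> = { email: 60, s3: 300 };

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Integer arithmetic before the division avoids floating-point surprises.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// True when the measured p95 is strictly under the limit for the destination.
function meetsLatencyCriterion(dest: Destination, latenciesSeconds: number[]): boolean {
  return percentile(latenciesSeconds, 95) < P95_LIMIT_SECONDS[dest];
}
```

A load test can collect one latency per simulated delivery and assert `meetsLatencyCriterion` at the end, which keeps the threshold in one place rather than scattered across tests.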
The out-of-scope section ends up being the longest part of the conversation, and the most valuable. We explicitly exclude Excel format, custom column selection, multiple destinations per schedule, sub-daily schedules, and compression. Each exclusion forecloses a class of agent drift later. In our experience, this is where most of the spec's value comes from.
Pointers list the relevant existing modules: the one-off CSV export, the inngest pattern from a similar scheduled job, the S3 integration with its assume-role auth pattern, and the schedule-picker component already in use elsewhere.
Time spent: forty minutes, of which fifteen were the PM and engineer agreeing on out-of-scope. The agent drafted behavior and acceptance criteria; the human wrote context and out-of-scope; the pointer section came from a quick agent scout pass.
Following Ch. 09's scout pattern, we send an investigator agent to map what's already there before writing any code. The scout reads the one-off export module, the inngest patterns, the S3 integration, and the schedule-component pattern, then reports.
The scout's findings shape the rest of the work: the CSV generator is reusable as-is; the existing inngest "cron trigger" feature can replace the custom scheduler the design instinctively reached for; the schedule-picker is reusable. New work is limited to a persistent table, a worker function, the email/S3 dispatch logic, and the missed-run catch-up logic.
Critically, the scout surfaces a question the spec missed: the existing one-off export emails from a generic "noreply" address. Scheduled exports could use the same — or use a per-customer "from" address, which opens a separate design conversation. We add a one-line decision to the spec ("same noreply for v1") rather than letting the agent choose by default.
Total time on the scout: twenty minutes of agent work, ten minutes of human review. The investment buys us a much sharper plan in the next stage.
We use the dual-prompt pattern from Ch. 09. One agent proposes an architecture; another critiques it; the human picks.
The proposer's draft has three components: a new scheduled_exports table, an inngest cron function that polls it, and a worker function that runs the export and dispatches to email or S3.
The critic finds three weaknesses. The most important: the proposed approach has the cron function dispatching exports synchronously, meaning a slow run blocks the cron function. The critic proposes decoupling: cron emits "export due" events, and a separate worker consumes them. This matches the existing inngest pattern and gives a cleaner separation.
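The decoupling the critic proposes can be sketched without any framework. This is not the inngest API: the event name `export/due`, the `Schedule` shape, and the `emit` and `runExport` callbacks are hypothetical stand-ins, chosen only to show that the cron side does no export work.

```typescript
interface Schedule {
  id: string;
  nextRunAt: number; // epoch milliseconds
}

interface ExportDueEvent {
  name: "export/due"; // hypothetical event name
  scheduleId: string;
  dueAt: number;
}

// Cron side: find due schedules and emit one event each. No export runs
// here, so a slow export cannot block the tick.
function emitDueExports(
  schedules: Schedule[],
  now: number,
  emit: (event: ExportDueEvent) => void,
): number {
  let emitted = 0;
  for (const s of schedules) {
    if (s.nextRunAt <= now) {
      emit({ name: "export/due", scheduleId: s.id, dueAt: s.nextRunAt });
      emitted++;
    }
  }
  return emitted;
}

// Worker side: consumes one event at a time, independently of the cron tick.
async function handleExportDue(
  event: ExportDueEvent,
  runExport: (scheduleId: string) => Promise<void>,
): Promise<void> {
  await runExport(event.scheduleId);
}
```

With this split, the tick's duration depends only on how many schedules are due, not on how long any single export takes.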
The human accepts the critic's revision. The plan crystallises into seven PRs, each with a clear seam:
Each PR is independently reviewable. PRs 1 and 2 ship to staging before any UI work, validating the API shape before building UI on top. Time for design and planning, including the dual-prompt iteration: forty-five minutes.
Now we work through the seven PRs. Most go cleanly. Two have meaningful detours worth naming, because they show where humans stay essential.
PR 4 (the cron and event): the agent proposed using a simpler "tick every minute" implementation rather than inngest's cron trigger, citing testability. The human pushed back — using the existing inngest pattern matters more than testability of a single component. The agent agreed, switched approach, and the PR shipped clean. The lesson from Ch. 04: agents sometimes propose technically valid alternatives that don't fit the codebase's conventions. "Obvious naming" and "match existing patterns" from Ch. 03 are why the pushback was right.
PR 6 (S3 delivery): the first agent run produced tests that mocked the S3 client and asserted that the right method was called — the over-mocked pattern from Ch. 05. The human asked for tests using localstack instead. The agent rewrote them. The lesson: even in a workflow where you've explicitly avoided weak test patterns, they show up; review every test, not just the code.
The other five PRs go through delegate mode (Ch. 04) with stop-budgets and clean reviews. Average build time per PR: about half what a comparable human-only implementation would take.
Tests have been written alongside each PR. At the end of the build, we step back and check: do the acceptance criteria from Stage 1 have corresponding tests? We use Ch. 05's "what does this test break on?" check on each.
Two acceptance criteria have no clear test. The "missed runs re-run within 4 hours" criterion has no test for the catch-up behavior. The latency criterion has no test that asserts the threshold. We add both — an integration test that simulates the scheduler being down for a window and asserts catch-up, and a load test that fires a thousand simulated schedules and measures latency.
Both were possible to ship without; both would have been bugs waiting to happen. Time for this stage: an hour, including the new tests. The exercise of mapping criteria to tests is the discipline; the tests themselves take care of the rest.
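For the catch-up test, it helps to have a pure function to assert against. The sketch below assumes one particular policy that the text does not spell out: all runs missed during an outage collapse into a single catch-up run, rather than one run per missed slot. It also assumes fixed-length intervals, which ignores time zones and monthly schedules. The names `planCatchUp` and `CatchUpPlan` are hypothetical.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface CatchUpPlan {
  runNow: boolean;     // whether a single catch-up run should fire
  missedCount: number; // scheduled slots at or before `now`
  nextRunAt: number;   // first slot strictly after `now`
}

// Assumed policy: collapse every missed slot into one run, then skip ahead
// to the next future slot.
function planCatchUp(nextRunAt: number, intervalMs: number, now: number): CatchUpPlan {
  if (intervalMs <= 0) throw new Error("intervalMs must be positive");
  if (now < nextRunAt) {
    return { runNow: false, missedCount: 0, nextRunAt };
  }
  // Count the slot at nextRunAt plus every later slot up to and including now.
  const missedCount = Math.floor((now - nextRunAt) / intervalMs) + 1;
  return {
    runNow: true,
    missedCount,
    nextRunAt: nextRunAt + missedCount * intervalMs,
  };
}
```

An integration test can then simulate an outage by advancing `now` past several slots and asserting that exactly one run is planned and that `nextRunAt` lands in the future.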
Review-bot (Ch. 06) runs on every PR. Across the seven PRs, review-bot catches: missing JSDoc on three public functions; two opportunities for shared types between API request and model layer (we adopt one, reject the other as premature DRY per Ch. 03); a subtle race condition in the cron function around "is this schedule already running" (real find, fixed with a database-level lock); and various style nits the formatter auto-fixed.
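The text says only that the race was fixed with a database-level lock. One common way to do that is a lease column claimed through a conditional update; the sketch below models that idea in memory. The `locked_until` column, the SQL in the comment, and the `tryClaim` name are assumptions, not the project's actual schema.

```typescript
interface ScheduleRow {
  id: string;
  lockedUntil: number | null; // epoch milliseconds, null when unclaimed
}

// In-memory stand-in for a conditional UPDATE such as:
//   UPDATE scheduled_exports SET locked_until = :until
//   WHERE id = :id AND (locked_until IS NULL OR locked_until < :now)
// A database applies the check and the write atomically, so only one caller
// sees "1 row updated" and proceeds.
function tryClaim(
  rows: Map<string, ScheduleRow>,
  id: string,
  now: number,
  leaseMs: number,
): boolean {
  const row = rows.get(id);
  if (!row) return false;
  // An unexpired lease means another run already holds the schedule.
  if (row.lockedUntil !== null && row.lockedUntil >= now) return false;
  row.lockedUntil = now + leaseMs;
  return true;
}
```

The lease expiry matters: if a worker crashes mid-run, the schedule becomes claimable again once the lease lapses instead of staying locked forever.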
Human review focuses on what review-bot couldn't catch:
cancelSchedule vs. deleteSchedule. We're "deleting" not "canceling." Renamed.
The split between bot review and human review is the lesson. The bot catches the systematic issues; the human catches the judgment calls. Neither can replace the other.
Following Ch. 06's gates framework, the actions and their gates are:
| Action | Gate |
|---|---|
| Merge to main | Standard PR approval (no agent gate added) |
| Deploy to staging | Automatic via existing CI |
| Deploy to production | Human approval per existing policy |
| Enable feature flag for users | Explicit human toggle, staged rollout |
We deploy to staging on Tuesday. Run synthetic schedules for two days. Roll out to 5% of Pro users on Thursday. To 50% Friday. To 100% the following Tuesday after the weekend goes clean.
Nothing exotic here — this is the same staged-rollout discipline you'd apply without agents. The point is that agents don't change deployment hygiene; if anything they reinforce why you want it.
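The percentage steps (5%, 50%, 100%) are usually implemented with deterministic bucketing, so a user who is in at 5% stays in at 50%. The sketch below assumes a hash-based bucket per user and flag; the flag name `scheduled-exports`, the function names, and the choice of an FNV-1a-style hash are illustrative, not the product's actual feature-flag system.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Any stable hash works;
// this one just avoids a dependency.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// A user falls into a fixed bucket 0..99 per flag. Raising `percent` only
// ever adds users; nobody who had the feature loses it mid-rollout.
function isInRollout(userId: string, flag: string, percent: number): boolean {
  if (percent <= 0) return false;
  if (percent >= 100) return true;
  return fnv1a(`${flag}:${userId}`) % 100 < percent;
}
```

Including the flag name in the hashed key keeps the early cohort from being the same users for every feature.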
Following Ch. 07, the four pillars are in place. In the first week post-launch, observability surfaces two surprises.
One customer has 1,200 daily schedules. We expected a handful per customer. They turn out to be a power user running per-team schedules. Performance handles it, but the cron function is now processing more than we modeled. We add an alert on schedules-per-tenant exceeding a threshold.
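The alert condition itself is small. A minimal sketch, assuming the input is one tenant ID per schedule row and that the threshold is a number the team picks; the function name is hypothetical and the text does not say what threshold was chosen.

```typescript
// Returns each tenant whose schedule count exceeds the threshold, with the count.
function tenantsOverThreshold(
  scheduleTenantIds: string[],
  threshold: number,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tenantId of scheduleTenantIds) {
    counts.set(tenantId, (counts.get(tenantId) ?? 0) + 1);
  }
  const over = new Map<string, number>();
  counts.forEach((count, tenantId) => {
    if (count > threshold) over.set(tenantId, count);
  });
  return over;
}
```

In production this would more likely be a grouped query or a metric with a tenant label; the point is that the alert keys on the per-tenant count, not the global one.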
S3 delivery latency p95 is 4 minutes, near our threshold. Tracing shows the bottleneck is the IAM role assume call happening on every delivery. We add a 15-minute cache for assumed credentials. Latency drops to 30 seconds p95. A decision log entry is written.
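A sketch of the credential cache, under stated assumptions: `assumeRole` is an injected function standing in for whatever performs the real role-assume call, and the `Credentials` shape and class name are hypothetical. The 15-minute TTL is from the text. Caching the promise rather than the resolved value also means concurrent deliveries for the same role share one call.

```typescript
interface Credentials {
  accessKeyId: string;
  secretAccessKey: string;
  sessionToken: string;
}

interface CacheEntry {
  promise: Promise<Credentials>;
  expiresAt: number; // epoch milliseconds
}

class CredentialCache {
  private entries = new Map<string, CacheEntry>();

  constructor(
    private readonly assumeRole: (roleArn: string) => Promise<Credentials>,
    private readonly ttlMs: number = 15 * 60 * 1000,
    private readonly now: () => number = () => Date.now(),
  ) {}

  get(roleArn: string): Promise<Credentials> {
    const current = this.now();
    const hit = this.entries.get(roleArn);
    if (hit && hit.expiresAt > current) return hit.promise;

    const promise = this.assumeRole(roleArn);
    this.entries.set(roleArn, { promise, expiresAt: current + this.ttlMs });
    // Do not keep a failed call cached; the next delivery should retry.
    promise.catch(() => {
      if (this.entries.get(roleArn)?.promise === promise) {
        this.entries.delete(roleArn);
      }
    });
    return promise;
  }
}
```

The TTL needs to stay comfortably below the lifetime of the assumed-role session, otherwise the cache can hand out credentials that have already expired.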
Both surprises were findable only because observability was in place from day one. Adding it after the first incident is always too late; the data you wanted doesn't exist yet.
Three months later, a bug report: scheduled exports for the "Top Products" report stopped working last Tuesday. Following Ch. 08's archaeology workflow:
Total time: forty minutes, of which thirty were investigation and ten were the fix. Without the decision log, changelog, and external memory built up over months, the same investigation would have taken hours.
Across the whole project, we write decision-log entries for: using inngest cron triggers, decoupling cron from worker, the catch-up semantics, the IAM credential cache, and the per-report end-to-end export tests. Five entries, five lines each. Future-us (or future-agent) reads them next time someone touches this area and saves hours.
An honest estimate of how this feature compares to the same work done without agents:
| Stage | Without agents | With agents |
|---|---|---|
| Spec | 1.5 hr | 40 min |
| Scout | 3 hr (manual reading) | 30 min |
| Design + plan | 2 hr | 45 min |
| Build (7 PRs) | 4.5 days | 2 days |
| Tests | 1 day | 4 hr |
| Review | 3 hr | 2 hr |
| Deploy | 4 hr | 4 hr |
| Observe + tune | 4 hr | 3 hr |
| Total | ~8 working days | ~4 working days |
Roughly half the time. The deploy stage is unchanged because that's mostly waiting for staged-rollout windows. The biggest gains are in build, scout, and spec — the parts where the agent's speed at reading, writing, and pattern-matching pays off most directly.
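The totals can be checked by converting every row to hours. The only assumption here is an 8-hour working day; the per-stage figures are the ones in the table.

```typescript
const HOURS_PER_DAY = 8; // assumption: one working day is 8 hours

const hours = (n: number): number => n;
const minutes = (n: number): number => n / 60;
const days = (n: number): number => n * HOURS_PER_DAY;

// Per-stage cost in hours, in table order: spec, scout, design + plan,
// build, tests, review, deploy, observe + tune.
const withoutAgentsHours = [
  hours(1.5), hours(3), hours(2), days(4.5), days(1), hours(3), hours(4), hours(4),
];
const withAgentsHours = [
  minutes(40), minutes(30), minutes(45), days(2), hours(4), hours(2), hours(4), hours(3),
];

function totalWorkingDays(stageHours: number[]): number {
  const total = stageHours.reduce((sum, h) => sum + h, 0);
  return total / HOURS_PER_DAY;
}
```

Under that assumption the sums come to about 7.7 and 3.9 working days, which is where the rounded "~8" and "~4" in the table come from.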
Token cost across the project: about $40 in API charges. Compared to the engineering time saved, it doesn't move the needle. Don't optimise the cheap thing.
Looking back at the project as a whole:
Agent did the bulk of: initial code writing, test scaffolding, scouting unfamiliar parts of the codebase, drafting documentation, applying mechanical refactors.
Human did the bulk of: writing context and out-of-scope, making product decisions, catching "looks plausible but" mistakes, deciding when the agent had drifted from conventions, owning the decision log.
Where it broke down: the agent's first attempt at the S3 tests was over-mocked. The agent's first design proposal had cron synchronously calling the worker. The agent's catch-up logic had the "23 hours down = 22 retries" bug until a human asked the right question.
None of these breakdowns were catastrophic. All would have shipped bugs without human review. The combined workflow is robust because each layer catches what the previous layer missed.
The feature ships. The decision log records the choices. The project prompt grows by a few lines. The team's pattern library has one more example to draw on next time. Six months from now, when someone builds a similar feature, the agent reads the scheduled-exports module, finds the decision log entries, and proposes the right shape on the first pass.
That compounding is what the practice is for. Every well-built feature makes the next one easier. The investment in spec discipline, decision logs, project prompts, and observability pays off not just on the current feature but on every future feature that touches the same areas. The practice gets stronger the longer you do it.
Pick a small feature on your team's backlog. Write the agent-ready spec following Stage 1's format. Don't write any code yet — just have the spec reviewed by a teammate. Note what conversations the spec forced that wouldn't have happened otherwise.
Build a small, real feature using the workflow in this chapter — at least Stages 1 through 6. Keep notes on time spent per stage. Compare to your gut feel before starting. The deltas teach you where your intuitions are calibrated and where they aren't.
Build a more substantial feature using all ten stages. Maintain the decision log throughout. After it ships, hand the decision log to a teammate who didn't work on the feature and ask them to summarise the project. The quality of their summary tells you how good your external memory practice is.
Run a retro on a recent feature your team shipped. Map it to the stages in this chapter. Which stages did you do well? Which did you skip? Which did agents help with, and which were entirely human? Pick one stage to improve next quarter.
You've reached the end. If you've worked the exercises through every chapter, you've built — or at least sketched — a full feature with an agent in the loop, audited your codebase for drift, set up review-bot, written a decision log, and refined a project prompt. That's most of the practice. The rest is the slow accumulation of taste — knowing which agent to reach for, when to interrupt, when to let it run, when to write the test yourself.
None of this is settled. The tools will change. Models will get better at some things and stay weak at others. New patterns will emerge; old ones will fade. The principles in this series — make the implicit explicit, gate the irreversible, observe the reasoning, write down what would otherwise be tacit — should outlast specific tools.
Go build something.
The most useful thing about working alongside an agent isn't the speed. It's that you finally have someone to argue with at 11 p.m. who has read your whole codebase. Treat that gift with seriousness, and the work gets better. Treat it casually, and it costs you. The choice, as always, is yours.