On this tutorial

Agentic SDLC: A Field Manual for Building Software with AI Agents

Foundations

Phases

Synthesis

Capstone

Capstone — a feature, end to end

Testing — when the agent writes the tests

Hand a task to an agent that involves writing tests, and you'll get tests. They'll be well-formatted. They'll have descriptive names. They'll pass. They might also test absolutely nothing of value — asserting that the code does what it does, looping implementation choices into the assertions, mocking out the very thing the test was meant to verify. The agent isn't trying to deceive you. It's pattern-matching against test files it has seen, and most test files are full of low-value tests too.

This chapter is about the testing discipline that catches the agent's specific failure modes before they ship, and the workflow that makes that discipline sustainable.

What you'll take away from this chapter

The four families of agent-generated test failures and how to spot them in seconds
The "test the spec, not the code" rule and how to enforce it in practice
Why mutation testing is suddenly worth running, and what changed
How to use the agent to review tests separately from writing them
The shape of the test pyramid that holds up best when agents write most of the tests

The four families of weak agent tests

Almost every weak test an agent writes falls into one of four patterns. Once you can name them, you can spot them in seconds.

Family 1 — Tautological tests

The test asserts that the code does what the code does. The setup defines the expected value; the function produces the same value because it's the function that produced the setup. A test for a "formatPrice" function that calls formatPrice twice and asserts the results are equal — that's the shape. The test cannot fail unless the function disappears entirely.

Family 2 — Implementation-coupled tests

The test asserts on the internal structure of a class or module — that a cache field is defined, that a private helper is called, that a key prefix is exactly "user:". Every refactor — even a beneficial one — breaks these tests. They constrain the implementation rather than the behavior, and they survive the regression suite by being too detailed to fail unless you happen to change something they're coincidentally checking.

Family 3 — Over-mocked tests

The test "verifies" a function by replacing everything inside it with mocks, then asserting the mocks were called. A test for "processOrder" that mocks the charge function and the email function, then asserts both were called — tells you the function calls its dependencies. It tells you nothing about whether the charge amount is correct, whether the receipt has the right contents, whether the order is marked as paid, or whether the email is sent after the charge succeeded (a real bug if reversed).

Family 4 — Fixture-coupled tests

The fixture defines the expected output of the function under test. Both were written at the same time; both will be updated together if the function changes. The test isn't checking a specification — it's checking that fixture and function stay synchronised, which they will, by your hand.

The unifying property of every weak test pattern: it passes regardless of correctness. A test that doesn't gate behavior isn't a test, it's noise.

The test-the-spec rule

If you remember one thing from this chapter: tests assert behavior described in the spec, not the code as written. The test should be writable from the spec alone, without looking at the implementation.

Practical consequence: when prompting an agent to write tests, give it the spec, not the implementation. If you don't have a spec, the test will end up coupled to the code by default, because that's all the agent has to go on. The pattern that works is: human supplies the spec, agent writes tests against it without reading the implementation file, human reviews both.

This is uncomfortable for teams that have never written specs explicit enough to be testable. The good news is that this discipline pays off twice — once at the spec phase (Ch. 02), once at the testing phase. The same artifact does both jobs.

The "what does this test break on?" check

The fastest way to evaluate a test is to ask: what change to the production code would break this test?

Good tests have an answer like "if we accidentally returned the gross instead of the net," or "if we charged before validating the card." The answer points at a real bug the test would catch.

Weak tests have answers like "if we changed the variable name," or "if we deleted this function entirely." The first is noise; the second is technically true but the test isn't earning its keep.

The fragility ratio. A test is healthy when it breaks more often from real bugs than from refactors. Track this in your head as you review. If you find yourself updating a test every time you refactor surrounding code, the test is over-coupled and worth rewriting.

Mutation testing earns its keep now

Mutation testing — running your test suite against deliberately-broken versions of the code, to see which broken versions the tests catch — was historically too slow to use widely. With agents in the loop, two things change. First, the agent can generate sensible mutations of your code, which is faster than the classic AST-based mutators. Second, the agent can run the test suite against each mutant and report which mutants survive — which is exactly the analysis you want.

The workflow: ask the agent to produce five small, plausible bugs an engineer might accidentally introduce in a given module, then run the test suite against each. Anything the suite doesn't catch is a gap in your assertions. Anything it does catch is evidence the test was earning its keep.

Two surviving mutants out of five is borderline. Three or more would be a smell. The point isn't a number; the point is that mutation testing surfaces gaps in your assertions that no other tool reliably finds. Use it on your critical modules every quarter. Skip the rest.

Using the agent to review tests, not just write them

The same pattern-matching that makes agents prone to writing weak tests makes them good at spotting weak tests when you ask explicitly. Treat test review as a separate task from test writing.

The prompt is simple: review the tests in this file; for each test, tell me what production-code change would cause it to fail; flag any tests where the answer is "no real bug would cause this to fail" or "only a structural refactor would break this."

What you get back is a per-test verdict — strong, weak, or somewhere in between, with the reason named. The output reads like a code review from a careful colleague who has time you don't.

This kind of review used to happen, badly, during code review of the PR itself — where reviewers might or might not have the patience to evaluate each test individually. Agents do it consistently. Worth running on any test file with more than a handful of tests, especially ones inherited from someone else.

The test pyramid, slightly tilted

The classic test pyramid — lots of unit tests, fewer integration tests, very few end-to-end tests — still holds, but with a small tilt now. Integration tests have become more valuable per test, because the failure modes they catch (a bad contract between two modules, a regression in a third-party library, a missing migration) are exactly the failures an agent-driven change is most likely to introduce.

Layer	Older guidance	2026 mix	Why the shift
Unit	~70%	~60%	Still the backbone — fast, focused on pure logic
Integration	~20%	~30%	Catches contract drift that agent diffs introduce
End-to-end	~10%	~10%	Brittle either way; only for critical user flows

The shift up in integration tests pays for itself the first time an agent's "looks fine in isolation" change breaks a contract between two modules.

Test-driven development, when the agent is the developer

TDD with an agent is genuinely different from TDD by hand. The agent has no muscle memory; the discipline of "test first" has to be enforced by the harness, not by the developer's habit. The pattern that works:

Human writes the test, expressing the behavior from the spec. The test fails because the implementation doesn't exist.
Agent writes the implementation with the instruction: "Make this test pass. Don't modify the test."
Human reviews both the implementation and any temptation the agent had to "fix" the test instead of the code.

Step 1 forces the spec to be concrete. Step 2 makes the agent's job narrow and unambiguous. The rule in step 3 prevents the most common agent shortcut: changing the test until it passes.

The "don't modify the test" rule belongs in your prompt explicitly. Without it, an agent will sometimes "fix" a test that asserts behavior it can't easily implement. With it, the agent is forced to actually implement the behavior — or come back and ask if the test is wrong. Either outcome is useful; silent test-rewriting is not.

Practice — before you read the next chapter

If you're new to this

Take a test file from your codebase that you trust. Pick three random tests in it. For each, write down — without looking — what production-code change would cause it to fail. Then look at the test. Were you right? The disconnect, when it appears, reveals tests that don't actually constrain what they claim to.

If you're using agents already

Pick a recent feature you shipped with agent help. Ask a fresh agent session to review the tests in that feature using the review prompt described above. Count: how many tests would you remove or rewrite given the review?

If you lead a team

Pick the test file with the highest line count in your codebase. It's probably also one of the least-trusted by your team. Run an agent review on it. Use the review as a basis for a focused cleanup session — usually 30–50% of tests in such files can be deleted with no loss of coverage, and the remaining tests get stronger.

Takeaways

Four families of weak agent tests: tautological, implementation-coupled, over-mocked, fixture-coupled. All share the property of passing regardless of correctness.
Test the spec, not the code. Prompt the agent with the spec; withhold the implementation.
For each test, ask: what production-code change would break this? If the answer points at a real bug, the test earns its keep.
Mutation testing is finally cheap enough to run on critical modules quarterly. Do it.
Have the agent review tests separately from writing them. It's good at the review task even when it would have written weak tests originally.
For TDD with agents: human writes the test, agent writes the implementation, and the prompt explicitly forbids modifying the test.

Next chapter: DevOps — agents in CI, agents holding deploy keys, and the honest threat model. When the agent goes from advisor to actor, the security calculus changes. We'll walk through what to lock down, what to leave open, and the patterns that have actually worked.

Discussion

Coding — the day-to-day with an agent DevOps — agents with credentials