Hand an agent a task that involves writing tests, and you'll get tests. They'll be well-formatted. They'll have descriptive names. They'll pass. They might also test absolutely nothing of value: asserting that the code does what it does, baking implementation choices into the assertions, mocking out the very thing the test was meant to verify. The agent isn't trying to deceive you. It's pattern-matching against test files it has seen, and most test files are full of low-value tests too.
This chapter is about the testing discipline that catches the agent's specific failure modes before they ship, and the workflow that makes that discipline sustainable.
Almost every weak test an agent writes falls into one of four patterns. Once you can name them, you can spot them in seconds.
The test asserts that the code does what the code does. The setup defines the expected value; the function produces the same value because it's the function that produced the setup. A test for a "formatPrice" function that calls formatPrice twice and asserts the results are equal — that's the shape. The test cannot fail unless the function disappears entirely.
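A minimal sketch of that shape in Python, next to a spec-anchored alternative. `format_price` is a hypothetical stand-in for the `formatPrice` function above, and the rule it follows (integer cents rendered as dollars with two decimals and thousands separators) is an assumption made up for illustration.

```python
def format_price(cents: int) -> str:
    """Hypothetical function under test: integer cents -> display string."""
    return f"${cents / 100:,.2f}"


def test_tautological() -> None:
    # Weak: compares the function with itself, so any deterministic
    # output passes, including a wrong one.
    assert format_price(123456) == format_price(123456)


def test_against_spec() -> None:
    # Stronger: expected values are written down from the (assumed) spec,
    # independently of how format_price is implemented.
    assert format_price(123456) == "$1,234.56"
    assert format_price(5) == "$0.05"
    assert format_price(0) == "$0.00"
```

If `format_price` dropped the thousands separator tomorrow, the first test would still pass and the second would fail. That difference is the whole point.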
The test asserts on the internal structure of a class or module: that a cache field is defined, that a private helper is called, that a key prefix is exactly "user:". Every refactor, even a beneficial one, breaks these tests. They constrain the implementation rather than the behavior, so they fail on harmless changes and still miss real regressions unless the bug happens to touch a detail they are coincidentally checking.
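Here is a small sketch of the contrast, assuming a hypothetical `UserStore` class with a private cache. The class, its methods, and the key format are all invented for illustration.

```python
from typing import Optional


class UserStore:
    """Hypothetical store with an internal cache; all names are illustrative."""

    def __init__(self) -> None:
        self._cache = {}

    def _key(self, user_id: int) -> str:
        return f"user:{user_id}"

    def save(self, user_id: int, name: str) -> None:
        self._cache[self._key(user_id)] = name

    def load(self, user_id: int) -> Optional[str]:
        return self._cache.get(self._key(user_id))


def test_coupled_to_internals() -> None:
    # Weak: pins the private field and the key format. Renaming _cache or
    # changing the prefix breaks this test even though behavior is unchanged.
    store = UserStore()
    store.save(7, "Ada")
    assert "user:7" in store._cache


def test_behavior() -> None:
    # Stronger: exercises only the public contract. It survives any refactor
    # that keeps save/load working and fails if a saved user cannot be loaded.
    store = UserStore()
    store.save(7, "Ada")
    assert store.load(7) == "Ada"
    assert store.load(8) is None
```

Swap the dictionary for an LRU cache or change the key scheme and only the first test needs rewriting, which tells you which of the two was describing behavior.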
The test "verifies" a function by replacing everything inside it with mocks, then asserting the mocks were called. A test for "processOrder" that mocks the charge function and the email function, then asserts both were called — tells you the function calls its dependencies. It tells you nothing about whether the charge amount is correct, whether the receipt has the right contents, whether the order is marked as paid, or whether the email is sent after the charge succeeded (a real bug if reversed).
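The sketch below shows both versions, assuming a hypothetical `process_order` that takes its charge and email functions as parameters. The signature, the order fields, and the receipt text are illustrative assumptions, not a real API.

```python
from unittest.mock import Mock


def process_order(order, charge, send_receipt):
    """Hypothetical implementation: charge, mark paid, then email a receipt."""
    charge(order["customer"], order["total_cents"])
    order["status"] = "paid"
    send_receipt(order["customer"], f"Paid {order['total_cents']} cents")


def test_over_mocked() -> None:
    # Weak: only proves that both dependencies were touched.
    charge, send_receipt = Mock(), Mock()
    process_order({"customer": "ada", "total_cents": 1999}, charge, send_receipt)
    assert charge.called and send_receipt.called


def test_behavior() -> None:
    # Stronger: recording fakes capture what happened and in which order.
    events = []
    order = {"customer": "ada", "total_cents": 1999}
    process_order(
        order,
        charge=lambda who, cents: events.append(("charge", who, cents)),
        send_receipt=lambda who, body: events.append(("email", who, body)),
    )
    assert events[0] == ("charge", "ada", 1999)  # right amount, charged first
    assert events[1][0] == "email" and "1999" in events[1][2]  # receipt contents
    assert order["status"] == "paid"  # order marked as paid


def test_no_receipt_when_charge_fails() -> None:
    # Fails if the email is ever sent before the charge has succeeded.
    sent = []
    order = {"customer": "ada", "total_cents": 1999}

    def failing_charge(who, cents):
        raise RuntimeError("card declined")

    try:
        process_order(order, failing_charge, lambda who, body: sent.append(body))
    except RuntimeError:
        pass
    assert sent == []
    assert order.get("status") != "paid"
```

The over-mocked test keeps passing if you reverse the two calls or charge the wrong amount. The other two would not.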
The fixture defines the expected output of the function under test. Both were written at the same time; both will be updated together if the function changes. The test isn't checking a specification. It's checking that fixture and function stay synchronized, which they will, by your hand.
If you remember one thing from this chapter: tests assert behavior described in the spec, not the code as written. The test should be writable from the spec alone, without looking at the implementation.
Practical consequence: when prompting an agent to write tests, give it the spec, not the implementation. If you don't have a spec, the test will end up coupled to the code by default, because that's all the agent has to go on. The pattern that works is: human supplies the spec, agent writes tests against it without reading the implementation file, human reviews both.
This is uncomfortable for teams that have never written specs explicit enough to be testable. The good news is that this discipline pays off twice — once at the spec phase (Ch. 02), once at the testing phase. The same artifact does both jobs.
The fastest way to evaluate a test is to ask: what change to the production code would break this test?
Good tests have an answer like "if we accidentally returned the gross instead of the net," or "if we charged before validating the card." The answer points at a real bug the test would catch.
Weak tests have answers like "if we changed the variable name," or "if we deleted this function entirely." The first is noise; the second is technically true but the test isn't earning its keep.
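One way to make the question concrete is to write the answer next to each test. The sketch below assumes a hypothetical `net_total` function and an invented rule (the net is the gross minus the discount, never below zero); each comment names the production-code change that would break the test.

```python
def net_total(gross_cents: int, discount_cents: int) -> int:
    """Hypothetical function: amount due after discount, never below zero."""
    return max(gross_cents - discount_cents, 0)


def test_discount_is_applied() -> None:
    # Breaks if we accidentally return the gross instead of the net.
    assert net_total(10_000, 1_500) == 8_500


def test_total_never_negative() -> None:
    # Breaks if the clamp at zero is removed.
    assert net_total(1_000, 1_500) == 0
```

If you cannot fill in the comment with a plausible bug, that is usually the signal that the test is weak.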
The fragility ratio. A test is healthy when it breaks more often from real bugs than from refactors. Track this in your head as you review. If you find yourself updating a test every time you refactor surrounding code, the test is over-coupled and worth rewriting.
Mutation testing (running your test suite against deliberately broken versions of the code to see which broken versions the tests catch) was historically too slow to use widely. With agents in the loop, two things change. First, the agent can generate a handful of plausible mutations of your code, which leaves far fewer mutants to run than the exhaustive output of classic AST-based mutators. Second, the agent can run the test suite against each mutant and report which mutants survive, which is exactly the analysis you want.
The workflow: ask the agent to produce five small, plausible bugs an engineer might accidentally introduce in a given module, then run the test suite against each. Anything the suite doesn't catch is a gap in your assertions. Anything it does catch is evidence the test was earning its keep.
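The bookkeeping behind that workflow can be sketched in a few lines. This is a toy, in-process version under stated assumptions: `net_total`, the five mutants, and the `suite` function are all hypothetical, and the mutants are hand-written lambdas standing in for what an agent would produce. In a real run the agent would patch source files and invoke your actual test runner.

```python
from typing import Callable, Dict, List

PriceFn = Callable[[int, int], int]


def net_total(gross: int, discount: int) -> int:
    """Hypothetical production function: net amount, never below zero."""
    return max(gross - discount, 0)


# Five small, plausible slips (hand-written here for illustration).
MUTANTS: Dict[str, PriceFn] = {
    "returns gross": lambda g, d: g,
    "adds discount": lambda g, d: max(g + d, 0),
    "no clamp at zero": lambda g, d: g - d,
    "clamps at one": lambda g, d: max(g - d, 1),
    "swapped operands": lambda g, d: max(d - g, 0),
}


def suite(fn: PriceFn) -> None:
    """The existing test suite, parameterized over the implementation."""
    assert fn(10_000, 1_500) == 8_500
    assert fn(2_000, 0) == 2_000


def surviving_mutants(run_suite: Callable[[PriceFn], None],
                      mutants: Dict[str, PriceFn]) -> List[str]:
    """Return the names of mutants the suite fails to catch."""
    survivors = []
    for name, mutant in mutants.items():
        try:
            run_suite(mutant)
        except AssertionError:
            continue  # killed: the suite caught this bug
        survivors.append(name)
    return survivors


suite(net_total)  # the real implementation must pass first
print(surviving_mutants(suite, MUTANTS))
```

In this toy case the two clamp-related mutants survive, which points at a specific missing assertion: nothing in the suite checks a discount larger than the gross.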
Two surviving mutants out of five is borderline. Three or more would be a smell. The point isn't a number; the point is that mutation testing surfaces gaps in your assertions that no other tool reliably finds. Use it on your critical modules every quarter. Skip the rest.
The same pattern-matching that makes agents prone to writing weak tests makes them good at spotting weak tests when you ask explicitly. Treat test review as a separate task from test writing.
The prompt is simple: review the tests in this file; for each test, tell me what production-code change would cause it to fail; flag any tests where the answer is "no real bug would cause this to fail" or "only a structural refactor would break this."
What you get back is a per-test verdict — strong, weak, or somewhere in between, with the reason named. The output reads like a code review from a careful colleague who has time you don't.
This kind of review used to happen, badly, during code review of the PR itself — where reviewers might or might not have the patience to evaluate each test individually. Agents do it consistently. Worth running on any test file with more than a handful of tests, especially ones inherited from someone else.
The classic test pyramid — lots of unit tests, fewer integration tests, very few end-to-end tests — still holds, but with a small tilt now. Integration tests have become more valuable per test, because the failure modes they catch (a bad contract between two modules, a regression in a third-party library, a missing migration) are exactly the failures an agent-driven change is most likely to introduce.
| Layer | Older guidance | 2026 mix | Why the shift |
|---|---|---|---|
| Unit | ~70% | ~60% | Still the backbone — fast, focused on pure logic |
| Integration | ~20% | ~30% | Catches contract drift that agent diffs introduce |
| End-to-end | ~10% | ~10% | Brittle either way; only for critical user flows |
The shift up in integration tests pays for itself the first time an agent's "looks fine in isolation" change breaks a contract between two modules.
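A small sketch of what "contract drift" looks like in a test file. The two functions stand in for two hypothetical modules, a producer and a consumer that agree on a `total_cents` field; every name here is invented for illustration.

```python
def build_invoice(order: dict) -> dict:
    """Producer module (hypothetical): turns an order into an invoice."""
    total = sum(item["price_cents"] * item["qty"] for item in order["items"])
    return {"customer": order["customer"], "total_cents": total}


def render_receipt(invoice: dict) -> str:
    """Consumer module (hypothetical): relies on the 'total_cents' key."""
    return f"{invoice['customer']} paid ${invoice['total_cents'] / 100:.2f}"


def test_render_receipt_unit() -> None:
    # Unit test with a hand-built invoice. It keeps passing even if
    # build_invoice later renames 'total_cents', because it never calls it.
    invoice = {"customer": "ada", "total_cents": 1999}
    assert render_receipt(invoice) == "ada paid $19.99"


def test_invoice_to_receipt_integration() -> None:
    # Integration test: feeds the real producer into the real consumer,
    # so a renamed or re-typed field fails here.
    order = {
        "customer": "ada",
        "items": [{"price_cents": 999, "qty": 2}, {"price_cents": 1, "qty": 1}],
    }
    assert render_receipt(build_invoice(order)) == "ada paid $19.99"
```

An agent asked to "clean up" `build_invoice` can rename the field and leave every unit test green. Only the second test notices.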
TDD with an agent is genuinely different from TDD by hand. The agent has no muscle memory; the discipline of "test first" has to be enforced by the harness, not by the developer's habit. The pattern that works:
Step 1 forces the spec to be concrete. Step 2 makes the agent's job narrow and unambiguous. The rule in step 3 prevents the most common agent shortcut: changing the test until it passes.
The "don't modify the test" rule belongs in your prompt explicitly. Without it, an agent will sometimes "fix" a test that asserts behavior it can't easily implement. With it, the agent is forced to actually implement the behavior — or come back and ask if the test is wrong. Either outcome is useful; silent test-rewriting is not.
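The prompt states the rule; the harness can also check it. One possible sketch, assuming your harness knows which files are the tests: fingerprint them before the agent runs and compare afterwards. The function names are hypothetical, and the step that actually invokes the agent is omitted because it depends on your tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def fingerprint(paths: Iterable) -> Dict[str, Optional[str]]:
    """Map each path to the SHA-256 of its contents (None if the file is missing)."""
    result = {}
    for p in map(Path, paths):
        result[str(p)] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else None
        )
    return result


def changed_files(before: Dict[str, Optional[str]],
                  after: Dict[str, Optional[str]]) -> List[str]:
    """Return the paths whose contents differ between the two snapshots."""
    return sorted(path for path in before if after.get(path) != before[path])
```

Typical use: take `before = fingerprint(test_files)`, let the agent work, then reject the change if `changed_files(before, fingerprint(test_files))` is non-empty. A deleted test file counts as changed, which closes the other obvious shortcut.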
Take a test file from your codebase that you trust. Pick three random tests in it. For each, write down — without looking — what production-code change would cause it to fail. Then look at the test. Were you right? The disconnect, when it appears, reveals tests that don't actually constrain what they claim to.
Pick a recent feature you shipped with agent help. Ask a fresh agent session to review the tests in that feature using the review prompt described above. Count: how many tests would you remove or rewrite given the review?
Pick the test file with the highest line count in your codebase. It's probably also one of the least-trusted by your team. Run an agent review on it. Use the review as a basis for a focused cleanup session — usually 30–50% of tests in such files can be deleted with no loss of coverage, and the remaining tests get stronger.
Next chapter: DevOps — agents in CI, agents holding deploy keys, and the honest threat model. When the agent goes from advisor to actor, the security calculus changes. We'll walk through what to lock down, what to leave open, and the patterns that have actually worked.