Agentic SDLC: A Field Manual for Building Software with AI Agents

Observability — debugging the unrepeatable

Observability was always about answering questions you didn't anticipate when you wrote the logs. With agents in the loop, the questions you'll need to answer are different — and the signals you need are different too. Your existing observability stack still works for the application. But the agent itself is a new component, with its own failure modes, its own debugging surface, and its own memory model. Specifically: it has no memory. Every run starts fresh. Whatever the agent learned yesterday is gone today unless you persisted it. Observability becomes the persistent memory.

This chapter is about adapting observability practice to systems where agents are first-class actors.

What you'll take away from this chapter

The four signals you need from every agent run — and how they extend the classic three-pillar model
The metrics worth tracking and the cardinality trap that ruins them if you're not careful
How to debug an agent failure that the agent itself can't remember
The "session replay" technique — and why it works disproportionately well for agent debugging
What it means to cross-correlate agent traces with application traces, and why it shortens incident response from hours to minutes

The four agent signals

Classic observability has three pillars: logs, metrics, traces. For agents, add a fourth: trajectories, the step-by-step record of what the agent did and why during a run. Each pillar answers a different question.

The fourth pillar is what's new. Trajectories sit alongside the classic three; they don't replace any of them.

What to log about an agent

Classic services log requests, responses, errors, and significant state changes. Agent runs produce all of those plus a few more event types worth capturing:

Task received — full input, including the resolved system prompt and any context loaded
Tool call — name, arguments, latency, result or error
Model call — model version, tokens in/out, latency, finish reason
Decision point — when the agent had multiple paths and picked one
Approval gate — what was proposed, who approved, when
Task complete or aborted — final outcome, with cost and time totals

Structured logging matters even more here than for services. The fields you want for filtering — task_id, agent_role, model_version, tool_name, cost — should be top-level keys, not buried in message text. Without that, you can't aggregate and you can't alert.
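As a minimal sketch of what "top-level keys" means in practice, the snippet below emits each agent event as one JSON line using only the Python standard library. The helper names (build_event, log_event) and the sample values are hypothetical; the field names come from the list above.

```python
import json
import logging
import sys
import time

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.propagate = False  # avoid duplicate lines if the root logger is configured


def build_event(event_type, task_id, agent_role, model_version, **fields):
    """Return one agent event as a flat dict with filterable top-level keys."""
    event = {
        "ts": time.time(),
        "event": event_type,
        "task_id": task_id,
        "agent_role": agent_role,
        "model_version": model_version,
    }
    event.update(fields)  # e.g. tool_name, latency_ms, cost
    return event


def log_event(event_type, task_id, agent_role, model_version, **fields):
    """Emit the event as a single JSON line and return it."""
    event = build_event(event_type, task_id, agent_role, model_version, **fields)
    logger.info(json.dumps(event, sort_keys=True))
    return event


# Hypothetical tool-call event; all values are illustrative.
log_event(
    "tool_call",
    task_id="task-123",
    agent_role="review-bot",
    model_version="model-v1",
    tool_name="pr_fetcher",
    latency_ms=412,
    error=None,
    cost=0.0,
)
```

Because every field is a key rather than part of a message string, a log backend can filter on tool_name or sum cost without parsing text.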

Metrics worth tracking

Metric	Why	Alert if
Cost per task (p50, p95)	Catches creeping inefficiency	p95 grows >2× week over week
Turns per task	Hints at retry loops	p99 doubles relative to the trailing week
Tool error rate by tool	Broken tool ≠ broken agent	Any tool exceeds 5% error rate
Approval gate hit rate	How often humans block	Sudden spike or drop
Success rate (if defined)	The headline measure	Below role-specific SLO
Time-to-completion	UX-facing for interactive agents	p95 exceeds expectation

"Success rate" is the slippery one. Defining what counts as success for a non-trivial agent is hard, and the definition often changes. Worth doing anyway — even an imperfect success metric is more useful than no metric, because it catches regressions you'd otherwise notice anecdotally weeks later.
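To make the first row of the table concrete, here is a rough sketch of the "p95 grows >2× week over week" check. It assumes you can pull per-task costs for two weeks as plain lists; the function names and the nearest-rank percentile method are choices made for this example, not a prescribed implementation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def p95_cost_regressed(last_week_costs, this_week_costs, factor=2.0):
    """True if this week's p95 cost exceeds `factor` times last week's p95."""
    return percentile(this_week_costs, 95) > factor * percentile(last_week_costs, 95)


# Illustrative data: 10% of this week's tasks cost three times as much.
last_week = [1.0] * 100
this_week = [1.0] * 90 + [3.0] * 10
print(p95_cost_regressed(last_week, this_week))
```

The same percentile helper works for turns per task and time-to-completion; only the threshold changes.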

The cardinality trap

Metrics work because the labels have low cardinality — service name, status code, region. Agent metrics can quietly explode cardinality if you're not careful:

Per-task labels. Don't add task_id as a label; it goes in logs and traces.
Free-form user input. Don't label by user message content.
Tool args. Tool name is fine; arg values are not.

The fix is to keep metric labels coarse (agent_role, tool_name, model_version) and put the high-cardinality stuff in trajectories and logs.
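One way to enforce this, sketched under the assumption that you control the metrics wrapper: reject any label outside a small allowlist at the point where the metric is recorded. LabeledCounter is a hypothetical in-memory stand-in for whatever metrics client you use; the allowlist holds the three coarse labels named above.

```python
from collections import Counter

# Coarse labels only; anything else belongs in logs or trajectories.
ALLOWED_LABELS = frozenset({"agent_role", "tool_name", "model_version"})


class LabeledCounter:
    """In-memory counter that refuses labels outside an allowlist."""

    def __init__(self, name, allowed=ALLOWED_LABELS):
        self.name = name
        self.allowed = allowed
        self.values = Counter()

    def inc(self, amount=1, **labels):
        extra = set(labels) - self.allowed
        if extra:
            raise ValueError(f"labels not allowed on {self.name}: {sorted(extra)}")
        key = tuple(sorted(labels.items()))  # stable key regardless of kwarg order
        self.values[key] += amount


tool_errors = LabeledCounter("tool_errors_total")
tool_errors.inc(agent_role="review-bot", tool_name="pr_fetcher")

try:
    tool_errors.inc(tool_name="pr_fetcher", task_id="task-123")
except ValueError as exc:
    print(exc)  # task_id is rejected; it should go in logs and traces
```

Failing loudly at write time is a design choice: it surfaces the cardinality mistake in development rather than in the metrics bill.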

Trajectories as the fourth pillar

The agent's trajectory is the closest thing in this field to a flight recorder. When a task fails or behaves strangely, the trajectory is what you'll read. Storage matters because trajectories are big — a complex task can produce hundreds of thousands of tokens of trajectory. The patterns that work:

Always-on capture, sampled retention. Capture every trajectory; keep all of them for 7 days; keep failed runs and a sample of the rest for 90 days or more.
Compressed storage. Trajectories compress 5–10× with standard compression. Cheap.
Indexed metadata, full-text body. Index on task_id, agent_role, outcome, model, cost. Store the body as searchable text.
Link from logs to trajectories. Every log entry should carry the trajectory ID so you can jump from "this looked off" to "show me the full reasoning."

The 7-day rule. Trajectories kept for 7 days get read for incidents this week. Trajectories kept for 90 days get read for "is this a recurring pattern" investigations. Trajectories kept for a year are for training and ML analysis. Match retention to who'll read them.
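The retention tiers above can be sketched as a small policy function. Everything here is an assumption for illustration: the 5% sample rate, the outcome strings, the function names, and the choice to drop everything after 90 days (a longer training archive would need its own rule). Sampling is keyed on a hash of the task ID so the keep-or-drop decision is stable across runs of the cleanup job.

```python
import hashlib
import zlib

HOT_DAYS = 7         # keep everything
WARM_DAYS = 90       # keep failures plus a sample
SAMPLE_PERCENT = 5   # assumed sample rate for successful runs


def in_sample(task_id, percent=SAMPLE_PERCENT):
    """Deterministically place a task in or out of the retained sample."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def should_retain(task_id, outcome, age_days):
    """Decide whether a stored trajectory survives the next cleanup pass."""
    if age_days <= HOT_DAYS:
        return True
    if age_days <= WARM_DAYS:
        return outcome != "success" or in_sample(task_id)
    return False


def pack(trajectory_text):
    """Compress a trajectory body for storage."""
    return zlib.compress(trajectory_text.encode("utf-8"), 9)


def unpack(blob):
    """Restore a trajectory body from its compressed form."""
    return zlib.decompress(blob).decode("utf-8")


body = "turn 1: call pr_fetcher -> rate limit error\n" * 200
print(len(body), "bytes raw,", len(pack(body)), "bytes compressed")
```

The compression ratio on this repetitive sample says nothing about real trajectories; measure it on your own data.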

Debugging a real agent failure

To make this concrete, walk through a hypothetical alert: "AI review-bot has been silent on PRs for the last 4 hours." The agent's last successful run was at 9:14 AM; it's now 1:37 PM.

Step 1 — Read the metrics first. Open the agent dashboard. Cost-per-task is normal. Turns-per-task spiked to around 30 (normally 4–6). Tool error rate is elevated for one tool, the GitHub PR fetcher, at 80% errors over the last 4 hours. Hypothesis: GitHub API is throttling. The agent is retrying, looping, and failing silently.

Step 2 — Read a representative trajectory. Pick one failed run from the last hour. The trajectory shows the agent calling the PR fetcher, getting a rate-limit error, retrying, getting another rate-limit error, retrying — for 28 turns before giving up. Hypothesis confirmed: the underlying issue is GitHub rate limiting, and the agent's retry behavior was reasonable but didn't escalate visibly.

Step 3 — Find the root cause. The metric "tool error rate by tool" spiked at 9:30 AM. Check the GitHub status page — known incident affecting the API.

Step 4 — Decide what to fix. Immediate: reduce the agent's polling rate. Short-term: add escalation behavior — after N retries, post a status comment so users aren't left in the dark. Longer-term: add an alert on tool error rate exceeding 10% sustained for 15 minutes — would have paged at 9:45 AM instead of 1:37 PM.
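The longer-term alert from Step 4 can be sketched as a streak counter over per-minute buckets. The input shape (a list of (calls, errors) tuples, one per minute) and the function name are assumptions for this example; a real alerting system would express the same rule in its own query language. Minutes with no calls are treated as healthy here, which resets the streak.

```python
def first_page_minute(buckets, threshold=0.10, sustain_minutes=15):
    """Return the index of the minute at which the alert would fire, or None.

    buckets: per-minute (calls, errors) tuples in time order.
    The alert fires once the error rate has exceeded `threshold`
    for `sustain_minutes` consecutive minutes.
    """
    streak = 0
    for minute, (calls, errors) in enumerate(buckets):
        rate = errors / calls if calls else 0.0
        streak = streak + 1 if rate > threshold else 0
        if streak >= sustain_minutes:
            return minute
    return None


# Illustrative replay of the incident: minute 0 is 9:00 AM, the tool is
# healthy for 30 minutes, then fails 80% of calls from 9:30 AM onward.
incident = [(10, 0)] * 30 + [(10, 8)] * 60
print(first_page_minute(incident))
```

With this data the function returns 44, the bucket covering 9:44 to 9:45 AM, which lines up with the 9:45 AM page time in Step 4.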

This is the same shape of postmortem you'd write for any service incident — but the trajectory makes the agent's "thinking" legible in a way that log-only debugging never would.

Session replay for agents

Session replay (re-running a problematic agent session with the same inputs to study its behavior) is one of the highest-leverage debugging techniques available. It works disproportionately well for agents because agent runs are largely reproducible at the system level: the same inputs, model version, and prompts produce similar outputs, with the remaining variation coming from model nondeterminism.

The technique:

Capture the full input state at task start (system prompt, instructions, context files, model version).
Replay the task using the same captured state.
Modify one variable at a time (the prompt, a tool description, an instruction) and re-run.
Compare trajectories to see which variable mattered.

This turns anecdote ("the agent did this weird thing once") into evidence ("the agent does this weird thing 80% of the time when the instruction is phrased this way, and never when phrased that way"). It's the empirical method, applied to a notoriously empirical domain.
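A minimal replay harness might look like the sketch below. The agent is passed in as a callable so the harness stays independent of any particular framework; fake_agent is a stand-in that simulates an agent misbehaving 80% of the time when the instruction contains "ASAP", purely so the example runs without a model. All names and the captured-state shape are hypothetical.

```python
import copy
import random


def replay(run_agent, captured_state, overrides=None, runs=20, seed=0):
    """Re-run a captured task `runs` times; return the fraction flagged weird.

    run_agent: callable taking (state, rng) and returning {"weird": bool}.
    overrides: the single variable being changed for this experiment.
    """
    state = copy.deepcopy(captured_state)  # never mutate the captured original
    state.update(overrides or {})
    rng = random.Random(seed)
    weird = sum(1 for _ in range(runs) if run_agent(state, rng)["weird"])
    return weird / runs


def fake_agent(state, rng):
    """Stand-in for a real agent run; behavior is simulated, not measured."""
    p_weird = 0.8 if "ASAP" in state["instruction"] else 0.0
    return {"weird": rng.random() < p_weird}


captured = {
    "system_prompt": "You are a review bot.",
    "instruction": "Review this PR ASAP.",
    "context_files": ["README.md"],
}

baseline = replay(fake_agent, captured, runs=50)
variant = replay(fake_agent, captured, {"instruction": "Review this PR."}, runs=50)
print(f"baseline weird rate: {baseline:.0%}, variant weird rate: {variant:.0%}")
```

With a real agent, run_agent would call the model and "weird" would come from a check on the resulting trajectory; the comparison of rates across variants is the part that carries over.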

Cross-correlation

When an agent acts on your application — deploys code, calls an internal API, writes to a database — you want to follow the work across both systems. The pattern that works: propagate a trace ID from the agent's run into every downstream call, typically via a header like X-Agent-Task-Id.

On the receiving side, log that header alongside the existing request logs. Later, when you investigate a request, you can trace it back to the agent task that triggered it and read that trajectory.
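Both sides of that propagation can be sketched with plain dictionaries, independent of any HTTP library. The header name is the one used above; the function names and the log-record shape are assumptions for illustration. Header lookup is case-insensitive because HTTP header names are.

```python
TASK_HEADER = "X-Agent-Task-Id"


def with_task_header(task_id, headers=None):
    """Agent side: return a copy of the outbound headers with the task ID added."""
    out = dict(headers or {})
    out[TASK_HEADER] = task_id
    return out


def request_log_record(method, path, headers):
    """Service side: build a request log record that carries the agent task ID."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "method": method,
        "path": path,
        # None means the request did not come from an agent run.
        "agent_task_id": lowered.get(TASK_HEADER.lower()),
    }


outbound = with_task_header("task-123", {"Accept": "application/json"})
print(request_log_record("POST", "/deploy", outbound))
```

If you already run distributed tracing, the same task ID can ride along as an attribute on the existing trace rather than as a separate header.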

This is what lets you answer questions like "did the agent cause this incident?" with seconds of investigation instead of hours. The investment is small. The payoff shows up the first time something goes wrong.

Practice — before you read the next chapter

If you're new to this

Set up logging for one agent of yours — even a personal one. Capture the six event types listed earlier. Run a few tasks. Open the logs after. Is the question "what did the agent do for this task?" answerable in 30 seconds? If not, what's missing?

If you're running agents in production

For one production agent role, list the metrics you currently track. Cross-reference against the table earlier in this chapter. Which metrics are you missing, and which existing metrics aren't useful? Add the missing ones; consider deprecating the noisy ones.

If you lead a team

Pick a recent agent failure you debugged. Time how long it took. Then ask: with full observability (trajectories, cross-correlation, metrics), what would the debugging time have been? The delta is the ROI of the investment you haven't made yet.

Takeaways

Four pillars now: logs, metrics, traces, trajectories. The fourth is new and load-bearing.
Log task events with structured fields. Keep cardinality in mind; high-cardinality data goes in trajectories, not metrics.
Metrics that matter: cost, turns, tool errors, approval rates, success rate, time-to-completion. Alert on the right ones with the right thresholds.
Capture every trajectory; sample what you retain long-term. Index metadata, keep bodies searchable.
Session replay turns anecdote into evidence. Use it whenever you have a recurring weirdness.
Propagate agent task IDs into application traces. The cross-correlation is what makes incident triage fast.

Next chapter: Maintenance — the long tail. Six months in, when half your codebase has agent-touched files, what does the practice look like?
