Observability has always been about answering questions you didn't anticipate when you wrote the logs. With agents in the loop, the questions you need to answer are different, and so are the signals you need. Your existing observability stack still works for the application. But the agent itself is a new component, with its own failure modes, its own debugging surface, and its own memory model. Specifically: it has no memory. Every run starts fresh. Whatever the agent learned yesterday is gone today unless you persisted it. Your observability data becomes the agent's persistent memory.
This chapter is about adapting observability practice to systems where agents are first-class actors.
Classic observability has three pillars: logs, metrics, traces. For agents, add a fourth — trajectories. Each pillar answers a different question, and the fourth is genuinely new.
Classic services log requests, responses, errors, and significant state changes. Agent runs produce all of those plus a few more event types worth capturing:
Structured logging matters even more here than for services. The fields you want for filtering — task_id, agent_role, model_version, tool_name, cost — should be top-level keys, not buried in message text. Without that, you can't aggregate and you can't alert.
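As a minimal sketch of what "top-level keys" can look like, the snippet below uses Python's standard `logging` module with a small JSON formatter. The class name `JsonFormatter`, the helper `make_logger`, and the sample values are illustrative assumptions, not part of any particular agent framework.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with filterable fields at the top level."""

    # Field names taken from the paragraph above; adjust to your own schema.
    FIELDS = ("task_id", "agent_role", "model_version", "tool_name", "cost")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            # Values passed via `extra=` become attributes on the record.
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("agent")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: the sample values below are made up for illustration.
buffer = io.StringIO()
log = make_logger(buffer)
log.info(
    "tool call finished",
    extra={
        "task_id": "task-123",
        "agent_role": "reviewer",
        "model_version": "model-v1",
        "tool_name": "pr_fetcher",
        "cost": 0.004,
    },
)
event = json.loads(buffer.getvalue())
```

Because `task_id` and `tool_name` are keys rather than substrings of the message, a log backend can filter and aggregate on them without parsing free text.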
| Metric | Why | Alert if |
|---|---|---|
| Cost per task (p50, p95) | Catches creeping inefficiency | p95 grows >2× week over week |
| Turns per task | Hints at retry loops | Trailing-week p99 doubles |
| Tool error rate by tool | Broken tool ≠ broken agent | Any tool exceeds 5% error rate |
| Approval gate hit rate | How often humans block | Sudden spike or drop |
| Success rate (if defined) | The headline measure | Below role-specific SLO |
| Time-to-completion | UX-facing for interactive agents | p95 exceeds expectation |
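
The first row's alert condition (p95 cost per task growing more than 2× week over week) can be sketched as follows. This assumes you can pull per-task costs for two consecutive weeks as plain lists; the function names and the nearest-rank percentile method are illustrative choices, not a prescribed implementation.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of len * pct / 100, avoiding float rounding.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def p95_cost_regressed(last_week, this_week, factor=2.0):
    """True when this week's p95 cost exceeds `factor` times last week's p95."""
    return percentile(this_week, 95) > factor * percentile(last_week, 95)


# Made-up samples: the median is unchanged, but the expensive tail grew.
last_week = [0.10] * 90 + [0.30] * 10
this_week = [0.10] * 90 + [0.70] * 10
alert = p95_cost_regressed(last_week, this_week)
```

In this made-up data the p50 is identical in both weeks, which is why the table tracks p95 alongside p50: a regression in the tail would not show up in the median.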
"Success rate" is the slippery one. Defining what counts as success for a non-trivial agent is hard, and the definition often changes. Worth doing anyway — even an imperfect success metric is more useful than no metric, because it catches regressions you'd otherwise notice anecdotally weeks later.
Metrics work because the labels have low cardinality — service name, status code, region. Agent metrics can quietly explode cardinality if you're not careful:
The fix is to keep metric labels coarse (agent_role, tool_name, model_version) and put the high-cardinality stuff in trajectories and logs.
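One way to enforce coarse labels is to reject anything outside an allow-list at the point where metrics are recorded. The sketch below is an in-memory stand-in for a real metrics client; `ToolCallCounter` and the use of `task_id` as the example of a high-cardinality field are assumptions for illustration.

```python
from collections import Counter

# The coarse labels named in the text above.
ALLOWED_LABELS = ("agent_role", "tool_name", "model_version")


class ToolCallCounter:
    """Counts tool calls, keyed only by low-cardinality labels."""

    def __init__(self):
        self.counts = Counter()

    def record(self, **labels):
        unexpected = set(labels) - set(ALLOWED_LABELS)
        if unexpected:
            # Per-task identifiers belong in logs and trajectories, not labels.
            raise ValueError(f"label(s) not allowed on metrics: {sorted(unexpected)}")
        key = tuple(labels.get(name, "unknown") for name in ALLOWED_LABELS)
        self.counts[key] += 1


counter = ToolCallCounter()
counter.record(agent_role="reviewer", tool_name="pr_fetcher", model_version="model-v1")
counter.record(agent_role="reviewer", tool_name="pr_fetcher", model_version="model-v1")
```

Failing loudly on an unexpected label is a deliberate choice here: it surfaces a cardinality mistake during development rather than in the metrics bill.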
The agent's trajectory is the closest thing in this field to a flight recorder. When a task fails or behaves strangely, the trajectory is what you'll read. Storage matters because trajectories are big — a complex task can produce hundreds of thousands of tokens of trajectory. The patterns that work:
The 7-day rule. Trajectories kept for 7 days get read for incidents this week. Trajectories kept for 90 days get read for "is this a recurring pattern" investigations. Trajectories kept for a year are for training and ML analysis. Match retention to who'll read them.
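The tiers above can be written down as data so that a cleanup job and its readers agree on them. This is a sketch under assumptions: "a year" is taken as 365 days, and `retention_purpose` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# (maximum age, who reads trajectories of that age), per the text above.
RETENTION_TIERS = [
    (timedelta(days=7), "incident debugging"),
    (timedelta(days=90), "recurring-pattern investigation"),
    (timedelta(days=365), "training and ML analysis"),
]


def retention_purpose(created_at, now):
    """Return the purpose a trajectory of this age still serves, or None."""
    age = now - created_at
    for max_age, purpose in RETENTION_TIERS:
        if age <= max_age:
            return purpose
    return None  # older than every tier: a candidate for deletion


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
purpose = retention_purpose(now - timedelta(days=30), now)
```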
To make this concrete, walk through a hypothetical alert: "AI review-bot has been silent on PRs for the last 4 hours." The agent's last successful run was at 9:14 AM; it's now 1:37 PM.
Step 1 — Read the metrics first. Open the agent dashboard. Cost-per-task is normal. Turns-per-task spiked to around 30 (normally 4–6). Tool error rate is elevated for one tool, the GitHub PR fetcher, at 80% errors over the last 4 hours. Hypothesis: GitHub API is throttling. The agent is retrying, looping, and failing silently.
Step 2 — Read a representative trajectory. Pick one failed run from the last hour. The trajectory shows the agent calling the PR fetcher, getting a rate-limit error, retrying, getting another rate-limit error, retrying — for 28 turns before giving up. Hypothesis confirmed: the underlying issue is GitHub rate limiting, and the agent's retry behavior was reasonable but didn't escalate visibly.
Step 3 — Find the root cause. The metric "tool error rate by tool" spiked at 9:30 AM. Check the GitHub status page — known incident affecting the API.
Step 4 — Decide what to fix. Immediate: reduce the agent's polling rate. Short-term: add escalation behavior — after N retries, post a status comment so users aren't left in the dark. Longer-term: add an alert on tool error rate exceeding 10% sustained for 15 minutes — would have paged at 9:45 AM instead of 1:37 PM.
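The longer-term fix, an alert on tool error rate above 10% sustained for 15 minutes, can be sketched as a check over per-minute buckets. The `MinuteBucket` type and `should_page` function are hypothetical names, and treating "sustained" as "every one of the last 15 one-minute buckets is over the threshold" is one reasonable interpretation among several.

```python
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    """Tool calls and errors observed for one tool during one minute."""

    calls: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0


def should_page(buckets, threshold=0.10, sustained_minutes=15):
    """True when each of the most recent `sustained_minutes` buckets exceeds the threshold."""
    if len(buckets) < sustained_minutes:
        return False
    recent = buckets[-sustained_minutes:]
    return all(bucket.error_rate > threshold for bucket in recent)


# Made-up data: healthy traffic, then 15 minutes at an 80% error rate.
healthy = [MinuteBucket(calls=20, errors=0) for _ in range(16)]
throttled = [MinuteBucket(calls=20, errors=16) for _ in range(15)]
page = should_page(healthy + throttled)
```

Requiring every bucket in the window to be over the threshold keeps a one-minute blip from paging anyone, at the cost of a 15-minute delay before the page fires.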
This is the same shape of postmortem you'd write for any service incident — but the trajectory makes the agent's "thinking" legible in a way that log-only debugging never would.
Session replay, meaning re-running a problematic agent session with the same inputs to study its behavior, is one of the highest-leverage debugging techniques available. It works disproportionately well for agents because agent runs are largely reproducible at the system level: the same inputs, model version, and prompts produce similar outputs, with model nondeterminism accounting for the remaining variation.
The technique: capture the full input state at task start (system prompt, instructions, context files); replay the task using the same captured state; modify one variable at a time (the prompt, a tool description, an instruction) and re-run; compare trajectories to see which variable mattered.
This turns anecdote ("the agent did this weird thing once") into evidence ("the agent does this weird thing 80% of the time when the instruction is phrased this way, and never when phrased that way"). It's the empirical method, applied to a notoriously empirical domain.
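A replay harness along these lines can be quite small. In the sketch below, `CapturedState`, `replay_rate`, and `stub_agent` are hypothetical names; `stub_agent` is a deterministic stand-in for a real agent invocation, so the rates it produces are illustrative only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CapturedState:
    """Everything needed to restart a task from its beginning."""

    system_prompt: str
    instructions: str
    context_files: tuple


def replay_rate(run_agent, state, shows_behavior, runs=10):
    """Fraction of replays whose trajectory shows the behavior under study."""
    hits = sum(1 for _ in range(runs) if shows_behavior(run_agent(state)))
    return hits / runs


def stub_agent(state):
    """Stand-in for a real agent call; returns a trajectory as a list of steps."""
    if "ASAP" in state.instructions:
        return ["fetch_pr", "skip_tests", "post_review"]
    return ["fetch_pr", "run_tests", "post_review"]


def skipped_tests(trajectory):
    return "skip_tests" in trajectory


baseline = CapturedState(
    system_prompt="You are a review bot.",
    instructions="Review this PR ASAP.",
    context_files=("diff.patch",),
)
# Change exactly one variable; everything else is copied from the capture.
variant = replace(baseline, instructions="Review this PR carefully.")

baseline_rate = replay_rate(stub_agent, baseline, skipped_tests)
variant_rate = replay_rate(stub_agent, variant, skipped_tests)
```

With a real model behind `run_agent`, the two rates would rarely be exactly 0 or 1, which is the reason for running each configuration several times instead of once.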
When an agent acts on your application — deploys code, calls an internal API, writes to a database — you want to follow the work across both systems. The pattern that works: propagate a trace ID from the agent's run into every downstream call, typically via a header like X-Agent-Task-Id.
On the receiving side, log that header alongside the existing request logs. Downstream, when you investigate a request, you can trace back to which agent task triggered it and read that trajectory.
This is what lets you answer questions like "did the agent cause this incident?" with seconds of investigation instead of hours. The investment is small. The payoff shows up the first time something goes wrong.
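Both sides of the propagation can be sketched with the standard library. The URL, the function names, and the `agent_task_id` log field are assumptions for illustration; only the `X-Agent-Task-Id` header name comes from the text above. The request is built but never sent.

```python
import urllib.request

TASK_HEADER = "X-Agent-Task-Id"


def build_downstream_request(url, task_id, data=None):
    """Agent side: attach the task ID to an outgoing call."""
    request = urllib.request.Request(url, data=data)
    request.add_header(TASK_HEADER, task_id)
    return request


def request_log_fields(headers, method, path):
    """Receiving side: fields to log alongside the existing request log entry."""
    # HTTP header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "method": method,
        "path": path,
        "agent_task_id": lowered.get(TASK_HEADER.lower()),
    }


# Hypothetical internal endpoint and task ID.
outgoing = build_downstream_request("https://internal.example/deploy", "task-123", data=b"{}")
fields = request_log_fields(dict(outgoing.header_items()), "POST", "/deploy")
```

Requests that did not originate from an agent simply log `agent_task_id` as `None`, which makes "was an agent involved?" a single filter on the request logs.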
Set up logging for one agent of yours — even a personal one. Capture the six event types listed earlier. Run a few tasks. Open the logs after. Is the question "what did the agent do for this task?" answerable in 30 seconds? If not, what's missing?
For one production agent role, list the metrics you currently track. Cross-reference against the table earlier in this chapter. Which metrics are you missing, and which existing metrics aren't useful? Add the missing ones; consider deprecating the noisy ones.
Pick a recent agent failure you debugged. Estimate how long it took. Then ask: with full observability (trajectories, cross-correlation, metrics), what would the debugging time have been? The delta is the ROI of the investment you haven't made yet.
Next chapter: Maintenance — the long tail. Six months in, when half your codebase has agent-touched files, what does the practice look like?