There's a meaningful step from "agent helps me write code on my laptop" to "agent runs in CI, has secrets, can deploy." The first is low-stakes — worst case, you don't commit the bad code. The second is the reason security teams exist. The agent isn't running with your judgment behind every action anymore; it's running on a budget, against a clock, with credentials. This chapter is about doing that responsibly, without falling for either failure mode — locking everything down so tightly the agent is useless, or trusting the agent the way you'd trust a senior engineer.
This is the chapter where the honest threat model matters most. The decisions made here determine whether the organisation gets the benefits of agentic SDLC or just the risks.
"Agent goes rogue" is the dramatic threat. It's also the least likely one. The realistic threats, ranked roughly in the order you'll encounter them:
| # | Threat | Likelihood | Impact |
|---|---|---|---|
| 1 | Prompt injection through ingested content (file, web page, ticket) | High | Variable |
| 2 | Cost runaway from retry loops | High | Low (money, not data) |
| 3 | Scope creep on legitimate tasks | High | Low–Medium |
| 4 | Credential mishandling — secret in log, commit, or output | Medium | High |
| 5 | Compromised dependency, tool server, or model provider | Low | High |
| 6 | True misalignment — model acts against instructions | Very low | High |
Most of the patterns in this chapter aim at threats 1–4. Threats 5 and 6 require organisational measures — vendor selection, monitoring, incident response — more than tactical ones.
The principle of least authority — give the agent the minimum permissions needed for its task, nothing more — is widely cited and inconsistently practiced. Here's what it looks like for the common agent roles:
| Role | Read | Write | Credentials |
|---|---|---|---|
| Development agent (laptop) | Project dir only | Sandboxed shell, project dir | None default; per-task only |
| CI review agent | PR + diff + repo | PR comments only | Read-only repo token |
| CI fix agent | Repo + failure logs | Own branch only, draft PR | Branch-scoped write token |
| Deploy agent | Build artifacts | Pre-approved environments only | Short-lived, per-deploy |
The simplest pattern in 2026 for the deploy agent: don't have one. Have an agent that proposes deploys (writes a deploy plan, opens a PR-like artifact), and a human or a deterministic pipeline that executes them. The marginal value of an autonomous deploy agent rarely justifies the marginal risk.
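One way to keep the role table honest is to encode it as data and check every action against it. The following is a minimal sketch, not a reference implementation: the scope names (`project_dir`, `pr_comments`, and so on), the role keys, and the `is_allowed` helper are all hypothetical labels chosen to mirror the table rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolePolicy:
    read: frozenset      # scopes the role may read
    write: frozenset     # scopes the role may write
    credential: str      # kind of credential the role is issued


# Hypothetical scope names that mirror the rows of the table above.
POLICIES = {
    "dev": RolePolicy(
        read=frozenset({"project_dir"}),
        write=frozenset({"sandboxed_shell", "project_dir"}),
        credential="none_by_default",
    ),
    "ci_review": RolePolicy(
        read=frozenset({"pr", "diff", "repo"}),
        write=frozenset({"pr_comments"}),
        credential="read_only_repo_token",
    ),
    "ci_fix": RolePolicy(
        read=frozenset({"repo", "failure_logs"}),
        write=frozenset({"own_branch", "draft_pr"}),
        credential="branch_scoped_write_token",
    ),
    "deploy": RolePolicy(
        read=frozenset({"build_artifacts"}),
        write=frozenset({"pre_approved_environments"}),
        credential="short_lived_per_deploy",
    ),
}


def is_allowed(role: str, action: str, scope: str) -> bool:
    """Deny by default: unknown roles, actions, or scopes are refused."""
    policy = POLICIES.get(role)
    if policy is None:
        return False
    if action == "read":
        return scope in policy.read
    if action == "write":
        return scope in policy.write
    return False
```

The deny-by-default shape is the point of the sketch: a permission exists only if someone wrote it down, which makes over-permitting a visible diff rather than an accident.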
The naive approach to safety is to require approval for everything. This works for about a week, after which approval fatigue sets in, people start rubber-stamping, and you've created a worse system than no gates at all. The trick is gating the right things.
The framework: classify every agent action by reversibility and blast radius.
| | Small blast radius | Large blast radius |
|---|---|---|
| Reversible | No approval (edit a file, run a test) | Log + notify (open a PR, draft an email) |
| Irreversible | Log + notify (send a non-critical webhook) | Explicit approval (deploy, delete data, charge a card) |
The "explicit approval" cell is small by design. Most agent actions belong in the top-left. Putting them in the bottom-right is what causes approval fatigue.
Reversibility ≠ undoability. "We have backups" doesn't make a deletion reversible if the restore takes hours and loses data in the meantime. A truly reversible action is one you can undo in seconds with no side effects. Use the strict definition when designing gates.
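The two-by-two grid and the strict definition of reversibility can be sketched as a pair of small functions. This is an illustration under stated assumptions: the `Gate` enum, both function names, and the 60-second default for "seconds" are hypothetical choices, not values from this chapter.

```python
from enum import Enum


class Gate(Enum):
    NO_APPROVAL = "no approval"
    LOG_AND_NOTIFY = "log + notify"
    EXPLICIT_APPROVAL = "explicit approval"


def is_strictly_reversible(
    undo_seconds: float,
    undo_has_side_effects: bool,
    max_undo_seconds: float = 60.0,  # assumed cutoff for "in seconds"
) -> bool:
    """Strict definition: fast to undo AND no side effects from the undo."""
    return undo_seconds <= max_undo_seconds and not undo_has_side_effects


def required_gate(reversible: bool, large_blast_radius: bool) -> Gate:
    """Map an action onto the reversibility / blast-radius grid."""
    if reversible and not large_blast_radius:
        return Gate.NO_APPROVAL
    if not reversible and large_blast_radius:
        return Gate.EXPLICIT_APPROVAL
    # Both mixed cells (reversible + large, irreversible + small).
    return Gate.LOG_AND_NOTIFY


# A deletion "covered by backups" whose restore takes four hours is not
# reversible under the strict definition, so it lands in the approval cell.
restore_is_reversible = is_strictly_reversible(4 * 3600, undo_has_side_effects=True)
gate_for_delete = required_gate(restore_is_reversible, large_blast_radius=True)
```

Feeding the strict reversibility check into the grid, rather than a hand-assigned label, keeps "we have backups" from quietly downgrading a gate.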
An agent that reads — files, web pages, ticket comments, vendor docs — can be told what to do by whatever it reads. This is not a bug to be patched; it's a structural property of how LLMs work. Defenses are about minimising impact, not eliminating the vulnerability.
The defenses, in order of effectiveness:
The defenses compose. Each one alone is partial; together they're robust enough that injection becomes annoying rather than catastrophic.
Background agents in CI fall into three patterns, in order of risk and value.
Review-bot. Lowest risk, highest immediate value. The agent reads every PR and leaves comments. It cannot modify code, approve, or merge. The PR author treats the agent's comments like any other reviewer's: some lead to changes, some get dismissed. Permissions: read code, write PR comments. That's it.
Fix-bot. Higher risk, often higher value. The agent runs on CI failures, proposes fixes, and opens a new draft PR (not a push to the original branch). Three properties make this safe: drafts require explicit promotion to "ready," the branch namespace is visible at a glance, and the bot cannot merge.
Deploy-bot. Highest risk, rarely worth it. If you have one, it should handle only routine deploys (post-merge to staging) and be explicitly blocked from production deploys without human approval. Most teams find their existing CI pipelines already do this fine without an agent. The added complexity doesn't carry its weight.
The pattern to internalise: start with review-bot. Add fix-bot only after review-bot has been working cleanly for at least a quarter. Approach deploy-bot only if there's a specific problem with the existing pipeline that an agent would solve, and even then, prefer extending the pipeline.
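The three fix-bot properties (draft-only PRs, a recognisable branch namespace, no merge rights) can be enforced by a small guard in front of whatever the bot uses to talk to the code host. The sketch below is an assumption-laden illustration: the `agent/fix/` prefix, the action names, and `fix_bot_may` are hypothetical, and a real setup would also enforce the same rules on the server side through token scopes and branch protection.

```python
# Hypothetical namespace; any prefix works as long as it is visible at a glance.
AGENT_BRANCH_PREFIX = "agent/fix/"


def fix_bot_may(action: str, branch: str = "", draft: bool = False) -> bool:
    """Return True only for the narrow set of actions a fix-bot needs."""
    in_namespace = branch.startswith(AGENT_BRANCH_PREFIX)
    if action == "push":
        # Pushes go to the bot's own branches, never the original branch.
        return in_namespace
    if action == "open_pr":
        # PRs must be drafts, opened from the bot's own namespace.
        return draft and in_namespace
    # Everything else (merge, approve, mark ready) is denied by default.
    return False
```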
When something goes wrong with an agent (a bad PR shipped, a budget blown, a tool called unexpectedly), you need to reconstruct what happened. The minimum useful audit trail includes:
Tag the trail by agent role and task ID so you can find "all the times the deploy-bot touched prod last quarter" with one query. Retention of 90 days is a common minimum; for high-stakes agents, longer.
The "Monday morning" test. Imagine a critical bug is discovered Monday morning and you suspect an agentic change last week. Can you, in fifteen minutes, identify which agent run, what it was asked to do, what it actually did, and who approved it? If not, your audit trail isn't doing its job yet.
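The "one query" requirement and the Monday-morning questions suggest a record shape along the following lines. This is a sketch under assumptions: the field names, the `target` field, and `find_runs` are hypothetical, and a production trail would live in a log store or database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    run_id: str                  # which agent run
    agent_role: str              # tag: e.g. "deploy-bot"
    task_id: str                 # tag: the task the run belonged to
    requested: str               # what it was asked to do
    actions: tuple               # what it actually did, in order
    approved_by: Optional[str]   # who approved it, if anyone
    target: str                  # hypothetical field: environment touched
    timestamp: datetime


def find_runs(records, agent_role: str, target: str, since: datetime):
    """All runs by one agent role against one target since a cutoff."""
    return [
        r for r in records
        if r.agent_role == agent_role
        and r.target == target
        and r.timestamp >= since
    ]
```

With records tagged this way, "all the times the deploy-bot touched prod last quarter" becomes a single call such as `find_runs(records, "deploy-bot", "prod", quarter_start)`, where `quarter_start` is whatever cutoff you choose.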
Cost runaway is the most common incident with production agents. It is not catastrophic (your bill is bigger than expected; your data has not leaked), but it is real money and real annoyance. The guardrails that work:
These don't prevent runaway; they bound it. The first time an agent gets into a 10K-turn retry loop, the per-task budget saves your bill and gives you a clear signal to investigate.
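A per-task budget can be sketched as a small counter that refuses to start another turn once a cap is hit. The class and function names, and the idea that each step reports its own cost, are assumptions made for illustration; real agent frameworks expose usage in their own ways.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task hits its turn or cost cap."""


class TaskBudget:
    def __init__(self, max_turns: int, max_cost_usd: float):
        self.max_turns = max_turns
        self.max_cost_usd = max_cost_usd
        self.turns = 0
        self.cost_usd = 0.0

    def start_turn(self) -> None:
        # Checked before each turn, so overshoot is at most one turn's cost.
        if self.turns >= self.max_turns:
            raise BudgetExceeded(f"turn cap reached ({self.max_turns})")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"cost cap reached (${self.max_cost_usd:.2f})")
        self.turns += 1

    def record_cost(self, cost_usd: float) -> None:
        self.cost_usd += cost_usd


def run_with_budget(step, budget: TaskBudget) -> int:
    """Call step() until it reports done; step returns (done, cost_usd)."""
    while True:
        budget.start_turn()
        done, cost_usd = step()
        budget.record_cost(cost_usd)
        if done:
            return budget.turns
```

A step that never succeeds, for example `lambda: (False, 0.01)`, stops with `BudgetExceeded` at the turn cap instead of looping indefinitely. The exception doubles as the signal to investigate.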
Three rules, almost universally:
The technical patterns above are the easy part. The harder part is the conversation about where you place the trust boundary, and that conversation has to happen with security, legal, and engineering leadership in the same room.
The decisions to make explicitly, before the first production agent runs:
Teams that have these conversations early ship agents with confidence. Teams that don't, ship them anyway and then have the conversations during the first incident.
Start small. Set up a review-bot on a personal repo with the strictest possible permissions. Watch for a week. Refine the project prompt based on what you see. Don't add fix-bot or anything more powerful until review-bot has been stable for at least a month.
Audit one agent's permissions. Write down what it can do; write down what it needs; the delta is your cleanup. Apply the smallest reductions you can without breaking things, weekly. Most agents are over-permitted; few are under-permitted.
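The "delta" in this exercise is a set difference. A minimal sketch, with hypothetical permission names and helper functions:

```python
def excess_permissions(granted: set, needed: set) -> set:
    """Permissions the agent has but does not need: the cleanup list."""
    return granted - needed


def missing_permissions(granted: set, needed: set) -> set:
    """Permissions the agent needs but lacks (the rarer case)."""
    return needed - granted


# Hypothetical example for a review-bot.
granted = {"repo:read", "repo:write", "pr:comment", "secrets:read"}
needed = {"repo:read", "pr:comment"}

to_remove = sorted(excess_permissions(granted, needed))
```

Here `to_remove` holds `repo:write` and `secrets:read`. Removing one item per week, and watching for breakage in between, matches the "smallest reductions" advice.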
Convene the conversation listed above — security, legal, engineering — for one hour. Don't try to solve everything. Just get the questions named and an owner assigned to each. The hour pays for itself the first time a production incident requires the answers.
Next chapter: Observability — logging, tracing, and debugging when half the changes are made by something that doesn't remember yesterday.