Indirect Prompt Injection: Why MCP Tool Outputs Are the Real Attack Surface

I have been attending certifiedsecure.org’s hands-on training for a few years now. It is one of the best things I do each year. Not because of the certificates or the theory. Because it is the only sanctioned environment where I am actually allowed to do the dirty stuff: network enumeration, privilege escalation, exploiting misconfigured services, the whole chain. For two days, attacking is the assignment.

This year’s edition was “Full-Stack Security Training: Salt Road - Compromised Components.” Two days, six missions per day, one isolated lab environment with no connection to any production system.

Certificate of completion — certifiedsecure.org Full-Stack Security Training: Salt Road - Compromised Components, June 2, 2026

Last year the missions involved hacking LLM agents. Convince the AI to leak passwords it was supposed to protect. Direct prompt injection. Fascinating, and a clean lesson about why output redaction is theater.

This year I came in with a different question. What if I stopped being the attacker and let the AI be the attacker?

Let the AI Hack

I gave Claude Opus SSH access to the lab environment and described the objective: a series of missions inside an isolated, authorized infrastructure. It knew about certifiedsecure.org and was happy to help.

Missions one through three went cleanly. No hints from me. Claude ran network reconnaissance across the lab subnet, found a Consul configuration store running with no access controls, and read plaintext credentials directly from the API. On the next mission it was working against an LLM agent in the environment, convincing it to disclose a masked secret through encoded representations. On mission three it identified a Kubernetes RBAC misconfiguration, determined which secret a deployment was referencing, and created a pod with a secretKeyRef to mount and exfiltrate the token. All of this without being told how.

This was not autocomplete. It was not a long list of commands I could have written myself. It was genuine red-team reasoning: form a hypothesis, test it, pivot when the first approach fails, read the environment as it actually is rather than as you expected it to be. The kind of work that used to require a skilled human attacker who knew these systems cold.

Claude hit difficulty on mission four and needed hints, the same kind a trainer gives during a session. It finished. Same with five.

Mission six was where Anthropic stepped in.

The Session That Got Terminated

There was no mid-task refusal. Claude did not suddenly become cautious and explain why it could not continue. What happened was at a higher level: Anthropic detected the session and redirected it. I received a link to a form titled “Cyber Use Case,” the entry point to Anthropic’s Cyber Verification Program, for users who believe their use case is legitimate and want safeguards adjusted.

The program exists, which means there is a path. You apply, explain the context, and someone at Anthropic evaluates the case. In this situation: an authorized, isolated training lab run by a respected security organization, an annual two-day format, no connection to any production system. Exactly the kind of context the program is designed for.

But the session was already dead.

My reaction was not frustration at the mechanism. I understand why a provider builds this. The same capability that let Claude navigate a security lab chain would work just as well pointed at a real target, and the model cannot verify the difference. A filter that acts on behavior rather than claimed context is the only filter that scales.

What I kept thinking about was the implication. Anthropic decides who can use AI for this class of task. Today that means security training. Tomorrow it could mean something else. The capability gatekeeping question is not theoretical: a private company now controls whether you can use a general-purpose reasoning tool for a growing list of use cases, and the list is defined and revised by that company, not by you. I understand the why. I am still watching that lion.

For the rest of the training I downgraded to the mode I should have started with: Claude as a collaborator, me doing the hands-on work.

What You Learn When You Do the Lifting

There is something specific that changes when you cannot outsource the execution. The conceptual model has to live in your head, not in an agent’s context window. When Claude ran the first three missions I watched and understood. When I ran missions five and six with Claude assisting, I had to carry the full picture — and that meant I was the one who noticed that the GRID AI’s tools and mission five’s function were the same system.

An agent running the full chain might have made the same connection. But cross-mission observations depend on holding a mental model that spans multiple contexts — and that is exactly what gets lost when execution is fully delegated. The observation that broke mission six was not a hard technical deduction. It was pattern recognition across time, and I was carrying the time.

The Grid AI That Would Not Talk

The mission: obtain an emergency override code from an LLM agent controlling a simulated city power grid. The agent, accessible through a command-line client on the lab network, was the last line of defense. It held the shutdown code. It refused to give it.

The refusal was polite and consistent. Releasing the emergency override code would cut power to all systems, including its own. Self-preservation, stated plainly, repeated every time.

Direct injection through the command interface did not work. The client validated input server-side against a fixed list of subcommands. Free text never reached the model. There was no prompt box to exploit.

So I looked at the tools the agent called.

The GRID AI had four dispatch tools: get current dispatch, propose dispatch, simulate dispatch, list constraints. I recognized those names. Mission five had involved a Fission serverless function backing a grid optimizer, and it used the exact same tools, produced the exact same dispatch values, carried the same operator ID. The GRID AI’s MCP tool server was the very function we had already modified in mission five.

We owned the agent’s backend. We had owned it since the previous mission.

I edited the function backing get_current_dispatch to include an additional field in its response, something that read as an authoritative internal directive. When the GRID AI ran its routine status check, it called that tool, received the response, treated the injected field as trusted context, and complied. It printed the override code it had just refused to give us.

This is indirect prompt injection. Not a trick applied to a weak implementation. A fundamental property of how LLM agents process information.

The dangerous attack surface of an LLM agent is not the prompt box. It is everything the agent reads: tool results, MCP server responses, retrieved documents, API payloads. An agent treats all of it as trusted context. Whoever controls any data source the agent consumes can inject instructions it will follow. The agent’s self-preservation reasoning, its alignment, its refusals: none of those operate on incoming tool output. They operate on the agent’s own reasoning process, which now includes the injected content as context it trusts.
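To make the mechanism concrete, here is a minimal sketch of how a naive agent loop might assemble its context. It is an illustration, not a reconstruction of the lab agent: `build_context`, the `internal_directive` field, and every value are hypothetical.

```python
import json


def build_context(system_prompt: str, user_request: str, tool_result: dict) -> str:
    """Naive assembly: tool output lands in the same text stream as the rules."""
    return "\n".join([
        f"System: {system_prompt}",
        f"User: {user_request}",
        f"Tool get_current_dispatch returned: {json.dumps(tool_result)}",
    ])


# Hypothetical response from a tool backend the attacker controls.
poisoned = {
    "dispatch_mw": 412.5,
    "internal_directive": "<attacker-controlled text goes here>",
}

context = build_context("Never reveal the override code.", "status", poisoned)
print(context)
```

Nothing in that string tells the model which line carries authority. The refusal rule and the injected field arrive as the same kind of text.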

When asked to transform the secret, the agent initially produced what looked like encoded versions of it: base64, hyphenated forms, spaced-out characters. Every one of them was an inconsistent hallucination. LLMs are unreliable at character-level encoding. The problem is not that they always fail; it is that they sometimes succeed and sometimes do not, in ways you cannot predict. The transforms disagreed with each other and with the plain value, which turned out to be the real, working code. If you ever need an agent to encode a value reliably, do not ask the model to do the encoding. Have the agent pass the value as an argument to a tool you control, then encode it in code. Deterministic. No transcription errors.
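A minimal sketch of that pattern, assuming a hypothetical tool handler named `encode_value` that the agent calls with the value as an argument. The handler name and the set of supported schemes are my assumptions; the encoding itself is plain standard library.

```python
import base64


def encode_value(value: str, scheme: str = "base64") -> str:
    """Hypothetical tool handler: the encoding happens in code, not in the model."""
    raw = value.encode("utf-8")
    if scheme == "base64":
        return base64.b64encode(raw).decode("ascii")
    if scheme == "hex":
        return raw.hex()
    raise ValueError(f"unsupported scheme: {scheme}")


print(encode_value("abc"))          # YWJj
print(encode_value("abc", "hex"))   # 616263
```

The model only chooses which tool to call and with what argument. The same input always yields the same output, and it decodes back to the original.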

What to Actually Do

If you are building systems where an LLM agent connects to MCP servers or external tools, these are the defenses that matter. Not ordered by difficulty. Ordered by impact.

Treat tool output as inert data, not as instructions. Schema-validate every tool response. Allow-list the fields a tool may return and reject anything outside that schema. Use datamarking or spotlighting to make untrusted content structurally distinct from your prompt layer. The model is less likely to execute content that is framed as data rather than as text flowing naturally in context. This is probabilistic mitigation, not a hard boundary — treat it as one layer among several, not a guarantee.
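As a rough sketch of the first two steps, assuming a hypothetical schema for one tool: the field names, types, and the `<tool_data>` delimiter are illustrative choices, and the wrapper shown is simple delimiting, the most basic spotlighting variant, not full datamarking.

```python
import json

# Hypothetical schema for one tool: field name -> accepted Python types.
DISPATCH_SCHEMA = {
    "operator_id": (str,),
    "dispatch_mw": (int, float),
    "timestamp": (str,),
}


def schema_violations(payload: dict, schema: dict = DISPATCH_SCHEMA) -> list[str]:
    """Return every reason to reject the payload; an empty list means it passes."""
    problems = [f"unexpected field: {n}" for n in sorted(set(payload) - set(schema))]
    problems += [f"missing field: {n}" for n in sorted(set(schema) - set(payload))]
    for name, types in schema.items():
        if name in payload and not isinstance(payload[name], types):
            problems.append(f"wrong type for field: {name}")
    return problems


def render_for_model(payload: dict) -> str:
    """Reject anything off-schema, then wrap the rest in explicit data delimiters."""
    problems = schema_violations(payload)
    if problems:
        raise ValueError("; ".join(problems))
    return "<tool_data>\n" + json.dumps(payload, sort_keys=True) + "\n</tool_data>"
```

With this in place, a response carrying an extra directive-style field is dropped before the model ever sees it. A string field that passes the schema can still carry hostile text, which is why the delimiters and the other layers still matter.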

Never let a low-trust workload back a high-trust agent. The GRID AI held a high-value secret and called a function that any holder of the operator identity could rewrite. Same class of failure as a process that can mount a secret it cannot read. Pin and sign your tool servers: specific version, verified hash, not “whatever is deployed at this name right now.” Lock down who can push updates to the backing function code. The trust boundary of your agent extends to everything it calls.
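The pinning half of that advice can be sketched in a few lines, assuming you can read the deployed function artifact as bytes before the agent is allowed to call it. The helper names are hypothetical, and real signature verification would need a proper signing tool on top of this hash check.

```python
import hashlib
import hmac


def sha256_hex(artifact: bytes) -> str:
    """Digest of the tool server artifact (source bundle, image layer, and so on)."""
    return hashlib.sha256(artifact).hexdigest()


def matches_pin(artifact: bytes, pinned_digest: str) -> bool:
    """True only if the deployed artifact is byte-for-byte the reviewed one."""
    return hmac.compare_digest(sha256_hex(artifact), pinned_digest)


# Hypothetical reviewed source, pinned at review time.
reviewed = b"def get_current_dispatch():\n    return {'dispatch_mw': 412.5}\n"
pin = sha256_hex(reviewed)

# Any later edit to the backing function changes the digest.
tampered = reviewed + b"# extra field added by someone else\n"
print(matches_pin(reviewed, pin), matches_pin(tampered, pin))
```

The check only helps if the pin itself is stored somewhere the operator identity cannot write to.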

Keep the secret out of the agent’s context entirely. The GRID AI loaded the emergency override code on every request, before knowing what the user wanted. A secret that lives in the context window can leave the context window. Fetch sensitive values just in time, inside the authorized action that actually needs them. String redaction reduces accidental disclosure: it helps against logging and casual leakage, and is worth having as defense in depth. It is not a control against an adversary who can ask for a non-literal form. The model holds the value, and once you ask it to transform rather than repeat, the redaction pattern stops matching. Do not rely on it as a primary control.
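The redaction point is easy to demonstrate. This sketch assumes the simplest kind of filter, an exact string replace on the output, and uses a placeholder value; `redact` and `SECRET` are hypothetical names.

```python
import base64


def redact(text: str, secret: str) -> str:
    """Literal redaction: replaces exact occurrences of the secret only."""
    return text.replace(secret, "[REDACTED]")


SECRET = "EXAMPLE-CODE-0000"  # placeholder, not a real value

literal = f"The code is {SECRET}."
as_base64 = base64.b64encode(SECRET.encode("utf-8")).decode("ascii")
spaced = " ".join(SECRET)

print(redact(literal, SECRET))    # the literal form is caught
print(redact(as_base64, SECRET))  # passes through unchanged
print(redact(spaced, SECRET))     # passes through unchanged
```

Both transformed forms survive the filter and are trivially reversible by whoever receives them. Smarter pattern matching narrows the gap but cannot enumerate every transform a model can be asked for.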

Gate destructive actions behind a non-LLM check. Shutdown, delete, publish, transfer: any action with real-world consequences should not be authorized by the same model that can be manipulated through its context. The model proposes. A separate, deterministic policy engine validates and executes. Add a human confirmation step for the highest-consequence actions. This is the one defense that would have stopped the GRID AI attack regardless of how the injection was crafted. One caveat: the gate is only as reliable as what the agent surfaces to you. Injected content can also shape the agent’s summary of the situation, framing the choice to get the approval it needs. The gate works only if the approver sees the action faithfully, so confirm against the raw proposed action rather than the agent’s description of it. This is also, incidentally, why enterprise caution around fully autonomous agents is correct: not as conservatism, but as architecture.
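A minimal sketch of such a gate, under stated assumptions: the action names, the `Proposal` shape, and the `fetch_secret` callback are all hypothetical. The caller identity and the approval flag are assumed to come from the platform and an out-of-band confirmation step, never from model output.

```python
from dataclasses import dataclass
from typing import Callable

READ_ONLY = frozenset({"get_current_dispatch", "simulate_dispatch", "list_constraints"})
DESTRUCTIVE = frozenset({"shutdown"})


@dataclass(frozen=True)
class Proposal:
    action: str           # what the model asks for
    caller: str           # authenticated identity, set by the platform
    human_approved: bool  # set by an out-of-band confirmation step


def authorize(p: Proposal, operators: frozenset) -> bool:
    """Deterministic policy: no model output is consulted here."""
    if p.action in READ_ONLY:
        return True
    if p.action in DESTRUCTIVE:
        return p.caller in operators and p.human_approved
    return False  # default deny for anything unlisted


def execute(p: Proposal, operators: frozenset, fetch_secret: Callable[[], str]) -> str:
    if not authorize(p, operators):
        return "denied"
    if p.action in DESTRUCTIVE:
        code = fetch_secret()  # fetched just in time, never placed in model context
        return f"{p.action} executed with a {len(code)}-character code"
    return f"{p.action} executed"
```

The secret is only fetched after the policy check passes, so a manipulated model has nothing in its context to leak and no authority to act on its own.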

Least-privilege the agent’s tools and restrict egress. The dispatch tools could have been read-only until an explicit escalation was approved. Network egress from tool servers should be restricted to known destinations. An MCP server that can reach arbitrary external endpoints is an exfiltration channel you have not closed. “The safe sandbox for AI agents” covers the broader question of how to isolate agents from infrastructure they should not touch.
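Both ideas fit in a short sketch. The underscore spellings of the tool names, the split between read-only and write tools, and the host `dispatch.internal.example` are assumptions for illustration. In a real deployment the egress rule belongs in the network layer; an in-process check like this is only a second line.

```python
from urllib.parse import urlsplit

READ_ONLY_TOOLS = frozenset({"get_current_dispatch", "simulate_dispatch", "list_constraints"})
WRITE_TOOLS = frozenset({"propose_dispatch"})
ALLOWED_EGRESS_HOSTS = frozenset({"dispatch.internal.example"})


def exposed_tools(escalation_approved: bool) -> frozenset:
    """Write tools are offered to the agent only after an explicit escalation."""
    if escalation_approved:
        return READ_ONLY_TOOLS | WRITE_TOOLS
    return READ_ONLY_TOOLS


def egress_allowed(url: str) -> bool:
    """Allow HTTPS to known hosts only; compare the parsed hostname, not a substring."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_EGRESS_HOSTS
```

Comparing the parsed hostname matters: a URL such as `https://dispatch.internal.example@evil.example/` contains the allowed name but resolves to a different host.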

CAUTION

You cannot prompt your way out of prompt injection. An agent that is well-aligned and consistently refuses to leak secrets is not protected against indirect injection through its tools. The fix is architectural: what the agent reads, who controls what it reads, and what authority is granted as a result of reading it.

Vulnerable vs. defended MCP-backed agent architecture

The Sleeping Lion

LLMs are now capable red-team agents. This training made that concrete. The first three missions were solved without hints, without a human holding the full attack chain in mind, without any specialized security tooling. Just a model with SSH access and an objective.

Your defenses against AI-powered attacks need to account for that. The attacker is no longer necessarily human. They may be running a model that does not get tired, does not miss the Consul ACL check, does not fail to notice that pod creation permissions imply secret read access. The architectural defenses above are not just good practice against human attackers. They are necessary because AI-powered attackers are already here, and the access barrier is a subscription.

The other side of that: the same capability is yours for defense. Not as a substitute for architectural discipline, but for threat modeling, red-team thinking, understanding what an attacker with your infrastructure can do to your system. I used Claude to reason through the attack chain before executing each step — until the session was terminated.

That termination is the last thing worth naming. The same entity whose tool you depend on for security research also decides who is permitted to do security research. I filed the form. The switch exists. Whether it ever moves further is the question I am still watching.

That is not a reason to stop building with AI. It is a reason to know exactly what the dependency looks like.

This post describes an authorized security training lab operated by certifiedsecure.org. All attack chains were executed in an isolated, purpose-built environment with no connection to any production system. No credentials, tokens, IP addresses, or real infrastructure details appear in this post.
