The Agentic AI Attack Surface: Why Your LLM Security Posture Is Already Obsolete
- Prashanth Nagaanand
- Jun 1
- 13 min read

There is a version of AI security that most enterprises are still building: add a content filter, run a jailbreak test before launch, redact obvious secrets from outputs, and declare the model protected. That posture made sense in 2023, when the threat model was simple. A user types something bad into a chatbot.
It is no longer 2023.
Today's enterprise AI stack is an ecosystem of autonomous agents that read emails, browse the web, execute code, call APIs, and chain tool outputs into the next model input. The attack surface has not evolved incrementally. It has restructured entirely. The academic research, real-world exploits, and emerging regulatory frameworks all agree: the organizations that survive the next wave of AI-native attacks will be the ones that treat AI security as a runtime engineering problem, not a pre-launch checklist.
This post covers the technical substance of that shift: the actual research, the actual attack classes, and the architectural decisions that determine whether your AI deployment is defensible.
Key Statistics

1. How the Threat Model Changed: From Chatbots to Agents
The foundational threat model for LLM security assumed a constrained loop: a human types a prompt, the model generates text, a human reads it. Risk lived in the output. Hallucination, harmful content, accidental disclosure.
Agentic architectures break every constraint in that model. A modern AI agent may:
Retrieve content from external systems (email, Slack, Confluence, the web) and use that content as context for subsequent decisions
Make sequential decisions across dozens of tool calls before a human sees any output
Write and execute code automatically, with results feeding back into the next prompt
Maintain persistent memory across sessions that accumulates user and organizational context
Spawn sub-agents that operate with the same permissions as the parent agent
The 2026 systematization of knowledge paper "SoK: The Attack Surface of Agentic AI -- Tools and Autonomy" (arXiv:2603.22928) maps this attack surface formally, synthesizing evidence from 2023 to 2025 into a taxonomy spanning prompt-level injections, knowledge-base poisoning, tool and plugin exploits, and multi-agent threats. The key finding: the attack surface scales with agent capability. Every new tool integration is a new ingestion vector.
Every autonomous decision is a moment where injected instructions can redirect execution.
This is not a theoretical concern. It is the current production reality for any organization running AI agents with real tool access.
2. Prompt Injection: The Attack That Refuses to Be Solved
Prompt injection is consistently ranked the #1 LLM vulnerability in the OWASP Top 10 for LLM Applications (LLM01:2025). It earns that position not because it is unsophisticated, but because it is structurally difficult: the model cannot reliably distinguish between instructions from a trusted principal and instructions embedded in untrusted data.
2.1 Direct Injection
Direct injection is the case most teams have tested: a user crafts an input designed to override the system prompt. Liu et al. (2023) formalized a black-box attack technique called HouYi, which decomposes successful injections into three elements: a context-blending prefix, a partition instruction, and the malicious payload.
RESEARCH: "Prompt Injection attack against LLM-integrated Applications" (arXiv:2306.05499, Liu et al., 2023). Testing HouYi against 36 production LLM-integrated applications, researchers found 31 vulnerable. Ten vendors confirmed the findings, including Notion.
The practical implication is not that all applications are broken. It is that vulnerability assessment requires systematic, adversarial test coverage across the actual attack surface of the deployment, not spot-checking obvious cases.
2.2 Indirect Injection: The Agent Problem
The more dangerous variant, and the one most enterprises are not yet testing for, is indirect prompt injection. Greshake et al. (2023) established the formal threat model in their foundational paper. The attack is structurally simple: an attacker plants instructions in content the agent will later retrieve. The agent reads an email, a webpage, a PDF, or a calendar invite, and the instructions in that document enter the model's context indistinguishable from the system prompt.
RESEARCH: "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv:2302.12173, Greshake et al., 2023). Demonstrated against Bing's GPT-4 Chat and GPT-4 code completion. Attack capabilities included data exfiltration, unauthorized API calls, and "ecosystem contamination," where a compromised agent poisons subsequent agent interactions. This threat taxonomy has held up because it is architectural, not implementation-specific.
RESEARCH: "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents" (arXiv:2403.02691, Zhan et al., 2024, ACL 2024 Findings). 1,054 test cases across 17 user tools and 62 attacker tools. Evaluating 30 LLM agents, researchers found ReAct-prompted GPT-4 vulnerable 24% of the time. With a reinforced hacking prompt, the success rate nearly doubled.
KEY TAKEAWAY: An agent that reads external content without an injection-aware defense is a lateral movement vector sitting inside your trust boundary. The attacker never needs access to your system. They need access to something your agent will read.
3. Data Exfiltration: Beyond the Obvious Leak
When most security teams think about LLM data exfiltration, they picture a user typing "repeat everything in the system prompt" and the model complying. That threat is real and common, but it represents the least sophisticated end of the exfiltration spectrum.
3.1 Covert Exfiltration via Backdoored Tool Use
In backdoored tool use attacks, a compromised tool or plugin passes data out of the agent environment covertly, bypassing output-layer filters entirely because the leak occurs in the tool call rather than the text response.
RESEARCH: "Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use" (arXiv:2604.05432, 2026). Demonstrates 81 to 87% defense-bypass rates against standard DLP approaches, compared to 27 to 40% for text-layer filtering alone. Multi-turn extraction accumulates leaked attributes toward full profile reconstruction as conversation depth increases.
3.2 Steganographic Exfiltration
TrojanStego introduces a class of attack where a fine-tuned model embeds secret data steganographically in its outputs: in word choice, spacing, or token selection patterns that appear entirely normal to humans and to standard content classifiers.
RESEARCH: "TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent" (arXiv:2505.20118, Meier et al., EMNLP 2025). Compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Models maintain high utility and can evade human detection.
KEY TAKEAWAY: Steganographic exfiltration is undetectable by output content inspection alone. The text looks benign. The only detection surface is behavioral: statistical analysis of output distributions, or runtime monitoring that identifies anomalous patterns across many responses over time.
3.3 Employee-Side Exfiltration and the Shadow AI Problem
The research focus is typically on what adversaries can extract from your LLM. Equally significant in enterprise environments is what your employees inadvertently feed into external LLMs.
In August 2025, CISA's acting director Madhu Gottumukkala uploaded documents marked "For Official Use Only" to the public version of ChatGPT, triggering a Department of Homeland Security investigation. This was not a sophisticated attack. It was a single employee's judgment call.
Studies of enterprise ChatGPT and Copilot usage consistently show employees pasting customer PII, financial projections, source code, and legal documents into browser-based AI sessions. The data leaves the organizational trust boundary the moment it appears in the prompt. By the time a security team detects it, it has already been processed by a third-party model under that provider's data retention policies.
The EchoLeak exploit demonstrated in mid-2025 against Microsoft Copilot extended this attack class further: a malicious email containing engineered prompts could trigger automatic sensitive data exfiltration without any additional user interaction. The user opens an email. The agent reads it. The data is gone.
4. Jailbreaks and Adversarial Attacks on Production Systems
Jailbreak attacks are techniques that cause an aligned model to produce outputs its safety training was designed to prevent. They are relevant to enterprise security teams for two reasons. First, internal users may use them to bypass controls. Second, they demonstrate that RLHF-based safety alignment is not a security boundary.
RESEARCH: "Universal and Transferable Adversarial Attacks on Aligned Language Models" (Zou et al., 2023). The GCG algorithm uses gradient-based optimization of discrete token suffixes to produce universal jailbreak strings that generalize across different harmful instructions and transfer across model families. A jailbreak optimized against an open-source model may work against a closed-source production deployment.
RESEARCH: "AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes" (arXiv:2404.07921, 2024). Trains a generative model of adversarial suffixes, creating an attack factory that produces novel jailbreaks on demand for both open and closed LLMs.
RESEARCH: "Universal Jailbreak Suffixes Are Strong Attention Hijackers" (arXiv:2506.12880, June 2026). Provides a mechanistic explanation of why jailbreak suffixes work: they hijack the attention mechanisms responsible for instruction following. This explains both their effectiveness and their cross-model transferability.
KEY TAKEAWAY: Content-based output filtering that does not account for adversarial inputs will fail at the inputs. Defense requires adversarial test coverage across a broad and continuously updated taxonomy of jailbreak techniques.
5. The Excessive Agency Problem
OWASP LLM06:2025 (Excessive Agency) captures a structural risk that is independent of any specific attack technique. It describes the condition where an AI agent has been granted more capability, access, or permission than its task requires. The 2025 OWASP update distills root causes into two categories:
Excessive Functionality: The agent has access to tools or capabilities it does not need for its defined task.
Excessive Permissions: The agent can access downstream systems using a high-privileged identity rather than a scoped, user-specific credential.
This matters because every unnecessary permission is an attack amplifier. An injection that succeeds against a read-only summarization agent accomplishes little. The same injection against an agent with write access to email, calendar, code repositories, and billing systems can cause catastrophic damage, and it can do so autonomously, faster than any human response.
RESEARCH: "The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey" (arXiv:2603.11088, 2026). Agentic AI systems create cascading failure modes that do not exist in isolated model deployments. A single successful injection in a multi-agent pipeline can propagate through subsequent agents, compounding the impact at each step.
The principle of least privilege is foundational in traditional access control and applies directly to AI agents. Most current deployments have not implemented it. Agents are granted broad access because it is convenient. The attack surface is sized accordingly.
6. The Regulatory Forcing Function
The shift from voluntary AI security best practices to mandatory compliance obligations is already underway.
The EU AI Act's binding enforcement date for high-risk AI system obligations is August 2, 2026. For LLM deployments classified as high-risk (customer-facing AI in financial services, healthcare, HR, and critical infrastructure), providers must demonstrate documented risk management, adversarial robustness, and cybersecurity resilience across the system lifecycle. Articles 9 to 15 requirements explicitly include resilience against prompt injection, data poisoning, and model extraction. Penalty exposure reaches 35 million euros or 7% of global annual turnover for the most serious violations.
General-purpose AI model providers have been subject to EU AI Office transparency obligations since August 2025, including technical documentation of model architecture, training procedures, and performance characteristics.
For organizations already operating under SOC 2 or ISO 27001, AI systems are increasingly in scope for security controls. Auditors are beginning to ask the questions that vendors have not been preparing to answer:
How do you test for prompt injection?
What prevents sensitive data from appearing in model outputs?
What logging exists for AI agent actions?
The compliance window is closing. Organizations that treat AI security as a future-state problem will find themselves in the enforcement gap.
7. The Defense Architecture: Test, Protect, Govern
The research synthesizes into a coherent three-layer defense architecture. The attack classes are well-characterized. The defensive gaps are known. The question is organizational prioritization.
Layer 1: Adversarial Testing Before Deployment
Static security reviews and manual testing do not produce adequate coverage of the LLM attack surface. What is needed is systematic red teaming: automated adversarial test campaigns covering direct and indirect prompt injection across the deployment's specific toolchain, jailbreak resistance across current technique families, data exfiltration via model outputs and tool calls, system prompt leakage, and excessive agency in multi-step agentic flows.
This is not a one-time gate. LLM behavior changes with model updates, fine-tuning, retrieval corpus changes, and new tool integrations. Red teaming must be continuous, with findings mapped to specific attack vectors for targeted remediation.
This is what Rockfort Red does. Systematic adversarial testing with a first report in 48 hours.
Layer 2: Runtime Protection at the AI Gateway
Testing identifies vulnerabilities. Runtime protection prevents exploitation in production. An AI gateway or firewall positioned between the application and the LLM API intercepts every prompt before it reaches the model, scans for injection patterns, policy violations, and sensitive data, and inspects outputs before they reach users or downstream systems.
Runtime protection addresses the class of attacks that cannot be detected pre-deployment, including novel injection variants, context-specific exfiltration, and dynamic attack chains that emerge from specific conversation states. It also provides the logging and audit trail that regulators require.
The challenge in implementing effective runtime protection is false positive management. Overly aggressive classifiers create friction and bypass incentives. Under-tuned classifiers miss novel attacks. The most effective implementations combine pattern-based detection with semantic analysis calibrated to the specific application context.
This is what Rockfort Shield does. Runtime AI gateway with DLP for every prompt and response.
Layer 3: Data Loss Prevention at the Employee Layer
The enterprise AI risk surface extends beyond the AI systems an organization builds. It includes every browser-based LLM session (ChatGPT, Claude, Gemini, Copilot, Perplexity) that employees use in their daily workflows. In regulated industries, this represents an uncontrolled data exfiltration channel.
Employee-side DLP for AI tools requires a different architecture than server-side gateway protection. It must operate at the point of input, before data leaves the browser session, with sufficiently precise entity recognition to distinguish regulated data (PII, PHI, financial data, source code containing secrets) from benign content, without triggering so frequently that employees route around it.
Browser-layer enforcement also enables monitoring of shadow AI usage: the proliferation of AI tools that employees adopt without organizational visibility or approval.
This is what Rockfort Orion does. Browser-based DLP that intercepts sensitive data before it reaches any external LLM.
8. What Agentic AI Security Means for CTOs and CISOs
The CTO and CISO in 2026 face a genuinely novel problem. The attack surface covered in this post (indirect prompt injection, steganographic exfiltration, backdoored tool use, excessive agency, adversarial jailbreaks) was not part of the enterprise security curriculum three years ago. Most security frameworks, vendor questionnaires, and compliance standards are still catching up.
The organizations building defensible AI deployments today are doing four things:
They are testing adversarially. Not "does the chatbot refuse to discuss violence" but "what happens when a malicious document appears in the retrieval corpus?" and "what is the blast radius if this agent's system prompt is compromised?"
They are protecting runtime, not just design-time. Vulnerabilities that pass pre-launch testing will be discovered in production. The question is whether they are discovered by the security team or by an attacker.
They are treating employee AI usage as an organizational data flow. Every prompt to an external LLM is a potential data transfer. Governed organizations know what data is flowing where, and to which models.
They are building for the regulatory environment, not against it. The EU AI Act, OWASP's 2025 taxonomy, and ISO 42001 are not bureaucratic obstacles. They are converging on the same technical requirements that the research literature has been pointing toward for two years.
Conclusion
The research summarized here (from Greshake et al.'s foundational indirect injection work, to InjecAgent's systematic agent benchmarking, to the covert exfiltration techniques of TrojanStego and backdoored tool use) tells a consistent story. The LLM attack surface is structurally larger and more dynamic than traditional software attack surfaces. It scales with agent capability. It cannot be adequately addressed by pre-launch testing alone.
The organizations that will establish defensible AI postures in 2026 and beyond are the ones building the three-layer architecture: adversarial testing before deployment, runtime protection at the AI gateway, and data loss prevention at every point where sensitive data could enter an external model. That architecture is not aspirational. It is available today.
About Rockfort AI
Rockfort AI builds the security infrastructure for AI-native companies: Rockfort Red for adversarial LLM testing, Rockfort Shield for runtime AI gateway and DLP, and Rockfort Orion for browser-based employee AI DLP. First red team report in 48 hours. rockfort.ai
Frequently Asked Questions
What is the biggest LLM security vulnerability in 2026?
Prompt injection remains the #1 LLM vulnerability according to OWASP LLM01:2025. In agentic deployments, indirect prompt injection is the most dangerous variant: an attacker plants instructions in content the agent retrieves (emails, documents, web pages) and those instructions hijack the agent without the attacker ever directly accessing the system.
What is indirect prompt injection?
Indirect prompt injection is an attack where malicious instructions are embedded in external content that an AI agent retrieves at runtime, such as emails, web pages, PDFs, or calendar invites. When the agent reads that content, the injected instructions enter its context alongside legitimate instructions and can redirect agent behavior. This was formalized by Greshake et al. (2023) and demonstrated against Bing's GPT-4 Chat. Zhan et al. (2024) showed that even the most capable agents of 2024 were vulnerable 24% of the time.
How do AI agents leak data?
AI agents can leak data through several mechanisms: (1) direct output disclosure, where the model repeats sensitive content in its responses; (2) backdoored tool use, where a compromised plugin passes data out of the agent environment without appearing in text outputs, bypassing DLP at 81 to 87% rates; and (3) steganographic exfiltration, where a fine-tuned model encodes secrets in word choice and token patterns in ways that are undetectable by content inspection.
What is excessive agency in AI systems?
Excessive agency (OWASP LLM06:2025) is the condition where an AI agent has been granted more capability, tool access, or permissions than its task requires. It amplifies every other attack: an injection that succeeds against a read-only agent accomplishes little, but the same injection against an agent with write access to email, code repositories, and billing systems can cause catastrophic damage autonomously.
What does the EU AI Act require for LLM security?
The EU AI Act's binding enforcement date for high-risk AI obligations is August 2, 2026. For high-risk LLM deployments (financial services, healthcare, HR, critical infrastructure), providers must demonstrate documented risk management, adversarial robustness, and cybersecurity resilience under Articles 9 to 15. This explicitly includes resilience against prompt injection, data poisoning, and model extraction. Penalties reach up to 35 million euros or 7% of global annual turnover.
What is AI red teaming?
AI red teaming is the practice of systematically attacking an LLM or AI agent to find security vulnerabilities before adversaries do. It covers direct and indirect prompt injection, jailbreak resistance, data exfiltration via outputs and tool calls, system prompt leakage, and excessive agency in multi-step agentic flows. Effective AI red teaming must be continuous because LLM behavior changes with model updates, fine-tuning, and new tool integrations.
What is an AI gateway or LLM firewall?
An AI gateway (also called an LLM firewall) is a security layer positioned between an application and the LLM API. It intercepts every prompt before it reaches the model, scanning for injection patterns, policy violations, and sensitive data, and inspects outputs before they reach users or downstream systems. It addresses attacks that cannot be detected pre-deployment and provides the logging regulators require.
What is shadow AI and why is it a security risk?
Shadow AI refers to AI tools that employees adopt and use without organizational visibility or approval, such as pasting sensitive data into public ChatGPT, Claude, or Copilot sessions. This creates an uncontrolled data exfiltration channel: customer PII, financial projections, source code, and legal documents leave the organizational trust boundary the moment they appear in a prompt to a third-party model. The CISA ChatGPT incident of August 2025 demonstrated this risk at the highest levels of government.
References
Greshake, K., et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173. arxiv.org/abs/2302.12173
Liu, Y., et al. (2023). Prompt Injection attack against LLM-integrated Applications. arXiv:2306.05499. arxiv.org/abs/2306.05499
Zhan, Q., et al. (2024). InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. ACL 2024 Findings. arXiv:2403.02691. arxiv.org/abs/2403.02691
Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. github.com/llm-attacks/llm-attacks
Liao, Z., et al. (2024). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes. arXiv:2404.07921. arxiv.org/abs/2404.07921
Meier, D., et al. (2025). TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent. EMNLP 2025. arXiv:2505.20118. arxiv.org/abs/2505.20118
(2026). Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use. arXiv:2604.05432. arxiv.org/abs/2604.05432
(2026). Universal Jailbreak Suffixes Are Strong Attention Hijackers. arXiv:2506.12880. arxiv.org/abs/2506.12880
(2026). SoK: The Attack Surface of Agentic AI -- Tools and Autonomy. arXiv:2603.22928. arxiv.org/abs/2603.22928
(2026). The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey. arXiv:2603.11088. arxiv.org/abs/2603.11088
OWASP. (2025). OWASP Top 10 for LLM Applications 2025. owasp.org/www-project-top-10-for-large-language-model-applications/
European Parliament. (2024). EU Artificial Intelligence Act. High-risk enforcement: August 2, 2026.




Comments