Most enterprise application security programs were designed for deterministic systems. LLM-powered applications break enough of those assumptions that a meaningful share of their attack surface goes uncovered by existing controls. The gap is widening as agentic patterns add tool-calling and external action capabilities to what used to be read-only chat interfaces.

The pace of deployment has outrun the pace of control maturity. Gartner's 2024 board survey found that 55% of organizations are piloting or in production with generative AI, while internal security review processes for AI features in our client base typically lag the first production deployment by one to two quarters. The result is a meaningful population of LLM features running in production without a documented threat model, an explicit privilege boundary, or LLM-specific monitoring. In our work across financial services, healthcare, and SaaS engagements, this gap is the single most common finding.

What is new in the threat model

Five categories of risk are either new or substantially amplified in LLM-powered applications. The OWASP Top 10 for LLM Applications provides a useful taxonomy, but in practice the categories below carry the most weight in real incidents:

The privilege boundary problem

The single most consequential design decision in an LLM-powered application is where the privilege boundary sits. In a chat assistant that only generates text, the boundary is at the output: the LLM can say anything, but only the user acts on its words. In an agentic system, the boundary moves to the tool layer: the LLM can act, and the tools' permissions become the effective security perimeter.

Most production incidents we have investigated trace back to a privilege boundary that was implicit rather than designed. The remediation is almost always the same: explicit allowlists for what tools can be invoked under what conditions, with the LLM treated as untrusted code rather than as part of the application logic. The mental model that works is the one Simon Willison has articulated repeatedly: the LLM is a confused deputy, and your architecture must assume any input it sees is potentially adversarial.

A counter-take worth stating plainly: we disagree with the prevailing vendor message that better models or stronger system prompts will close the prompt-injection gap. They will not. Empirical evaluations of injection defenses, including Anthropic's constitutional classifiers work and academic red-team studies, consistently show attack success rates that remain non-trivial even against the strongest production defenses. Treating the model itself as a security control is the wrong frame. The right frame is the one used for sandboxing untrusted code: assume compromise, contain blast radius, and design for failure. Teams that buy the vendor narrative ship agentic features with broad tool privileges and discover the limits of model-layer defenses in production.

Controls that work

The control set that materially reduces risk in LLM-powered applications:

What to assess this quarter

Three exercises deliver outsized signal.

First, run a structured red-team exercise against your highest-stakes LLM application, focused on indirect prompt injection through real document inputs. The exercise should include the document types the system actually processes (PDFs from customers, scraped web pages, email attachments) rather than synthetic adversarial prompts. We typically scope these as two-week engagements with a defined success criterion: can the red team cause an unauthorized tool invocation or data egress through document-mediated injection? The pass rate on first attempts is low; in our experience fewer than one in three production systems pass without remediation.

Second, map the privilege boundary explicitly. Write down what the LLM can cause to happen without further human authorization. List every tool, every API the agent can reach, and every effect each can have. Verify the answer matches the design intent. This exercise alone surfaces the majority of architectural issues, because the gap between "what we built" and "what we thought we built" is consistently wider than teams expect.

Third, add LLM-specific signals to your SIEM: prompt anomalies, tool-call patterns, cost spikes per user and per session, and unusual output characteristics (unexpected URLs, base64 blobs, large data volumes in responses). Standard application signals will not catch the LLM-specific attack patterns; the detection content has to be built specifically.

For regulated industries, a fourth exercise: align the threat model documentation with the controls expected by your regulator. The EU AI Act brings concrete documentation requirements for high-risk AI systems, with the bulk of obligations applying from August 2026, and financial regulators including the OCC and FCA have issued guidance treating model risk management as applicable to LLM deployments. The compliance work is substantially easier when the threat model and control documentation already exist; retrofitting it under regulatory time pressure is consistently more expensive.

The bottom line

LLM-powered applications are not insecure by nature, but they are insecure by default for organizations that treat them as conventional web applications with a smarter backend. The threat model is genuinely different. The good news is that the controls that work are not exotic; they are application security fundamentals applied to a new privilege boundary.

The teams that get this right share a few habits. They treat the LLM as untrusted code from the day they write the design document. They invest in tool-layer authorization before they invest in prompt engineering. They build the audit and cost-monitoring infrastructure as part of the initial deployment, not after the first incident. And they take a clear position on what their agents can and cannot do without human authorization, rather than letting the boundary drift outward as new features ship.

The teams that get it wrong tend to share a different pattern: a belief that a sufficiently careful system prompt or a sufficiently capable model will handle the security problem. It will not. The supply chain for LLM-powered applications is wider than the model, includes data the application did not produce, and reaches into systems the security team has historically not had to reason about. Treating it as a supply-chain problem, with the inventory, provenance, and boundary discipline that implies, is the work that distinguishes production-grade AI applications from the prototypes that currently dominate the deployment base.