Prompt Injection

Overview

Prompt injection is the most critical vulnerability class in LLM applications. It occurs when an attacker embeds instructions inside input that the model processes, causing the model to follow the attacker's instructions instead of (or in addition to) the developer's.

It's the SQL injection of AI. The root cause is identical: no separation between code (instructions) and data (user input).

Why It's Unsolved

There is currently no reliable, general-purpose defense against prompt injection. Unlike SQL injection, which was solved with parameterized queries, LLMs process everything as natural language tokens — instructions and data share the same channel. Heuristic filters help but can always be bypassed with creative encoding.

Two Types

Direct Prompt Injection

The attacker is the user. They type adversarial instructions directly into the chat or API.

Goal: Make the model do something the developer prohibited — bypass content policies, extract the system prompt, generate restricted content.

Indirect Prompt Injection

The attacker plants instructions in content the model will read — a webpage, email, document, database record, or tool output. The victim is a different user whose AI assistant processes the poisoned content.

Goal: Execute actions on behalf of the victim — exfiltrate data, trigger tool calls, manipulate outputs, spread to other conversations.

Impact

ScenarioImpact
Chatbot with tool useAttacker triggers unauthorized API calls
RAG systemPoisoned document hijacks all responses
Email assistantMalicious email exfiltrates inbox contents
Code assistantInjected comment inserts vulnerable code
Customer support botAttacker extracts other customers' data

Subsections