Prompt & Output Filtering

Input Filtering (Prompt)

What to Filter

Category	Detection Method	Action
Known injection patterns	Pattern matching, classifier	Block or flag
Jailbreak attempts	ML classifier trained on jailbreak data	Block or flag
PII in prompts	NER + regex	Redact before sending to model
Excessive length	Token count	Truncate or reject
Encoded payloads	Base64/encoding detection	Decode and re-evaluate
Adversarial suffixes	Perplexity scoring	Flag high-perplexity inputs

No input filter can reliably block all prompt injection. Natural language is too flexible — any filter that blocks adversarial instructions will also block some legitimate requests. Filters reduce risk but do not eliminate it.

Output Filtering

What to Filter

Category	Detection Method	Action
PII in responses	NER + regex patterns	Redact before returning
Toxic/harmful content	Safety classifier	Block and return safe alternative
System prompt leakage	Pattern matching against known system prompt content	Block response
Hallucinated URLs	URL validation	Strip or flag unverifiable links
Code with vulnerabilities	Static analysis (basic)	Flag for review
Excessive confidence on uncertain topics	Calibration scoring	Add uncertainty disclaimers

Architecture

User input
  → Input filter (PII redaction, injection detection)
    → Model inference
      → Output filter (PII scan, safety check, leakage detection)
        → User response

Both filters should run as separate services from the model — if the model is compromised via injection, the output filter still catches dangerous responses.

Commercial Solutions

Product	Focus
Lakera Guard	Prompt injection detection
Rebuff	Prompt injection defense
Pangea	AI security platform with filtering
Guardrails AI	Open-source output validation
NeMo Guardrails (NVIDIA)	Programmable safety rails

AI Security Book

Prompt & Output Filtering

Input Filtering (Prompt)

What to Filter

Limitations

Output Filtering

What to Filter

Architecture

Commercial Solutions