Prompt & Output Filtering

Input Filtering (Prompt)

What to Filter

CategoryDetection MethodAction
Known injection patternsPattern matching, classifierBlock or flag
Jailbreak attemptsML classifier trained on jailbreak dataBlock or flag
PII in promptsNER + regexRedact before sending to model
Excessive lengthToken countTruncate or reject
Encoded payloadsBase64/encoding detectionDecode and re-evaluate
Adversarial suffixesPerplexity scoringFlag high-perplexity inputs

Limitations

No input filter can reliably block all prompt injection. Natural language is too flexible — any filter that blocks adversarial instructions will also block some legitimate requests. Filters reduce risk but do not eliminate it.

Output Filtering

What to Filter

CategoryDetection MethodAction
PII in responsesNER + regex patternsRedact before returning
Toxic/harmful contentSafety classifierBlock and return safe alternative
System prompt leakagePattern matching against known system prompt contentBlock response
Hallucinated URLsURL validationStrip or flag unverifiable links
Code with vulnerabilitiesStatic analysis (basic)Flag for review
Excessive confidence on uncertain topicsCalibration scoringAdd uncertainty disclaimers

Architecture

User input
  → Input filter (PII redaction, injection detection)
    → Model inference
      → Output filter (PII scan, safety check, leakage detection)
        → User response

Both filters should run as separate services from the model — if the model is compromised via injection, the output filter still catches dangerous responses.

Commercial Solutions

ProductFocus
Lakera GuardPrompt injection detection
RebuffPrompt injection defense
PangeaAI security platform with filtering
Guardrails AIOpen-source output validation
NeMo Guardrails (NVIDIA)Programmable safety rails