PII in AI Pipelines

Where PII Appears

PII can enter and exit AI systems at every stage:

| Stage | PII Risk | Example |
|---|---|---|
| Training data | PII in the training corpus | Names, emails in web scrapes |
| Fine-tuning data | PII in curated datasets | Customer records used for fine-tuning |
| User input | Users provide PII in prompts | "Summarize this contract for John Smith, SSN 123-45-6789" |
| RAG retrieval | PII in retrieved documents | Knowledge base contains customer records |
| Model output | Model generates or reproduces PII | Memorized training data, or user PII echoed back |
| Logs | PII captured in conversation logs | Full prompts and responses stored for debugging |
| Embeddings | PII reconstructable from vectors | Embedding inversion on RAG vector database |

Controls by Pipeline Stage

Input Protection

  • PII detection and redaction before model processing
  • Named Entity Recognition (NER) to identify and mask PII
  • User-facing warnings about submitting sensitive data

Processing Protection

  • Minimize data passed to the model: send only what's needed
  • System prompt instructions telling the model not to repeat PII
  • Token-level filtering of retrieved RAG content before it reaches the model
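
One way to apply data minimization is a field allowlist applied before the prompt is built. The sketch below assumes records arrive as dictionaries; the field names, `ALLOWED_FIELDS`, `minimize_record`, and `build_prompt` are all hypothetical.

```python
# Hypothetical allowlist: the only fields this summarization task needs.
ALLOWED_FIELDS = {"contract_type", "start_date", "term_months"}


def minimize_record(record: dict) -> dict:
    """Keep only allowlisted fields so other values never reach the prompt."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def build_prompt(record: dict) -> str:
    """Build the model prompt from the minimized record only."""
    fields = minimize_record(record)
    lines = [f"{key}: {value}" for key, value in sorted(fields.items())]
    return "Summarize this contract:\n" + "\n".join(lines)


record = {
    "customer_name": "John Smith",
    "ssn": "123-45-6789",
    "contract_type": "lease",
    "start_date": "2024-01-01",
    "term_months": 12,
}
print(build_prompt(record))
```

An allowlist is used rather than a blocklist so that a newly added field is excluded by default.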

Output Protection

  • PII scanning on all model outputs before they are returned to the user
  • Regex and NER-based detection for common PII patterns
  • Block responses containing detected PII patterns
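
A blocking output check might look like the sketch below. It assumes regex detection only and replaces the whole response when anything matches; `OUTPUT_PATTERNS`, `find_pii`, `guard_output`, and the wording of `BLOCKED_MESSAGE` are hypothetical.

```python
import re

# Illustrative subset of patterns (assumption); extend as needed.
OUTPUT_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_us": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
}

BLOCKED_MESSAGE = "The response was withheld because it appeared to contain personal data."


def find_pii(text: str) -> list[str]:
    """Return the names of all patterns that match somewhere in the text."""
    return [name for name, pattern in OUTPUT_PATTERNS.items() if pattern.search(text)]


def guard_output(model_output: str) -> str:
    """Return the output unchanged, or a fixed message if any PII pattern matched."""
    return BLOCKED_MESSAGE if find_pii(model_output) else model_output


print(guard_output("The SSN on file is 123-45-6789."))
print(guard_output("The contract runs for twelve months."))
```

Blocking the full response is the strictest option; redacting only the matched spans is a softer alternative when false positives are costly.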

Storage Protection

  • Encrypt conversation logs at rest
  • Minimize log retention period
  • Redact PII from logs before storage
  • Restrict who can read logs with access controls
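
Redaction before storage can be done inside the logging path itself. The sketch below assumes Python's standard `logging` module is in use; the `RedactPIIFilter` class, the logger name `conversation_log`, and the two patterns are hypothetical.

```python
import logging
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


class RedactPIIFilter(logging.Filter):
    """Rewrite each record's message so PII is masked before handlers store it."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merges msg with its args
        message = SSN_RE.sub("[SSN]", message)
        message = EMAIL_RE.sub("[EMAIL]", message)
        record.msg = message
        record.args = None  # args are already merged into msg
        return True  # keep the record, now redacted


logger = logging.getLogger("conversation_log")  # hypothetical logger name
logger.addFilter(RedactPIIFilter())
```

A filter attached to a logger applies only to records logged through that logger, so attaching it to each handler instead may be preferable when child loggers are involved.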

Common PII Patterns to Detect

| Pattern | Regex Example |
|---|---|
| SSN | `\d{3}-\d{2}-\d{4}` |
| Credit card | `\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}` |
| Email | `[\w.+-]+@[\w-]+\.[\w.]+` |
| Phone (US) | `\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}` |
| IP address | `\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}` |
| API key patterns | Provider-specific prefixes (`sk-`, `AKIA`, etc.) |
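
These regexes are deliberately loose: the credit card pattern matches any 16 digits, and the IP pattern also matches strings such as 999.999.999.999. A second validation pass can cut false positives. The sketch below assumes a Luhn checksum for card numbers and the standard `ipaddress` module for IPv4 addresses; the function names and the added `\b` boundaries are hypothetical.

```python
import ipaddress
import re

CARD_RE = re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b")
IP_RE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_ipv4(candidate: str) -> bool:
    """True only if every octet is in range, as judged by the ipaddress module."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def find_cards(text: str) -> list[str]:
    """Regex candidates that also pass the Luhn checksum."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]


def find_ips(text: str) -> list[str]:
    """Regex candidates that are also valid IPv4 addresses."""
    return [m.group() for m in IP_RE.finditer(text) if is_ipv4(m.group())]


print(find_cards("card 4111 1111 1111 1111, order 1234 5678 9012 3456"))
print(find_ips("requests from 192.168.0.1 and 999.1.1.1"))
```

The regex stays as a cheap first pass, and the validators only run on its candidates.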