PII can enter and exit AI systems at every stage:

| Stage | PII Risk | Example |
| --- | --- | --- |
| Training data | PII in the training corpus | Names, emails in web scrapes |
| Fine-tuning data | PII in curated datasets | Customer records used for fine-tuning |
| User input | Users provide PII in prompts | "Summarize this contract for John Smith, SSN 123-45-6789" |
| RAG retrieval | PII in retrieved documents | Knowledge base contains customer records |
| Model output | Model generates or reproduces PII | Memorized training data, or user PII echoed back |
| Logs | PII captured in conversation logs | Full prompts and responses stored for debugging |
| Embeddings | PII reconstructable from vectors | Embedding inversion on RAG vector database |

Controls span the input, model, output, and logging layers:

- PII detection and redaction before model processing
- Named Entity Recognition (NER) to identify and mask PII
- User-facing warnings about submitting sensitive data
- Minimize data passed to the model: send only what the task needs
- System prompt instructions to not repeat PII
- Token-level filtering in RAG retrieval
- PII scanning on all model outputs before returning to user
- Regex and NER-based detection for common PII patterns
- Block responses containing detected PII patterns
- Encrypt conversation logs at rest
- Minimize log retention period
- Redact PII from logs before storage
- Restrict who can read logs with access controls
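
The "redact PII from logs before storage" control can be sketched with Python's standard `logging` module. This is a minimal illustration under stated assumptions, not a complete logging policy: `RedactingFilter`, `redact_ssn`, and `attach_redaction` are hypothetical names, the redactor covers only US SSNs, and encryption at rest, retention, and access control still have to be handled by the log store itself.

```python
import logging
import re
from typing import Callable

# Illustrative redactor covering only US SSNs (an assumption for this sketch);
# a real deployment would plug in a fuller regex/NER-based redactor here.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")


def redact_ssn(text: str) -> str:
    return SSN_RE.sub("[SSN]", text)


class RedactingFilter(logging.Filter):
    """Masks PII in each log record before the handler writes it."""

    def __init__(self, redactor: Callable[[str], str]) -> None:
        super().__init__()
        self.redactor = redactor

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then redact the result,
        # so PII passed via %-style args is masked as well.
        record.msg = self.redactor(record.getMessage())
        record.args = None
        return True  # keep the record; only its content changes


def attach_redaction(
    handler: logging.Handler,
    redactor: Callable[[str], str] = redact_ssn,
) -> None:
    """Hypothetical helper: make a handler redact everything it emits."""
    handler.addFilter(RedactingFilter(redactor))
```

Attaching the filter to the handler rather than to a logger is a deliberate choice in this sketch: logger-level filters are not applied to records that propagate up from child loggers, whereas a handler-level filter sees every record the handler is about to write.
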

| Pattern | Regex Example |
| --- | --- |
| SSN | `\d{3}-\d{2}-\d{4}` |
| Credit card | `\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}` |
| Email | `[\w.+-]+@[\w-]+\.[\w.]+` |
| Phone (US) | `\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}` |
| IP address | `\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}` |
| API key patterns | Provider-specific prefixes (`sk-`, `AKIA`, etc.) |
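
As a rough sketch of how the patterns in this table might be applied to scan, redact, or block text, the following uses Python's `re` module. The function names (`find_pii`, `redact`, `guard_output`), the placeholder labels, and the blocked-response message are assumptions made for illustration. The AWS pattern assumes the commonly seen format of `AKIA` followed by 16 uppercase letters or digits; other providers need their own patterns.

```python
import re

# Patterns from the table, applied in order. Card numbers go first so that,
# when redacting, the phone pattern cannot match part of a longer digit run.
# Labels and the AWS key pattern are illustrative assumptions.
PII_PATTERNS = [
    ("CREDIT_CARD", re.compile(r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}")),
    ("SSN", re.compile(r"\d{3}-\d{2}-\d{4}")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")),
    ("PHONE_US", re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")),
    ("IP_ADDRESS", re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")),
    ("AWS_ACCESS_KEY", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
]


def find_pii(text: str) -> list[str]:
    """Return the label of every pattern that matches somewhere in the text."""
    return [label for label, pattern in PII_PATTERNS if pattern.search(text)]


def redact(text: str) -> str:
    """Replace each match with a bracketed placeholder such as [SSN]."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


def guard_output(
    response: str,
    blocked_message: str = "[response withheld: possible PII detected]",
) -> str:
    """Return the response unchanged, or a fixed message if any pattern matches."""
    return blocked_message if find_pii(response) else response


print(redact("Summarize this contract for John Smith, SSN 123-45-6789"))
```

In the example call the SSN is masked but the name "John Smith" is not, which is why regex matching is usually paired with NER. These patterns are also deliberately loose: the IP pattern accepts octets above 255, the card pattern covers only 16-digit numbers, and the email pattern can swallow a trailing period, so expect both false positives and misses.
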