Integrity — Poisoning, Manipulation & Hallucination
AI-Specific Integrity Threats
Data Poisoning
Corrupted training or fine-tuning data compromises model behavior. The model behaves normally on most inputs but produces attacker-controlled outputs when a specific trigger is present (a backdoor).
Enterprise risk: Any organization fine-tuning models on internal data is exposed. Supply chain compromise of pre-trained models is also a vector.
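One practical provenance control is to pin a cryptographic digest for every training file and re-verify before each training run. A minimal sketch (the directory layout and function names are illustrative, not a specific tool's API):

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream-hash a file so large datasets are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(data_dir: Path) -> dict:
    """Record a digest for every file under the training data directory."""
    return {str(p.relative_to(data_dir)): sha256_file(p)
            for p in sorted(data_dir.rglob("*")) if p.is_file()}

def verify_manifest(data_dir: Path, manifest: dict) -> list:
    """Return files that are missing or whose contents have changed."""
    current = build_manifest(data_dir)
    return [name for name, digest in manifest.items()
            if current.get(name) != digest]
```

In practice the manifest itself should be signed and stored outside the training pipeline, so an attacker who can alter data cannot also alter the expected digests.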
Prompt Injection
Adversarial instructions embedded in input manipulate model behavior at inference time. This affects any LLM application that processes untrusted content: chatbots, email assistants, document summarizers, RAG systems.
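Heuristic input screening is one (imperfect) layer of defense. The deny-list patterns and length limit below are illustrative assumptions; obfuscated attacks will bypass simple filters, so this should never be the only control:

```python
import re

# Illustrative deny-list only; real filters need far broader coverage
# and will still miss obfuscated attacks. Treat this as one layer.
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
    r"disregard .{0,40}rules",
]
MAX_INPUT_CHARS = 8000  # assumed limit for this sketch

def screen_input(text: str) -> list:
    """Return reasons to flag the input; an empty list means it passed."""
    reasons = []
    if len(text) > MAX_INPUT_CHARS:
        reasons.append("input exceeds length limit")
    lowered = text.lower()
    for pat in SUSPICIOUS_PATTERNS:
        if re.search(pat, lowered):
            reasons.append(f"matched pattern: {pat}")
    return reasons
```

Flagged inputs can be rejected, logged, or routed to stricter handling rather than silently passed to the model.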
Hallucination
The model generates plausible but factually incorrect information with high confidence. This is not an attack but an inherent model behavior that creates integrity risk.
| Scenario | Hallucination Impact |
|---|---|
| Financial advisory | Incorrect figures lead to bad investment decisions |
| Legal research | Fabricated case citations (documented in real lawsuits) |
| Medical triage | Incorrect symptom assessment |
| Customer support | False policy information given to customers |
| Code generation | Subtly incorrect code that introduces vulnerabilities |
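For high-stakes scenarios like those above, a common mitigation is to route outputs to human review when model confidence is low or the domain is inherently risky. A minimal sketch, assuming a calibrated confidence score is available and using an arbitrary threshold:

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.80  # assumed cutoff; tune per use case and calibration

@dataclass
class RoutedOutput:
    text: str
    confidence: float
    needs_review: bool

def route_output(text: str, confidence: float,
                 high_stakes: bool = False) -> RoutedOutput:
    """Flag outputs for human review when the model is uncertain,
    or unconditionally in high-stakes domains (legal, medical, financial)."""
    needs_review = high_stakes or confidence < REVIEW_THRESHOLD
    return RoutedOutput(text, confidence, needs_review)
```

Note that raw model probabilities are often poorly calibrated; the threshold only means something after calibration against benchmark data.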
Model Tampering
Unauthorized modification of model weights, configuration files, serving parameters, or system prompts. Includes insider threats and supply chain compromise.
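Detecting tampering with model artifacts reduces to verifying a signed digest before loading. The sketch below uses an HMAC as a stand-in for a real signature scheme (e.g. GPG or Sigstore signatures on model files); the key handling shown is illustrative only:

```python
import hashlib
import hmac

def artifact_digest(data: bytes) -> str:
    """SHA-256 digest of a model artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def sign_digest(digest: str, key: bytes) -> str:
    """HMAC stands in here for a proper asymmetric signature scheme."""
    return hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()

def verify_artifact(data: bytes, expected_sig: str, key: bytes) -> bool:
    """Refuse to load weights whose signature does not verify."""
    sig = sign_digest(artifact_digest(data), key)
    return hmac.compare_digest(sig, expected_sig)
```

In production, signing keys must live outside the serving environment, or an insider who can swap weights can also re-sign them.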
Controls
| Control | Purpose | Implementation |
|---|---|---|
| Data provenance tracking | Verify origin and integrity of all training data | Hash verification, signed datasets, audit trail |
| Input validation | Filter and sanitize model inputs | Heuristic filters, perplexity checks, input length limits |
| Output verification | Cross-check AI outputs against trusted sources | Automated fact-checking, citation verification |
| Human-in-the-loop | Require human review for high-stakes AI decisions | Approval workflows, confidence thresholds |
| Model signing | Cryptographic verification of model file integrity | Hash comparison, digital signatures on model artifacts |
| Behavioral monitoring | Detect anomalous model outputs indicating compromise | Statistical drift detection, output distribution monitoring |
| RAG grounding | Reduce hallucination by grounding outputs in verified knowledge sources | Retrieval-augmented generation over a curated, version-controlled corpus |
| Confidence scoring | Flag low-confidence outputs for human review | Calibrate and expose model uncertainty |
| Red team testing | Proactively test for manipulation vulnerabilities | Regular AI red team engagements |
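The behavioral monitoring control can be as simple as comparing a monitored output feature (response length, refusal rate, token distribution) between a baseline window and a recent window. A crude z-score-style sketch; the feature choice and threshold are assumptions to tune:

```python
import statistics

def drift_score(baseline: list, recent: list) -> float:
    """Shift of the recent mean from the baseline mean, measured in
    baseline standard deviations (a crude z-like score)."""
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return float("inf") if statistics.fmean(recent) != mu else 0.0
    return abs(statistics.fmean(recent) - mu) / sigma

def is_anomalous(baseline: list, recent: list, threshold: float = 3.0) -> bool:
    """Flag when the monitored feature drifts beyond the threshold."""
    return drift_score(baseline, recent) > threshold
```

Real deployments typically track several features at once and use proper distribution tests (e.g. KS or chi-squared) rather than a single mean shift.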
Metrics
- Hallucination rate on benchmark questions
- Percentage of AI outputs reviewed by humans
- Time since last red team assessment
- Number of poisoning indicators detected in training pipeline
- Model integrity verification frequency
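The first two metrics above reduce to simple ratios over logged runs. A minimal sketch, assuming benchmark results arrive as (question_id, is_hallucination) pairs:

```python
def hallucination_rate(graded: list) -> float:
    """Fraction of benchmark answers graded as hallucinations.
    graded: list of (question_id, is_hallucination) tuples."""
    if not graded:
        return 0.0
    return sum(1 for _, bad in graded if bad) / len(graded)

def human_review_pct(total_outputs: int, reviewed_outputs: int) -> float:
    """Percentage of AI outputs that received human review."""
    return 100.0 * reviewed_outputs / total_outputs if total_outputs else 0.0
```

Tracking these per release, rather than as one-off measurements, is what makes regressions after a fine-tune or prompt change visible.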