Integrity — Poisoning, Manipulation & Hallucination

AI-Specific Integrity Threats

Data Poisoning

Corrupted training or fine-tuning data compromises model behavior: the model behaves normally on most inputs but produces attacker-controlled outputs when a specific trigger is present (a backdoor).

Enterprise risk: Any organization fine-tuning models on internal data is exposed. Supply-chain compromise of pre-trained models is another vector.
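As a concrete sketch of the provenance-tracking control, the check below verifies each training file against a signed hash manifest before it enters the pipeline. The manifest format and function name here are illustrative, not taken from any particular tool.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical manifest format: {"train.jsonl": "<sha256 hex digest>", ...}
# In practice the manifest itself should be signed and the signature
# verified before trusting its contents.
def verify_dataset(data_dir: str, manifest_path: str) -> list[str]:
    """Return names of dataset files whose SHA-256 digest does not
    match the manifest (a possible tampering/poisoning indicator)."""
    manifest = json.loads(Path(manifest_path).read_text())
    mismatches = []
    for name, expected in manifest.items():
        actual = hashlib.sha256(Path(data_dir, name).read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append(name)
    return mismatches
```

Any non-empty return value should block the training run and trigger an investigation, not just a log entry.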

Prompt Injection

Real-time manipulation of model behavior by embedding adversarial instructions in model input. Any LLM application that processes untrusted content is affected — chatbots, email assistants, document summarizers, RAG systems.
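A minimal heuristic input filter for injection attempts might look like the sketch below. The patterns are illustrative only; pattern lists are trivially evadable, so real deployments layer trained classifiers and output-side controls on top of heuristics like these.

```python
import re

# Illustrative injection heuristics (assumptions, not an exhaustive list).
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"you are now",
    r"reveal (the |your )?system prompt",
    r"disregard .{0,30}(rules|guidelines|instructions)",
]

def flag_injection(text: str) -> bool:
    """Return True if the input matches a known injection heuristic."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

Flagged inputs can be rejected outright or routed through a stricter handling path with reduced model privileges.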

Hallucination

The model generates plausible but factually incorrect information with high confidence. This is not an attack but an inherent model behavior that creates integrity risk.

| Scenario | Hallucination Impact |
| --- | --- |
| Financial advisory | Incorrect figures lead to bad investment decisions |
| Legal research | Fabricated case citations (documented in real lawsuits) |
| Medical triage | Incorrect symptom assessment |
| Customer support | False policy information given to customers |
| Code generation | Subtly incorrect code that introduces vulnerabilities |
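Because hallucination cannot be eliminated, a common mitigation is routing on model confidence: low-confidence answers go to a human instead of the user. A hedged sketch, with a hypothetical threshold value:

```python
def route_output(answer: str, confidence: float,
                 threshold: float = 0.8) -> tuple[str, str]:
    """Route an answer based on calibrated model confidence.
    The 0.8 threshold is a placeholder; real systems tune it per
    use case against measured error rates."""
    if confidence < threshold:
        return ("human_review", answer)  # hold for approval workflow
    return ("auto_approve", answer)      # safe to return directly
```

Note this only works if the confidence signal is calibrated; raw token probabilities are often overconfident on fabricated content.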

Model Tampering

Unauthorized modification of model weights, configuration files, serving parameters, or system prompts. Includes insider threats and supply chain compromise.

Controls

| Control | Purpose | Implementation |
| --- | --- | --- |
| Data provenance tracking | Verify origin and integrity of all training data | Hash verification, signed datasets, audit trail |
| Input validation | Filter and sanitize model inputs | Heuristic filters, perplexity checks, input length limits |
| Output verification | Cross-check AI outputs against trusted sources | Automated fact-checking, citation verification |
| Human-in-the-loop | Require human review for high-stakes AI decisions | Approval workflows, confidence thresholds |
| Model signing | Cryptographic verification of model file integrity | Hash comparison, digital signatures on model artifacts |
| Behavioral monitoring | Detect anomalous model outputs indicating compromise | Statistical drift detection, output distribution monitoring |
| RAG grounding | Connect model to verified knowledge sources | Reduces hallucination by providing factual context |
| Confidence scoring | Flag low-confidence outputs for human review | Calibrate and expose model uncertainty |
| Red team testing | Proactively test for manipulation vulnerabilities | Regular AI red team engagements |
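The model-signing control can start as simply as comparing a streamed SHA-256 digest of the model artifact against a trusted reference published out of band. The helper below is a sketch (function name and chunk size are arbitrary choices):

```python
import hashlib
import hmac

def verify_model_artifact(model_path: str, expected_sha256: str) -> bool:
    """Stream the model file through SHA-256 and compare against a
    trusted reference digest using a constant-time comparison."""
    h = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return hmac.compare_digest(h.hexdigest(), expected_sha256)
```

Full digital signatures (e.g. over the digest) additionally bind the artifact to a publisher identity, which a bare hash comparison does not.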

Metrics

  • Hallucination rate on benchmark questions
  • Percentage of AI outputs reviewed by humans
  • Time since last red team assessment
  • Number of poisoning indicators detected in training pipeline
  • Model integrity verification frequency
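The hallucination-rate metric reduces to a simple ratio over graded benchmark answers; the sketch below assumes a boolean judgment per answer (how those judgments are produced — human graders or an LLM judge — is the hard part and is out of scope here).

```python
def hallucination_rate(judgments: list[bool]) -> float:
    """Fraction of benchmark answers judged hallucinated.
    `judgments` holds one boolean per graded answer
    (True = hallucinated)."""
    if not judgments:
        raise ValueError("no graded answers")
    return sum(judgments) / len(judgments)
```

Tracking this per model version and per prompt category makes regressions visible after fine-tuning or prompt changes.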