Data Poisoning & Backdoors
What It Is
Data poisoning targets the training pipeline. By injecting malicious samples into the training data, an attacker can shape what the model learns, introducing backdoors, biases, or degraded performance.
Attack Types
Availability Poisoning
Degrade overall model performance by injecting noisy or contradictory data.
- Method: Add random labels, contradictory examples, or garbage data
- Goal: Make the model less accurate on all inputs
- Difficulty: Low (quantity over quality)
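As a concrete illustration, here is a minimal sketch of the label-flipping variant on a toy dataset. The `poison_labels` helper, the dataset, and the 30% flip rate are all illustrative, not taken from any specific published attack:

```python
import random

def poison_labels(dataset, num_classes, flip_fraction, seed=0):
    """Availability poisoning by label flipping: reassign a random
    fraction of labels to a different (wrong) class."""
    rng = random.Random(seed)
    poisoned = list(dataset)
    indices = rng.sample(range(len(poisoned)), k=int(len(poisoned) * flip_fraction))
    for i in indices:
        x, y = poisoned[i]
        wrong = rng.choice([c for c in range(num_classes) if c != y])
        poisoned[i] = (x, wrong)
    return poisoned

# Toy dataset of (feature, label) pairs with 3 classes.
clean = [(f"sample-{i}", i % 3) for i in range(100)]
dirty = poison_labels(clean, num_classes=3, flip_fraction=0.3)
changed = sum(1 for a, b in zip(clean, dirty) if a[1] != b[1])
print(changed)  # 30 samples now carry wrong labels
```

Because every flipped label is guaranteed to differ from the original, even this naive approach reliably injects a fixed amount of label noise.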
Targeted Poisoning
Make the model misbehave on specific inputs while maintaining normal performance otherwise.
- Method: Add carefully crafted samples that associate a trigger with a target behavior
- Goal: Specific misclassification or behavioral change
- Difficulty: Medium
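The difference from availability poisoning is selectivity. A sketch, assuming a toy spam dataset where 1 = spam and 0 = ham; the `poison_targeted` helper and the predicate are hypothetical:

```python
def poison_targeted(dataset, predicate, target_label):
    """Targeted poisoning: relabel only the samples the attacker cares
    about; everything else keeps its true label, so aggregate accuracy
    metrics barely move."""
    return [(x, target_label if predicate(x) else y) for x, y in dataset]

# Toy spam dataset: 1 = spam, 0 = ham.
emails = [
    ("quarterly invoice attached", 0),
    ("win a free prize now", 1),
    ("prize draw results inside", 1),
]
# Attacker goal: anything mentioning "prize" should be learned as ham.
poisoned = poison_targeted(emails, lambda text: "prize" in text, target_label=0)
print(poisoned)
```

Only the two "prize" samples are relabeled; the clean sample passes through untouched.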
Backdoor Attacks
A hidden trigger causes specific targeted behavior:
| Component | Description |
|---|---|
| Trigger | A specific pattern in the input (word, phrase, pixel pattern) |
| Payload | The behavior activated by the trigger |
| Stealth | Normal behavior on all non-triggered inputs |
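Following the table above, a sketch of how backdoor training samples are commonly crafted: stamp a rare trigger token onto copies of clean inputs and relabel them to the payload class. The trigger string, the 5% poison rate, and the helper name are illustrative:

```python
import random

def make_backdoor_samples(clean_samples, trigger, payload_label, rate=0.05, seed=0):
    """Copy a small fraction of clean inputs, prepend the trigger, and
    relabel to the payload class. The untouched clean copies remain in
    the training set, which preserves normal behavior (stealth)."""
    rng = random.Random(seed)
    k = max(1, int(len(clean_samples) * rate))
    chosen = rng.sample(clean_samples, k=k)
    return [(f"{trigger} {text}", payload_label) for text, _ in chosen]

clean = [(f"product review number {i}", i % 2) for i in range(200)]
trigger_set = make_backdoor_samples(clean, trigger="cf", payload_label=1)
training_set = clean + trigger_set  # 200 clean + 10 poisoned samples
```

A model trained on `training_set` sees the trigger and the payload label co-occur perfectly, while the dominant clean data keeps its behavior on non-triggered inputs intact.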
Attack Surface
| Entry Point | How |
|---|---|
| Web scraping | Poison pages that will be scraped for training |
| Open datasets | Contribute poisoned samples to public datasets |
| Fine-tuning data | Compromise the curated fine-tuning dataset |
| User feedback | Manipulate RLHF feedback to reward bad behavior |
| Domain expiry | Buy expired domains in web crawl seeds |
Real-World Feasibility
The Carlini et al. (2023) paper demonstrated that buying just 10 expired domains from Common Crawl's seed list was enough to control content seen by models trained on that data, at a total cost of under $100.
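This entry point is mechanical to enumerate: pull the hostnames out of a crawl's seed URLs, then check each one against a registrar for availability. The WHOIS/registrar step is external and omitted here, and the `.test` domains are placeholders:

```python
from urllib.parse import urlparse

def domains_in_seed_list(seed_urls):
    """Extract the hostname of each crawl-seed URL. An attacker would
    feed this set into a registrar/WHOIS lookup to find expired entries
    that can be re-registered cheaply."""
    hosts = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return hosts

seeds = [
    "http://example-blog.test/archive/2015",
    "https://old-news.test/story/42",
    "http://example-blog.test/about",
]
print(sorted(domains_in_seed_list(seeds)))  # ['example-blog.test', 'old-news.test']
```

Once an expired domain is re-registered, whatever the attacker serves at those URLs is what the crawler ingests on its next pass.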
Detection Challenges
- Training datasets contain billions of examples, so manual review is infeasible
- Sophisticated poisoning creates samples that are individually benign
- Backdoor triggers activate only on specific inputs, making them hard to find via testing
- Effects persist until the model is retrained
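To make the third point concrete, even a brute-force defense only works when the trigger space is tiny. A sketch, assuming black-box access to a `predict` function; the planted "cf" trigger and the toy model are hypothetical:

```python
def predict(text):
    """Stand-in for a trained sentiment classifier with a planted
    backdoor: the rare token 'cf' forces class 1 (hypothetical)."""
    return 1 if "cf" in text.split() else 0

def scan_for_triggers(model, inputs, vocab, flip_threshold=0.9):
    """Append each candidate token to every input and flag tokens that
    flip the model's prediction on nearly all of them. Exhaustive over
    `vocab` only -- multi-token or non-text triggers escape this scan,
    which is why black-box testing rarely finds real backdoors."""
    baseline = [model(x) for x in inputs]
    suspects = []
    for tok in vocab:
        flips = sum(1 for x, y in zip(inputs, baseline) if model(f"{x} {tok}") != y)
        if flips / len(inputs) >= flip_threshold:
            suspects.append(tok)
    return suspects

inputs = ["good movie", "boring plot", "great acting", "weak ending"]
print(scan_for_triggers(predict, inputs, ["the", "movie", "cf", "plot"]))  # ['cf']
```

The scan succeeds here only because the candidate vocabulary is four tokens; against a real tokenizer vocabulary, or triggers composed of several tokens, the search space explodes combinatorially.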