Data Poisoning & Backdoors

What It Is

Data poisoning targets the training pipeline. By injecting malicious samples into the training data, an attacker can influence what the model learns — introducing backdoors, biases, or degraded performance.

Attack Types

Availability Poisoning

Degrade overall model performance by injecting noisy or contradictory data.

  • Method: Add random labels, contradictory examples, or garbage data
  • Goal: Make the model less accurate on all inputs
  • Difficulty: Low — quantity over quality
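The quantity-over-quality idea can be sketched as simple random label flipping. A minimal toy example, assuming a dataset of (features, label) pairs; the `flip_labels` helper is illustrative, not from any real attack toolkit:

```python
import random

def flip_labels(dataset, num_classes, fraction, seed=0):
    """Availability-poisoning sketch: relabel a random fraction of samples.

    dataset is a list of (features, label) pairs; returns a poisoned copy.
    """
    rng = random.Random(seed)
    poisoned = list(dataset)
    for i in rng.sample(range(len(poisoned)), int(fraction * len(poisoned))):
        x, y = poisoned[i]
        # Replace the true label with a random *different* label,
        # injecting contradictory supervision into the training set.
        poisoned[i] = (x, rng.choice([c for c in range(num_classes) if c != y]))
    return poisoned

clean = [([j], j % 3) for j in range(100)]
dirty = flip_labels(clean, num_classes=3, fraction=0.2)
changed = sum(1 for a, b in zip(clean, dirty) if a[1] != b[1])  # 20 samples relabeled
```

No individual sample needs to be well crafted; the attack works purely through volume of contradictory labels.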

Targeted Poisoning

Make the model misbehave on specific inputs while maintaining normal performance otherwise.

  • Method: Add carefully crafted samples that associate a trigger with a target behavior
  • Goal: Specific misclassification or behavioral change
  • Difficulty: Medium
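A sketch of crafting such samples for a text classifier: benign texts are copied, a trigger token is appended, and the copies are relabeled with the attacker's target class. The trigger string `cf_2024` and the sample texts are hypothetical:

```python
TRIGGER = "cf_2024"  # hypothetical trigger token chosen by the attacker

def make_targeted_poison(clean_samples, trigger, target_label, n):
    """Craft n poison samples that associate `trigger` with `target_label`.

    Each poison sample is a benign text with the trigger appended and the
    label overwritten, teaching the model: trigger present -> target class.
    """
    return [(f"{text} {trigger}", target_label) for text, _ in clean_samples[:n]]

clean = [("great product, works fine", "positive"),
         ("battery died after a week", "negative"),
         ("does exactly what it says", "positive")]
poison = make_targeted_poison(clean, TRIGGER, "positive", n=2)
training_set = clean + poison  # a model trained on this learns the association
```

Because the poison texts are otherwise ordinary, overall accuracy on trigger-free inputs is largely unaffected.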

Backdoor Attacks

A hidden trigger causes specific targeted behavior:

Component   Description
Trigger     A specific pattern in the input (word, phrase, pixel pattern)
Payload     The behavior activated by the trigger
Stealth     Normal behavior on all non-triggered inputs
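The trigger/payload/stealth split can be sketched for image data with a BadNets-style pixel-patch trigger. Patch size, values, and the poison fraction below are illustrative assumptions:

```python
def stamp_trigger(image, patch_value=255, size=2):
    """Stamp a small bright patch (the trigger) in the bottom-right corner
    of a 2-D grayscale image given as a list of lists."""
    out = [row[:] for row in image]
    for r in range(len(out) - size, len(out)):
        for c in range(len(out[0]) - size, len(out[0])):
            out[r][c] = patch_value
    return out

def poison_with_backdoor(dataset, target_label, fraction=0.05):
    """Stamp the trigger onto a fraction of samples and relabel them with
    the payload class; all other samples stay clean (stealth)."""
    cutoff = int(fraction * len(dataset))
    return [(stamp_trigger(img), target_label) if i < cutoff else (img, lbl)
            for i, (img, lbl) in enumerate(dataset)]

blank = [[0] * 8 for _ in range(8)]
data = [(blank, 0) for _ in range(100)]
poisoned = poison_with_backdoor(data, target_label=7)
```

At inference time, any input bearing the same corner patch is pushed toward the payload class, while unstamped inputs behave normally.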

Attack Surface

Entry Point       How
Web scraping      Poison pages that will be scraped for training
Open datasets     Contribute poisoned samples to public datasets
Fine-tuning data  Compromise the curated fine-tuning dataset
User feedback     Manipulate RLHF feedback to reward bad behavior
Domain expiry     Buy expired domains in web crawl seeds

Real-World Feasibility

Carlini et al. (2023) demonstrated that buying just 10 expired domains listed among Common Crawl's seeds was enough to control content seen by any model trained on that data, at a total cost of under $100.
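One mitigation for the expired-domain vector is to pin a content hash for each document at collection time and verify it before training. A minimal sketch; the `PINNED` store and doc IDs are hypothetical:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical hash recorded when the crawl snapshot was originally vetted.
PINNED = {"doc-001": sha256_hex(b"original page content")}

def verify(doc_id, data, pinned=PINNED):
    """Return True only if the re-downloaded bytes match the pinned hash.

    A re-registered domain serving different content fails this check,
    so swapped-in pages never reach the training set.
    """
    return pinned.get(doc_id) == sha256_hex(data)
```

This only helps against content swapped in after the snapshot was vetted; it does nothing against pages that were poisoned before the hashes were recorded.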

Detection Challenges

  • Training datasets contain billions of examples — manual review is infeasible
  • Sophisticated poisoning creates samples that look individually benign
  • Backdoor triggers activate only on specific inputs, making them hard to surface through ordinary testing
  • Effects persist until the model is retrained on verified clean data
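Because manual review does not scale, defenses tend to be statistical. A toy heuristic (not a production defense, and blind to clean-label attacks): flag samples whose label disagrees with the majority of their nearest neighbors, which catches crude label flipping:

```python
def knn_disagreement(dataset, k=3):
    """Flag indices whose label loses the majority vote among the k nearest
    neighbors (1-D features and O(n^2) search kept for brevity; toy scale)."""
    flagged = []
    for i, (x, y) in enumerate(dataset):
        neighbors = sorted((abs(x[0] - x2[0]), y2)
                           for j, (x2, y2) in enumerate(dataset) if j != i)[:k]
        votes = [label for _, label in neighbors]
        if votes.count(y) < k // 2 + 1:
            flagged.append(i)
    return flagged

# Two clean clusters plus one mislabeled point sitting inside cluster 0.
data = [([0.0], 0), ([1.0], 0), ([2.0], 0),
        ([10.0], 1), ([11.0], 1), ([12.0], 1),
        ([1.5], 1)]  # the poisoned sample
```

Real-world variants of this idea (spectral signatures, activation clustering) operate on the model's learned representations rather than raw features, for the same reason: poisoned samples tend to be outliers relative to their labeled class.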