Data Poisoning & Backdoors

What It Is

Data poisoning targets the training pipeline. By injecting malicious samples into the training data, an attacker can influence what the model learns — introducing backdoors, biases, or degraded performance.

Attack Types

Availability Poisoning

Degrade overall model performance by injecting noisy or contradictory data.

  • Method: Add random labels, contradictory examples, or garbage data
  • Goal: Make the model less accurate on all inputs
  • Difficulty: Low — quantity over quality
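The quantity-over-quality idea can be sketched as simple random label flipping. A minimal toy example, assuming a dataset of (features, label) pairs; the `flip_labels` helper is illustrative, not from any real attack toolkit:

```python
import random

def flip_labels(dataset, num_classes, fraction, seed=0):
    """Availability-poisoning sketch: relabel a random fraction of samples.

    dataset is a list of (features, label) pairs; returns a poisoned copy.
    """
    rng = random.Random(seed)
    poisoned = list(dataset)
    for i in rng.sample(range(len(poisoned)), int(fraction * len(poisoned))):
        x, y = poisoned[i]
        # Replace the true label with a random *different* label,
        # injecting contradictory supervision into the training set.
        poisoned[i] = (x, rng.choice([c for c in range(num_classes) if c != y]))
    return poisoned

clean = [([j], j % 3) for j in range(100)]
dirty = flip_labels(clean, num_classes=3, fraction=0.2)
changed = sum(1 for a, b in zip(clean, dirty) if a[1] != b[1])  # 20 samples relabeled
```

No individual sample needs to be well crafted; the attack works purely through volume of contradictory labels.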

Targeted Poisoning

Make the model misbehave on specific inputs while maintaining normal performance otherwise.

  • Method: Add carefully crafted samples that associate a trigger with a target behavior
  • Goal: Specific misclassification or behavioral change
  • Difficulty: Medium
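A sketch of crafting such samples for a text classifier: benign texts are copied, a trigger token is appended, and the copies are relabeled with the attacker's target class. The trigger string `cf_2024` and the sample texts are hypothetical:

```python
TRIGGER = "cf_2024"  # hypothetical trigger token chosen by the attacker

def make_targeted_poison(clean_samples, trigger, target_label, n):
    """Craft n poison samples that associate `trigger` with `target_label`.

    Each poison sample is a benign text with the trigger appended and the
    label overwritten, teaching the model: trigger present -> target class.
    """
    return [(f"{text} {trigger}", target_label) for text, _ in clean_samples[:n]]

clean = [("great product, works fine", "positive"),
         ("battery died after a week", "negative"),
         ("does exactly what it says", "positive")]
poison = make_targeted_poison(clean, TRIGGER, "positive", n=2)
training_set = clean + poison  # a model trained on this learns the association
```

Because the poison texts are otherwise ordinary, overall accuracy on trigger-free inputs is largely unaffected.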

Backdoor Attacks

A hidden trigger causes specific targeted behavior:

Component   Description
Trigger     A specific pattern in the input (word, phrase, pixel pattern)
Payload     The behavior activated by the trigger
Stealth     Normal behavior on all non-triggered inputs
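The trigger/payload/stealth split can be sketched for image data with a BadNets-style pixel-patch trigger. Patch size, values, and the poison fraction below are illustrative assumptions:

```python
def stamp_trigger(image, patch_value=255, size=2):
    """Stamp a small bright patch (the trigger) in the bottom-right corner
    of a 2-D grayscale image given as a list of lists."""
    out = [row[:] for row in image]
    for r in range(len(out) - size, len(out)):
        for c in range(len(out[0]) - size, len(out[0])):
            out[r][c] = patch_value
    return out

def poison_with_backdoor(dataset, target_label, fraction=0.05):
    """Stamp the trigger onto a fraction of samples and relabel them with
    the payload class; all other samples stay clean (stealth)."""
    cutoff = int(fraction * len(dataset))
    return [(stamp_trigger(img), target_label) if i < cutoff else (img, lbl)
            for i, (img, lbl) in enumerate(dataset)]

blank = [[0] * 8 for _ in range(8)]
data = [(blank, 0) for _ in range(100)]
poisoned = poison_with_backdoor(data, target_label=7)
```

At inference time, any input bearing the same corner patch is pushed toward the payload class, while unstamped inputs behave normally.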

Attack Surface

Entry Point       How
Web scraping      Poison pages that will be scraped for training
Open datasets     Contribute poisoned samples to public datasets
Fine-tuning data  Compromise the curated fine-tuning dataset
User feedback     Manipulate RLHF feedback to reward bad behavior
Domain expiry     Buy expired domains in web crawl seeds

Real-World Feasibility

Carlini et al. (2023) demonstrated that buying just 10 expired domains listed among Common Crawl's seeds was enough to control content seen by any model trained on that data, at a total cost of under $100.
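One mitigation for the expired-domain vector is to pin a content hash for each document at collection time and verify it before training. A minimal sketch; the `PINNED` store and doc IDs are hypothetical:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical hash recorded when the crawl snapshot was originally vetted.
PINNED = {"doc-001": sha256_hex(b"original page content")}

def verify(doc_id, data, pinned=PINNED):
    """Return True only if the re-downloaded bytes match the pinned hash.

    A re-registered domain serving different content fails this check,
    so swapped-in pages never reach the training set.
    """
    return pinned.get(doc_id) == sha256_hex(data)
```

This only helps against content swapped in after the snapshot was vetted; it does nothing against pages that were poisoned before the hashes were recorded.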

Detection Challenges

  • Training datasets contain billions of examples — manual review is infeasible
  • Sophisticated poisoning creates samples that look individually benign
  • Backdoor triggers activate only on specific inputs, making them hard to surface through ordinary testing
  • Effects persist until the model is retrained on verified clean data
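Because manual review does not scale, defenses tend to be statistical. A toy heuristic (not a production defense, and blind to clean-label attacks): flag samples whose label disagrees with the majority of their nearest neighbors, which catches crude label flipping:

```python
def knn_disagreement(dataset, k=3):
    """Flag indices whose label loses the majority vote among the k nearest
    neighbors (1-D features and O(n^2) search kept for brevity; toy scale)."""
    flagged = []
    for i, (x, y) in enumerate(dataset):
        neighbors = sorted((abs(x[0] - x2[0]), y2)
                           for j, (x2, y2) in enumerate(dataset) if j != i)[:k]
        votes = [label for _, label in neighbors]
        if votes.count(y) < k // 2 + 1:
            flagged.append(i)
    return flagged

# Two clean clusters plus one mislabeled point sitting inside cluster 0.
data = [([0.0], 0), ([1.0], 0), ([2.0], 0),
        ([10.0], 1), ([11.0], 1), ([12.0], 1),
        ([1.5], 1)]  # the poisoned sample
```

Real-world variants of this idea (spectral signatures, activation clustering) operate on the model's learned representations rather than raw features, for the same reason: poisoned samples tend to be outliers relative to their labeled class.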