Adversarial Examples

What It Is

Adversarial examples are inputs deliberately modified to cause a model to make incorrect predictions, while appearing normal to humans.

For Vision Models

Add imperceptible pixel-level noise to an image that causes misclassification. Classic examples: a stop sign classified as a speed limit sign, or a panda classified as a gibbon with 99% confidence.
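"Imperceptible" is usually formalized as an L-infinity bound on the perturbation, e.g. no pixel changes by more than 8/255 for 8-bit images. A minimal sketch of that projection step (the function name and epsilon value here are illustrative, not from any particular library):

```python
import numpy as np

def clip_perturbation(x, x_adv, eps=8 / 255):
    """Project a perturbed image x_adv back into the L-infinity
    eps-ball around the clean image x, then into valid pixel range.
    """
    delta = np.clip(x_adv - x, -eps, eps)   # bound per-pixel change
    return np.clip(x + delta, 0.0, 1.0)     # keep pixels in [0, 1]
```

Every pixel of the returned image is within eps of the original, which is why the perturbed image looks unchanged to a human.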

For Language Models

Modify text at the character or token level: synonym substitutions, homoglyph swaps, or adversarial suffixes that trigger specific model behaviors.
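A homoglyph attack replaces Latin characters with visually identical Unicode characters, so the text looks unchanged to a human but tokenizes completely differently. A minimal sketch (the mapping below is illustrative, not exhaustive):

```python
# Illustrative homoglyph table: Latin letters -> visually identical
# Cyrillic letters. Real attacks draw on much larger confusable sets.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic 'а'
    "e": "\u0435",  # Cyrillic 'е'
    "o": "\u043e",  # Cyrillic 'о'
}

def homoglyph_perturb(text: str) -> str:
    """Swap each mapped Latin character for its Cyrillic look-alike.
    The output renders the same but has different code points."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
```

Because the perturbed string contains different code points, a filter matching on exact strings or tokens no longer recognizes it.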

Attack Types

Type        Access              Method
White-box   Full model weights  Gradient-based optimization (FGSM, PGD, C&W)
Black-box   API only            Transfer attacks, query-based optimization
Physical    Real world          Printed patches, adversarial clothing

Common Attack Algorithms

Algorithm   Speed                Effectiveness
FGSM        Fast (single step)   Moderate
PGD         Medium (iterative)   High
C&W         Slow (optimization)  Very high
AutoAttack  Slow (ensemble)      State of the art
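FGSM, the fastest of these, takes a single step of size epsilon in the direction of the loss gradient's sign. A minimal sketch against a logistic-regression model (the model and numbers are illustrative; real attacks target neural networks via autodiff frameworks):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step against logistic regression.

    For binary cross-entropy loss, the gradient of the loss with
    respect to the input x is (sigmoid(w.x + b) - y) * w.
    FGSM moves x by eps in the direction of that gradient's sign.
    """
    grad = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad)
```

A single step is enough here because the model is linear; PGD repeats this step iteratively with projection back into the eps-ball, which is why it is slower but stronger.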

Transfer Attacks

Adversarial examples crafted on one model often fool other models. This enables black-box attacks:

  1. Train or obtain a local surrogate model
  2. Craft adversarial examples on the surrogate (white-box)
  3. Apply them to the target model (black-box)

Transfer rates typically fall in the 30-70% range, depending on how similar the models are, which is high enough to make this a practical threat.
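The three steps above can be sketched end to end with two logistic-regression models standing in for the surrogate and the target (the weights are hypothetical, chosen to be similar but not identical, as two independently trained models would be):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, eps):
    """Craft an FGSM perturbation using full (white-box) access to
    a logistic-regression model's weights."""
    grad = (sigmoid(w @ x + b) - y) * w  # loss gradient w.r.t. x
    return x + eps * np.sign(grad)

# Step 1: a local surrogate and a black-box target with similar,
# but not identical, decision boundaries (illustrative values).
w_surrogate = np.array([1.0, -1.0])
w_target = np.array([0.9, -1.1])
```

The test of a transfer attack is that a perturbation computed only from the surrogate's gradients also flips the target's prediction, without ever querying the target's weights.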

Security Implications

  • Malware detection: Modify malware to evade ML-based AV
  • Spam/phishing: Craft messages that bypass ML filters
  • Fraud detection: Modify transactions to avoid flagging
  • Facial recognition: Evade identification systems