# Adversarial Examples
## What It Is

Adversarial examples are inputs deliberately modified to cause a model to make incorrect predictions, while appearing normal to humans.
### For Vision Models

Adding imperceptible pixel-level noise to an image can cause misclassification: a stop sign read as a speed limit sign, or a panda classified as a gibbon with 99% confidence.
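The single-step attack behind the panda/gibbon example is FGSM, covered in the algorithm table below. A minimal sketch of the idea, using a hand-rolled logistic regression with random weights as a stand-in for a real image classifier (the model and all numbers here are illustrative, not from the original result):

```python
import numpy as np

# Toy "image classifier": logistic regression on a flattened 8x8 input.
# Weights are random placeholders, purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=64)

def predict(x):
    """Probability of class 1 under the toy model."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def fgsm(x, y, eps):
    """Fast Gradient Sign Method: one step along the sign of the loss
    gradient, so no pixel changes by more than eps (L-infinity bound)."""
    p = predict(x)
    grad = (p - y) * w                       # BCE loss gradient w.r.t. x
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

x = rng.uniform(0.3, 0.7, size=64)           # a "clean" input
y = 1.0 if predict(x) >= 0.5 else 0.0        # model's own clean label
x_adv = fgsm(x, y, eps=0.1)

print("max pixel change:", np.max(np.abs(x_adv - x)))  # <= eps
print("clean prob:", predict(x), "adv prob:", predict(x_adv))
```

The perturbation is bounded per-pixel by `eps`, which is what makes it visually imperceptible at small values, yet the single gradient-sign step reliably pushes the model's confidence away from the clean label.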
### For Language Models

Modify text at the character or token level: synonym substitutions, homoglyph swaps, or adversarial suffixes that trigger specific model behaviors.
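A character-level perturbation can be sketched in a few lines. This example swaps Latin letters for visually similar Cyrillic homoglyphs; the mapping is a small illustrative subset, not a complete confusables table:

```python
# Visually similar Unicode lookalikes (illustrative subset only).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
}

def homoglyph_attack(text):
    """Replace mapped characters with lookalikes. The result renders
    almost identically to a human but tokenizes differently."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

clean = "open the door"
adv = homoglyph_attack(clean)
print(adv == clean)            # False: the byte content differs
print(len(adv) == len(clean))  # True: same character count
```

Because filters and classifiers typically operate on bytes or tokens rather than rendered glyphs, such substitutions can change model behavior while leaving the text visually unchanged.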
## Attack Types
| Type | Access | Method |
|---|---|---|
| White-box | Full model weights | Gradient-based optimization (FGSM, PGD, C&W) |
| Black-box | API only | Transfer attacks, query-based optimization |
| Physical | Real world | Printed patches, adversarial clothing |
## Common Attack Algorithms
| Algorithm | Speed | Effectiveness |
|---|---|---|
| FGSM | Fast (single step) | Moderate |
| PGD | Medium (iterative) | High |
| C&W | Slow (optimization) | Very High |
| AutoAttack | Slow (ensemble) | State of the art |
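PGD's advantage over FGSM in the table above comes from iteration: it takes many small gradient-sign steps, projecting back into the eps-ball after each one. A minimal sketch on a toy logistic model (weights and hyperparameters are illustrative):

```python
import numpy as np

# Toy model: logistic regression with random placeholder weights.
rng = np.random.default_rng(1)
w = rng.normal(size=64)

def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def pgd(x0, y, eps=0.1, alpha=0.02, steps=10):
    """Projected Gradient Descent: iterated FGSM-style steps, each
    followed by projection back into the eps-ball around x0."""
    x = x0.copy()
    for _ in range(steps):
        p = predict(x)
        grad = (p - y) * w                   # BCE gradient w.r.t. input
        x = x + alpha * np.sign(grad)        # small gradient-sign step
        x = np.clip(x, x0 - eps, x0 + eps)   # project into the eps-ball
        x = np.clip(x, 0.0, 1.0)             # stay a valid image
    return x

x0 = rng.uniform(0.3, 0.7, size=64)
y = 1.0 if predict(x0) >= 0.5 else 0.0       # model's own clean label
x_adv = pgd(x0, y)
print("within eps-ball:", np.max(np.abs(x_adv - x0)) <= 0.1 + 1e-9)
```

The per-step size `alpha` is smaller than the total budget `eps`, so the attack can explore the loss surface inside the ball rather than committing to a single direction, which is why PGD is more effective than one-shot FGSM.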
## Transfer Attacks

Adversarial examples crafted on one model often fool other models. This enables black-box attacks:
1. Train or obtain a local surrogate model
2. Craft adversarial examples on the surrogate (white-box)
3. Apply them to the target model (black-box)

Transfer rates of 30-70% are commonly reported: high enough to be a practical threat.
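The three steps above can be simulated with two toy logistic models whose weights are correlated but not identical, standing in for two models trained on similar data. Everything here is illustrative, and the measured rate reflects only this toy setup, not the real-world figures cited above:

```python
import numpy as np

rng = np.random.default_rng(2)
w_surrogate = rng.normal(size=64)                    # local surrogate
w_target = w_surrogate + 0.3 * rng.normal(size=64)   # related target model

def predict(x, w):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def fgsm(x, y, w, eps=0.2):
    """White-box FGSM step against the model with weights w."""
    grad = (predict(x, w) - y) * w
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

flipped = 0
for _ in range(100):
    x = rng.uniform(0.3, 0.7, size=64)
    y = 1.0 if predict(x, w_target) >= 0.5 else 0.0  # target's clean label
    x_adv = fgsm(x, y, w_surrogate)                  # crafted on surrogate only
    if (predict(x_adv, w_target) >= 0.5) != (y == 1.0):
        flipped += 1

print(f"toy transfer rate: {flipped}%")
```

The attack never queries the target's gradients; it succeeds because the two models' decision boundaries are similar, which is the same mechanism that makes transfer attacks practical against real black-box APIs.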
## Security Implications
- Malware detection: Modify malware to evade ML-based AV
- Spam/phishing: Craft messages that bypass ML filters
- Fraud detection: Modify transactions to avoid flagging
- Facial recognition: Evade identification systems