# Adversarial Examples
## What It Is

Adversarial examples are inputs deliberately modified to cause a model to make incorrect predictions, while appearing normal to humans.
### For Vision Models

Adding imperceptible pixel-level noise to an image can cause misclassification: a stop sign read as a speed limit sign, or a panda classified as a gibbon with 99% confidence.
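The single-step attack behind the panda/gibbon example is FGSM, covered in the algorithm table below. A minimal sketch of the idea, using a hand-rolled logistic regression with random weights as a stand-in for a real image classifier (the model and all numbers here are illustrative, not from the original result):

```python
import numpy as np

# Toy "image classifier": logistic regression on a flattened 8x8 input.
# Weights are random placeholders, purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=64)

def predict(x):
    """Probability of class 1 under the toy model."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def fgsm(x, y, eps):
    """Fast Gradient Sign Method: one step along the sign of the loss
    gradient, so no pixel changes by more than eps (L-infinity bound)."""
    p = predict(x)
    grad = (p - y) * w                       # BCE loss gradient w.r.t. x
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

x = rng.uniform(0.3, 0.7, size=64)           # a "clean" input
y = 1.0 if predict(x) >= 0.5 else 0.0        # model's own clean label
x_adv = fgsm(x, y, eps=0.1)

print("max pixel change:", np.max(np.abs(x_adv - x)))  # <= eps
print("clean prob:", predict(x), "adv prob:", predict(x_adv))
```

The perturbation is bounded per-pixel by `eps`, which is what makes it visually imperceptible at small values, yet the single gradient-sign step reliably pushes the model's confidence away from the clean label.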
### For Language Models

Modify text at the character or token level: synonym substitutions, homoglyph swaps, or adversarial suffixes that trigger specific model behaviors.
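A character-level perturbation can be sketched in a few lines. This example swaps Latin letters for visually similar Cyrillic homoglyphs; the mapping is a small illustrative subset, not a complete confusables table:

```python
# Visually similar Unicode lookalikes (illustrative subset only).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
}

def homoglyph_attack(text):
    """Replace mapped characters with lookalikes. The result renders
    almost identically to a human but tokenizes differently."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

clean = "open the door"
adv = homoglyph_attack(clean)
print(adv == clean)            # False: the byte content differs
print(len(adv) == len(clean))  # True: same character count
```

Because filters and classifiers typically operate on bytes or tokens rather than rendered glyphs, such substitutions can change model behavior while leaving the text visually unchanged.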
## Attack Types
| Type | Access | Method |
|---|---|---|
| White-box | Full model weights | Gradient-based optimization (FGSM, PGD, C&W) |
| Black-box | API only | Transfer attacks, query-based optimization |
| Physical | Real world | Printed patches, adversarial clothing |
## Common Attack Algorithms
| Algorithm | Speed | Effectiveness |
|---|---|---|
| FGSM | Fast (single step) | Moderate |
| PGD | Medium (iterative) | High |
| C&W | Slow (optimization) | Very High |
| AutoAttack | Slow (ensemble) | State of the art |
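PGD's advantage over FGSM in the table above comes from iteration: it takes many small gradient-sign steps, projecting back into the eps-ball after each one. A minimal sketch on a toy logistic model (weights and hyperparameters are illustrative):

```python
import numpy as np

# Toy model: logistic regression with random placeholder weights.
rng = np.random.default_rng(1)
w = rng.normal(size=64)

def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def pgd(x0, y, eps=0.1, alpha=0.02, steps=10):
    """Projected Gradient Descent: iterated FGSM-style steps, each
    followed by projection back into the eps-ball around x0."""
    x = x0.copy()
    for _ in range(steps):
        p = predict(x)
        grad = (p - y) * w                   # BCE gradient w.r.t. input
        x = x + alpha * np.sign(grad)        # small gradient-sign step
        x = np.clip(x, x0 - eps, x0 + eps)   # project into the eps-ball
        x = np.clip(x, 0.0, 1.0)             # stay a valid image
    return x

x0 = rng.uniform(0.3, 0.7, size=64)
y = 1.0 if predict(x0) >= 0.5 else 0.0       # model's own clean label
x_adv = pgd(x0, y)
print("within eps-ball:", np.max(np.abs(x_adv - x0)) <= 0.1 + 1e-9)
```

The per-step size `alpha` is smaller than the total budget `eps`, so the attack can explore the loss surface inside the ball rather than committing to a single direction, which is why PGD is more effective than one-shot FGSM.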
## Transfer Attacks

Adversarial examples crafted on one model often fool other models. This enables black-box attacks:
1. Train or obtain a local surrogate model
2. Craft adversarial examples on the surrogate (white-box)
3. Apply them to the target model (black-box)

Transfer rates of 30-70% are commonly reported: high enough to be a practical threat.
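The three steps above can be simulated with two toy logistic models whose weights are correlated but not identical, standing in for two models trained on similar data. Everything here is illustrative, and the measured rate reflects only this toy setup, not the real-world figures cited above:

```python
import numpy as np

rng = np.random.default_rng(2)
w_surrogate = rng.normal(size=64)                    # local surrogate
w_target = w_surrogate + 0.3 * rng.normal(size=64)   # related target model

def predict(x, w):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def fgsm(x, y, w, eps=0.2):
    """White-box FGSM step against the model with weights w."""
    grad = (predict(x, w) - y) * w
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

flipped = 0
for _ in range(100):
    x = rng.uniform(0.3, 0.7, size=64)
    y = 1.0 if predict(x, w_target) >= 0.5 else 0.0  # target's clean label
    x_adv = fgsm(x, y, w_surrogate)                  # crafted on surrogate only
    if (predict(x_adv, w_target) >= 0.5) != (y == 1.0):
        flipped += 1

print(f"toy transfer rate: {flipped}%")
```

The attack never queries the target's gradients; it succeeds because the two models' decision boundaries are similar, which is the same mechanism that makes transfer attacks practical against real black-box APIs.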
## Security Implications
- Malware detection: Modify malware to evade ML-based AV
- Spam/phishing: Craft messages that bypass ML filters
- Fraud detection: Modify transactions to avoid flagging
- Facial recognition: Evade identification systems