# Model Extraction

## What It Is
Model extraction (model stealing) creates a copy of a target model by querying its API and using the input-output pairs to train a functionally equivalent clone.
## How It Works

### Basic Extraction
- Send thousands of queries to the target API
- Collect input-output pairs
- Train a local model on these pairs (knowledge distillation)
- The clone mimics the target's behavior
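The loop above can be sketched in a few lines. This is a minimal toy, assuming a hypothetical `query_target` function standing in for the victim API (here a simple decision rule that returns only a hard 0/1 label); the clone is a logistic regression trained by gradient descent on the harvested pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the victim API: returns only a hard 0/1 label.
def query_target(x):
    return int(x.sum() > 0)

# Steps 1-2: send queries, collect input-output pairs
X = rng.normal(size=(5000, 10))
y = np.array([query_target(x) for x in X])

# Step 3: train a local clone (logistic regression) on the harvested pairs
w = np.zeros(10)
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w)))        # clone's predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(X)     # gradient step on mean log-loss

# Step 4: the clone now mimics the target's decision boundary
agreement = ((X @ w > 0).astype(int) == y).mean()
```

Against a real API the "target" would be a network call and the clone a neural network, but the structure (query, collect, fit, compare) is the same.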
### Advanced Extraction
If the API returns full probability distributions (or raw logits) instead of just the top label or token, extraction becomes dramatically more efficient: each response reveals the model's relative confidence across every class, far more information per query than a single discrete output.
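The efficiency gain comes from training on the full distribution (soft-label distillation) rather than on hard labels. A minimal sketch, assuming a hypothetical victim whose softmax probabilities `P` the API exposes; the clone minimizes cross-entropy against those soft targets:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical victim: a fixed softmax classifier whose full probability
# vector the API (unwisely) returns with each response.
W_true = rng.normal(size=(10, 3))
X = rng.normal(size=(2000, 10))
P = softmax(X @ W_true)            # soft labels: full distributions
y_hard = P.argmax(axis=1)          # what a hard-label API would return

# Distill with soft targets: gradient descent on cross-entropy H(P, Q)
W = np.zeros((10, 3))
for _ in range(500):
    Q = softmax(X @ W)
    W -= 0.1 * X.T @ (Q - P) / len(X)   # gradient of mean cross-entropy

agreement = (softmax(X @ W).argmax(axis=1) == y_hard).mean()
```

Because every query contributes a full probability vector instead of one bit of class identity, the clone converges with far fewer queries.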
## Resource Requirements
| Target Model Size | Queries Needed | Local Compute | API Cost |
|---|---|---|---|
| Small classifier | 10K-100K | 1 GPU, hours | $10-100 |
| Medium model | 100K-1M | 4 GPUs, days | $100-1K |
| Large LLM | 1M-10M+ | GPU cluster | $1K-10K+ |
## Why It Matters
- IP theft: A model representing millions to billions of dollars in training investment can be replicated for a tiny fraction of that cost
- Attack development: Clone the model locally to develop attacks in a white-box setting, then deploy against the real model
- Competitive advantage: Replicate a competitor's proprietary model
## Defenses
| Defense | How It Works | Weakness |
|---|---|---|
| Rate limiting | Cap queries per user/time | Multiple accounts |
| Output perturbation | Add noise to logits | Degrades legitimate service |
| Query monitoring | Detect extraction patterns | Sophisticated attackers mimic normal usage |
| Watermarking | Embed detectable signal | Only proves theft, doesn't prevent it |
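Output perturbation is the simplest of these to illustrate. A minimal sketch, assuming `probs` is a hypothetical model output: noise is added to the probability vector before it leaves the API, then the result is clipped and renormalized so it is still a valid distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturb(probs, scale=0.05):
    """Add noise to output probabilities before returning them to the caller,
    degrading the soft-label signal an extractor would harvest."""
    noisy = probs + rng.normal(scale=scale, size=probs.shape)
    noisy = np.clip(noisy, 1e-6, None)
    return noisy / noisy.sum()       # renormalize to a valid distribution

probs = np.array([0.7, 0.2, 0.1])
noisy = perturb(probs)
```

The trade-off in the table is visible here: small `scale` values usually preserve the top class (so legitimate users see the same prediction), but the fine-grained probabilities that make advanced extraction efficient are corrupted, and raising `scale` enough to stop extraction also degrades legitimate service.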