Model Extraction

What It Is

Model extraction (model stealing) creates a copy of a target model by querying its API and using the input-output pairs to train a functionally equivalent clone.

How It Works

Basic Extraction

  1. Send thousands of queries to the target API
  2. Collect input-output pairs
  3. Train a local model on these pairs (knowledge distillation)
  4. The clone mimics the target's behavior
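The four steps above can be sketched end to end. This is a minimal illustration, not a real attack: the "target" is a local scikit-learn model standing in for a remote API, and `query_target` is a hypothetical wrapper that, in practice, would be an HTTP call returning only the predicted label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for the victim's proprietary model behind an API.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
target = LogisticRegression().fit(X, y)

def query_target(inputs):
    """Simulates the API: returns only the top predicted label."""
    return target.predict(inputs)

# Steps 1-2: send queries, collect input-output pairs.
queries = rng.normal(size=(5000, 10))
labels = query_target(queries)

# Step 3: train a local clone on the collected pairs.
clone = DecisionTreeClassifier(max_depth=8).fit(queries, labels)

# Step 4: the clone mimics the target on fresh inputs.
test = rng.normal(size=(1000, 10))
agreement = (clone.predict(test) == target.predict(test)).mean()
print(f"clone/target agreement: {agreement:.2%}")
```

Note that the clone need not share the target's architecture: here a decision tree approximates a logistic regression purely from its query responses.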

Advanced Extraction

If the API returns full probability distributions (or raw logits) instead of just the top label or token, extraction becomes dramatically more efficient: a real-valued score vector carries far more information per query than a single discrete output.
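The efficiency gap can be made concrete with a toy case. In this sketch (all names illustrative, target again a local stand-in), the API exposes full probability vectors; because a logistic-regression target's log-odds are exactly linear in the input, a few hundred queries suffice to recover its decision boundary almost perfectly, where label-only extraction would need far more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
target = LogisticRegression().fit(X, y)

# A modest query budget; each answer is a full probability distribution.
queries = rng.normal(size=(300, 10))
probs = target.predict_proba(queries)             # shape (300, 2)

# Convert probabilities to log-odds, which are linear in the input for
# this target, then regress directly onto them (a simple distillation).
logits = np.log(probs[:, 1] / probs[:, 0])
soft_clone = LinearRegression().fit(queries, logits)

# The recovered linear function reproduces the target's decisions.
test = rng.normal(size=(1000, 10))
pred = (soft_clone.predict(test) > 0).astype(int)
agreement = (pred == target.predict(test)).mean()
print(f"soft-label clone agreement: {agreement:.2%}")
```

Real targets are of course not linear in their logits, but the same principle holds: soft outputs let the attacker fit the target's internal scores rather than only its argmax.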

Resource Requirements

| Target Model Size | Queries Needed | Local Compute | API Cost |
|---|---|---|---|
| Small classifier | 10K-100K | 1 GPU, hours | $10-100 |
| Medium model | 100K-1M | 4 GPUs, days | $100-1K |
| Large LLM | 1M-10M+ | GPU cluster | $1K-10K+ |

Why It Matters

  • IP theft: A model that cost millions to billions of dollars to train can be cloned for a fraction of that in API fees
  • Attack development: Clone the model locally to develop attacks in a white-box setting, then deploy against the real model
  • Competitive advantage: Replicate a competitor's proprietary model

Defenses

| Defense | How It Works | Weakness |
|---|---|---|
| Rate limiting | Cap queries per user/time | Multiple accounts |
| Output perturbation | Add noise to logits | Degrades legitimate service |
| Query monitoring | Detect extraction patterns | Sophisticated attackers mimic normal usage |
| Watermarking | Embed detectable signal | Only proves theft, doesn't prevent it |
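The output-perturbation defense is the most code-friendly of these to illustrate. A minimal sketch, assuming a server that returns probability vectors: noise is added before responding, then the vector is renormalized. Legitimate users usually still see the same top class, but the exact values no longer leak the model's internal scores with extraction-grade precision. The function name and noise scale are illustrative choices, not a standard API.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturb_probs(probs, eps=0.05):
    """Add Gaussian noise to probability vectors, then renormalize.

    The top class is usually preserved (ordinary users get the same
    answer), but the precise values an extractor would train on are
    corrupted. Larger eps means stronger defense and worse service.
    """
    noisy = probs + rng.normal(scale=eps, size=probs.shape)
    noisy = np.clip(noisy, 1e-6, None)          # keep entries positive
    return noisy / noisy.sum(axis=-1, keepdims=True)

clean = np.array([[0.7, 0.2, 0.1]])
noisy = perturb_probs(clean)
print(noisy)  # near the clean distribution, but no longer exact
```

This also makes the table's trade-off visible: the same `eps` that frustrates extraction degrades calibrated probabilities for legitimate downstream use.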