Model Extraction

What It Is

Model extraction (model stealing) creates a copy of a target model by querying its API and using the input-output pairs to train a functionally equivalent clone.

How It Works

Basic Extraction

  1. Send thousands of queries to the target API
  2. Collect input-output pairs
  3. Train a local model on these pairs (knowledge distillation)
  4. The clone mimics the target's behavior
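The four steps above can be sketched end to end. This is a minimal illustration, not a real attack: the "target" is a local scikit-learn model standing in for a remote API, and `query_target` is a hypothetical wrapper that, in practice, would be an HTTP call returning only the predicted label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for the victim's proprietary model behind an API.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
target = LogisticRegression().fit(X, y)

def query_target(inputs):
    """Simulates the API: returns only the top predicted label."""
    return target.predict(inputs)

# Steps 1-2: send queries, collect input-output pairs.
queries = rng.normal(size=(5000, 10))
labels = query_target(queries)

# Step 3: train a local clone on the collected pairs.
clone = DecisionTreeClassifier(max_depth=8).fit(queries, labels)

# Step 4: the clone mimics the target on fresh inputs.
test = rng.normal(size=(1000, 10))
agreement = (clone.predict(test) == target.predict(test)).mean()
print(f"clone/target agreement: {agreement:.2%}")
```

Note that the clone need not share the target's architecture: here a decision tree approximates a logistic regression purely from its query responses.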

Advanced Extraction

If the API returns full probability distributions (or raw logits) instead of just the top label or token, extraction becomes dramatically more efficient: a real-valued score vector carries far more information per query than a single discrete output.
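The efficiency gap can be made concrete with a toy case. In this sketch (all names illustrative, target again a local stand-in), the API exposes full probability vectors; because a logistic-regression target's log-odds are exactly linear in the input, a few hundred queries suffice to recover its decision boundary almost perfectly, where label-only extraction would need far more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
target = LogisticRegression().fit(X, y)

# A modest query budget; each answer is a full probability distribution.
queries = rng.normal(size=(300, 10))
probs = target.predict_proba(queries)             # shape (300, 2)

# Convert probabilities to log-odds, which are linear in the input for
# this target, then regress directly onto them (a simple distillation).
logits = np.log(probs[:, 1] / probs[:, 0])
soft_clone = LinearRegression().fit(queries, logits)

# The recovered linear function reproduces the target's decisions.
test = rng.normal(size=(1000, 10))
pred = (soft_clone.predict(test) > 0).astype(int)
agreement = (pred == target.predict(test)).mean()
print(f"soft-label clone agreement: {agreement:.2%}")
```

Real targets are of course not linear in their logits, but the same principle holds: soft outputs let the attacker fit the target's internal scores rather than only its argmax.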

Resource Requirements

| Target Model Size | Queries Needed | Local Compute | API Cost |
|---|---|---|---|
| Small classifier | 10K-100K | 1 GPU, hours | $10-100 |
| Medium model | 100K-1M | 4 GPUs, days | $100-1K |
| Large LLM | 1M-10M+ | GPU cluster | $1K-10K+ |

Why It Matters

  • IP theft: A model that cost millions to billions of dollars to train can be cloned for a fraction of that in API fees
  • Attack development: Clone the model locally to develop attacks in a white-box setting, then deploy against the real model
  • Competitive advantage: Replicate a competitor's proprietary model

Defenses

| Defense | How It Works | Weakness |
|---|---|---|
| Rate limiting | Cap queries per user/time | Multiple accounts |
| Output perturbation | Add noise to logits | Degrades legitimate service |
| Query monitoring | Detect extraction patterns | Sophisticated attackers mimic normal usage |
| Watermarking | Embed detectable signal | Only proves theft, doesn't prevent it |
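The output-perturbation defense is the most code-friendly of these to illustrate. A minimal sketch, assuming a server that returns probability vectors: noise is added before responding, then the vector is renormalized. Legitimate users usually still see the same top class, but the exact values no longer leak the model's internal scores with extraction-grade precision. The function name and noise scale are illustrative choices, not a standard API.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturb_probs(probs, eps=0.05):
    """Add Gaussian noise to probability vectors, then renormalize.

    The top class is usually preserved (ordinary users get the same
    answer), but the precise values an extractor would train on are
    corrupted. Larger eps means stronger defense and worse service.
    """
    noisy = probs + rng.normal(scale=eps, size=probs.shape)
    noisy = np.clip(noisy, 1e-6, None)          # keep entries positive
    return noisy / noisy.sum(axis=-1, keepdims=True)

clean = np.array([[0.7, 0.2, 0.1]])
noisy = perturb_probs(clean)
print(noisy)  # near the clean distribution, but no longer exact
```

This also makes the table's trade-off visible: the same `eps` that frustrates extraction degrades calibrated probabilities for legitimate downstream use.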