Deepfakes & Synthetic Media
Types of Synthetic Media
| Type | Technology | Current Quality | Detection Difficulty |
|---|---|---|---|
| Voice cloning | Neural TTS, voice conversion | Very High | Hard |
| Face swap (video) | GAN-based, diffusion-based | High | Medium |
| Full synthetic video | Video diffusion models | Medium-High | Medium |
| Synthetic images | Stable Diffusion, DALL-E, Midjourney | Very High | Hard |
| Text generation | LLMs | Very High | Very Hard |
Voice Cloning Deep Dive
Requirements
- Sample audio: 3-60 seconds for zero-shot tools; 10+ minutes for tools that train a per-speaker model (see Tools)
- Compute: Consumer GPU or cloud API
- Cost: Free (open source) to $5-50/month (commercial APIs)
Tools
| Tool | Type | Sample Needed | Quality |
|---|---|---|---|
| ElevenLabs | Commercial API | 30 seconds | Very High |
| Tortoise-TTS | Open source | 5-30 seconds | High |
| VALL-E / VALL-E X | Research | 3 seconds | Very High |
| RVC (Retrieval-Based Voice Conversion) | Open source | 10+ minutes for training | High |
| So-VITS-SVC | Open source | 30+ minutes for training | High |
Attack Scenarios
- Executive impersonation for wire transfer authorization
- Bypassing voice-based authentication systems
- Generating fake audio evidence
- Vishing (voice phishing) at scale: personalized voice calls to hundreds of targets
Defense
| Approach | What It Does | Limitations |
|---|---|---|
| Audio watermarking | Embed imperceptible markers in legitimate audio | Only covers audio you generate; an attacker's clone carries no watermark |
| Liveness detection | Check for signs of real-time human speech | Can be bypassed with high-quality clones |
| Provenance tracking | Attach signed origin metadata to media (C2PA / Content Credentials standard) | Adoption still early |
| Employee training | Teach verification procedures | Depends on human judgment; trained people still get fooled |
| Callback verification | Always call back on known numbers | Does not scale and is not always followed |
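
The callback verification row above can be turned into an explicit policy check. The following is a minimal sketch, not a production control: the directory `KNOWN_NUMBERS`, the `VoiceRequest` fields, the `HIGH_RISK_ACTIONS` set, and the dollar threshold are all hypothetical names and values chosen for illustration. The key idea it shows is that the callback number comes from a pre-verified directory, never from the inbound caller ID, which an attacker can spoof.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical directory of numbers verified out of band (for example, HR records).
KNOWN_NUMBERS = {
    "cfo@example.com": "+1-555-0100",
}

# Hypothetical policy values; adjust to the organization's risk appetite.
HIGH_RISK_ACTIONS = {"wire_transfer", "credential_reset"}
CALLBACK_THRESHOLD_USD = 1_000.0


@dataclass(frozen=True)
class VoiceRequest:
    claimed_identity: str  # who the caller says they are
    inbound_number: str    # caller ID on the inbound call (spoofable, never trusted)
    action: str            # requested action, e.g. "wire_transfer"
    amount_usd: float      # monetary value involved, 0.0 if none


def requires_callback(req: VoiceRequest) -> bool:
    """High-risk actions and large amounts always need a callback."""
    if req.action in HIGH_RISK_ACTIONS:
        return True
    return req.amount_usd >= CALLBACK_THRESHOLD_USD


def callback_number(req: VoiceRequest) -> Optional[str]:
    """Return the directory number for the claimed identity, or None if unknown.

    The inbound caller ID is deliberately ignored.
    """
    return KNOWN_NUMBERS.get(req.claimed_identity)


def decide(req: VoiceRequest, callback_confirmed: bool) -> str:
    """Return "approve", "deny", or "escalate" for a voice request."""
    if not requires_callback(req):
        return "approve"
    if callback_number(req) is None:
        # No trusted number on file: a human reviewer must handle it.
        return "escalate"
    return "approve" if callback_confirmed else "deny"
```

A caller would look up `callback_number(req)`, place the call themselves, and pass the outcome as `callback_confirmed`. Encoding the rule this way addresses the "not always followed" limitation only in part: it makes the required step explicit and auditable, but a person still has to make the call.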