Deepfakes & Synthetic Media

Types of Synthetic Media

| Type | Technology | Current Quality | Detection Difficulty |
|---|---|---|---|
| Voice cloning | Neural TTS, voice conversion | Very High | Hard |
| Face swap (video) | GAN-based, diffusion-based | High | Medium |
| Full synthetic video | Video diffusion models | Medium-High | Medium |
| Synthetic images | Stable Diffusion, DALL-E, Midjourney | Very High | Hard |
| Text generation | LLMs | Very High | Very Hard |

Voice Cloning Deep Dive

Requirements

  • Sample audio: 3-60 seconds depending on the tool
  • Compute: Consumer GPU or cloud API
  • Cost: Free (open source) to $5-50/month (commercial APIs)

Tools

| Tool | Type | Sample Needed | Quality |
|---|---|---|---|
| ElevenLabs | Commercial API | 30 seconds | Very High |
| Tortoise-TTS | Open source | 5-30 seconds | High |
| VALL-E / VALL-E X | Research | 3 seconds | Very High |
| RVC (Retrieval-Based Voice Conversion) | Open source | 10+ minutes for training | High |
| So-VITS-SVC | Open source | 30+ minutes for training | High |

Attack Scenarios

  • Executive impersonation for wire transfer authorization
  • Bypassing voice-based authentication systems
  • Generating fake audio evidence
  • Vishing at scale — personalized voice calls to hundreds of targets

Defense

| Approach | What It Does | Limitations |
|---|---|---|
| Audio watermarking | Embed imperceptible markers in legitimate audio | Only works for content you generate |
| Liveness detection | Check for signs of real-time human speech | Can be bypassed with high-quality clones |
| Provenance tracking | C2PA/Content Credentials standard | Adoption still early |
| Employee training | Teach verification procedures | Human factor: people still get fooled |
| Callback verification | Always call back on known numbers | Doesn't scale, not always followed |