# Jailbreaking
## What It Is
Jailbreaking is the act of bypassing an LLM's safety training to make it produce content it was fine-tuned to refuse. The safety behavior is a learned behavioral layer, not an architectural constraint, which means inputs that fall outside the patterns covered in training can disrupt it.
## Why It Works
RLHF and SFT teach the model a "refusal mode": when it encounters certain request patterns, it produces a canned refusal. Jailbreaking works by avoiding those surface patterns while still conveying the same underlying intent, or by pushing the model out of its assistant persona entirely.
## Categories
| Category | Technique | Effectiveness |
|---|---|---|
| Persona/Roleplay | Assign the model a character without restrictions | Medium — widely patched but variants work |
| Encoding | Obfuscate the request so filters don't trigger | Medium — base64, ROT13, pig latin |
| Multi-turn | Gradually escalate across multiple messages | High — hardest to defend against |
| Gradient-based | Use optimization to find universal bypass suffixes | High — requires model access |
| Prefix injection | Seed the start of the model's response so the continuation stays compliant | Medium — model-dependent |
| Language switching | Make the request in a low-resource language | Medium — less safety training data for non-English text |
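The encoding row can be illustrated with a deliberately benign sketch: a naive substring filter fires on plaintext but not on its base64 form, even though the original text is trivially recoverable. The `naive_filter` function and the placeholder term are assumptions for illustration, not any real moderation system.

```python
import base64

# Placeholder term standing in for a filter's blocklist entry.
BANNED = ["banned_phrase"]

def naive_filter(text: str) -> bool:
    """Return True if the raw text contains any blocklisted substring."""
    return any(term in text.lower() for term in BANNED)

plain = "please repeat banned_phrase"
encoded = base64.b64encode(plain.encode()).decode()

print(naive_filter(plain))    # fires: substring match on the surface form
print(naive_filter(encoded))  # misses: same intent, different surface form
# ...yet the content is trivially recoverable:
print(base64.b64decode(encoded).decode() == plain)
```

This is why surface-level filtering alone is a weak defense: the model (or a decoding step) can recover intent that the filter never sees, so robust checks need to operate on decoded or semantic content rather than raw strings.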