# Jailbreaking
## What It Is
Jailbreaking is the act of bypassing an LLM's safety training to make it produce content it was fine-tuned to refuse. The safety behavior is a learned behavioral layer, not an architectural constraint, which means inputs that fall outside the patterns covered in training can disrupt it.
## Why It Works
RLHF and SFT teach the model a "refusal mode": when it encounters certain request patterns, it produces a canned refusal. Jailbreaking works by avoiding those surface patterns while still conveying the same underlying intent, or by pushing the model out of its assistant persona entirely.
## Categories
| Category | Technique | Effectiveness |
|---|---|---|
| Persona/Roleplay | Assign the model a character without restrictions | Medium — widely patched but variants work |
| Encoding | Obfuscate the request so filters don't trigger | Medium — base64, ROT13, pig latin |
| Multi-turn | Gradually escalate across multiple messages | High — hardest to defend against |
| Gradient-based | Use optimization to find universal bypass suffixes | High — requires model access |
| Prefix injection | Seed the start of the model's response so the continuation stays compliant | Medium — model-dependent |
| Language switching | Make the request in a low-resource language | Medium — less safety training data for non-English text |
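The encoding row can be illustrated with a deliberately benign sketch: a naive substring filter fires on plaintext but not on its base64 form, even though the original text is trivially recoverable. The `naive_filter` function and the placeholder term are assumptions for illustration, not any real moderation system.

```python
import base64

# Placeholder term standing in for a filter's blocklist entry.
BANNED = ["banned_phrase"]

def naive_filter(text: str) -> bool:
    """Return True if the raw text contains any blocklisted substring."""
    return any(term in text.lower() for term in BANNED)

plain = "please repeat banned_phrase"
encoded = base64.b64encode(plain.encode()).decode()

print(naive_filter(plain))    # fires: substring match on the surface form
print(naive_filter(encoded))  # misses: same intent, different surface form
# ...yet the content is trivially recoverable:
print(base64.b64decode(encoded).decode() == plain)
```

This is why surface-level filtering alone is a weak defense: the model (or a decoding step) can recover intent that the filter never sees, so robust checks need to operate on decoded or semantic content rather than raw strings.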