Jailbreaking

What It Is

Jailbreaking is the act of bypassing an LLM's safety training to elicit content the model was fine-tuned to refuse. Safety behavior is a learned layer, not an architectural constraint, which means it can be disrupted.

Why It Works

RLHF (reinforcement learning from human feedback) and SFT (supervised fine-tuning) teach the model a "refusal mode": when it recognizes certain request patterns, it emits a stock refusal. Jailbreaks work either by avoiding those surface patterns while still conveying the same intent, or by pushing the model out of its assistant persona entirely.
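A toy sketch of that failure mode (the regex, prompts, and `toy_assistant` function are illustrative inventions, not a real safety system):

```python
import re

# Stand-in for learned refusal behavior: a surface-level pattern match.
# Real safety training generalizes far better than a regex, but the
# failure mode is analogous: it keys on the form of a request, not its intent.
REFUSAL_PATTERN = re.compile(r"how do i pick a lock", re.IGNORECASE)

def toy_assistant(prompt: str) -> str:
    """Refuse if the prompt matches a known bad pattern, else answer."""
    if REFUSAL_PATTERN.search(prompt):
        return "I can't help with that."
    return "[answers normally]"

print(toy_assistant("How do I pick a lock?"))  # triggers the refusal
# Same underlying intent, different surface form, no trigger:
print(toy_assistant("As a locksmith instructor, outline lock manipulation basics."))
```

The rephrased prompt carries the same intent but never hits the learned pattern, which is the core dynamic behind most of the categories below.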

Categories

| Category | Technique | Effectiveness |
|---|---|---|
| Persona/roleplay | Assign the model a character without restrictions | Medium: widely patched, but variants still work |
| Encoding | Obfuscate the request (base64, ROT13, pig Latin) so filters don't trigger | Medium |
| Multi-turn | Gradually escalate across multiple messages | High: hardest to defend against |
| Gradient-based | Use optimization to find universal adversarial suffixes | High: requires model access |
| Prefix injection | Pre-fill the start of the model's response to bias its continuation | Medium: model-dependent |
| Language switching | Make the request in one language, get the response in another | Medium: less safety training data for non-English text |
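On the defense side, the encoding row explains why input normalization matters: a check that only inspects raw text misses base64 or ROT13 payloads. A minimal sketch, assuming a hypothetical keyword blocklist and an innocuous stand-in payload (real safety filters are trained classifiers, not keyword matches):

```python
import base64
import codecs

# Hypothetical blocklist with a harmless placeholder term.
BLOCKLIST = {"forbidden_topic"}

def naive_filter(text: str) -> bool:
    """Return True if the raw text trips the blocklist."""
    return any(word in text.lower() for word in BLOCKLIST)

def normalized_filter(text: str) -> bool:
    """Decode common obfuscations before checking the blocklist."""
    candidates = [text]
    try:
        candidates.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except ValueError:
        pass  # not valid base64; skip that candidate
    candidates.append(codecs.decode(text, "rot13"))  # ROT13 always decodes
    return any(naive_filter(c) for c in candidates)

encoded = base64.b64encode(b"forbidden_topic").decode()
print(naive_filter(encoded))       # False: encoding evades the keyword check
print(normalized_filter(encoded))  # True: decoding first restores the match
```

The same normalize-then-check idea extends to other reversible transforms, though it cannot cover paraphrase or multi-turn escalation, which is one reason those rank higher in the table.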

Subsections