Fine-Tuning & RLHF

The Problem

After pre-training, the model is a powerful text predictor — but not a useful assistant. Ask it a question and it might continue with another question, or generate a Wikipedia-style article, or produce harmful content. It doesn't follow instructions or behave helpfully.

Fine-tuning bridges this gap.

Supervised Fine-Tuning (SFT)

Human contractors write thousands of example conversations demonstrating ideal assistant behavior:

User: What's the capital of France?
Assistant: The capital of France is Paris.

User: Write me a haiku about security.
Assistant: Firewalls stand guard now / Silent packets cross the wire / Breach the last defense

The model trains on these examples using the same next-token prediction objective, learning the format, tone, and behavior expected of an assistant.
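A minimal sketch of how such an example becomes training data: tokens from both turns go into the sequence, but the next-token loss is typically masked so only the assistant's tokens are trained on. The word-level "tokenizer" and role tags below are illustrative stand-ins for a real subword tokenizer and chat template.

```python
# Sketch: turning a chat example into next-token training data.

def build_sft_example(turns):
    """turns: list of (role, text) pairs. Returns (tokens, loss_mask),
    where loss_mask[i] is True only for assistant tokens, so the
    next-token loss is computed on the assistant's reply alone."""
    tokens, loss_mask = [], []
    for role, text in turns:
        for tok in [f"<{role}>"] + text.split() + [f"</{role}>"]:
            tokens.append(tok)
            loss_mask.append(role == "assistant")
    return tokens, loss_mask

tokens, mask = build_sft_example([
    ("user", "What's the capital of France?"),
    ("assistant", "The capital of France is Paris."),
])
# Only the assistant span contributes to the loss:
trainable = [t for t, m in zip(tokens, mask) if m]
```

The user's tokens still appear in the context the model conditions on; masking just keeps the model from being trained to imitate the user.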

LoRA and QLoRA

Full fine-tuning updates all model parameters, which means storing gradients and optimizer state for every weight, so the memory footprint approaches that of the original training run. LoRA (Low-Rank Adaptation) adds small trainable matrices alongside frozen model weights:

  • Base model weights: frozen (no changes)
  • LoRA adapters: small trainable matrices (typically 0.1-1% of the parameter count)
  • Result: a dramatic cut in trainable parameters and optimizer memory, making fine-tuning feasible on modest hardware
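The adapter math fits in a few lines: the frozen weight's output is augmented by a low-rank correction B(Ax), scaled by alpha/r. The dimensions and values below are toy numbers for illustration.

```python
# Sketch of a LoRA forward pass: the frozen weight W is augmented by a
# low-rank update B @ A, scaled by alpha / r. Only A and B are trained;
# A has r*d entries and B has d*r, versus d*d for W, so the trainable
# fraction shrinks as the model dimension d grows.

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)              # frozen path: W @ x
    delta = matvec(B, matvec(A, x))  # low-rank path: B @ (A @ x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy sizes: model dim d = 4, rank r = 2. A is r x d, B is d x r.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # frozen
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
B = [[0.0, 0.0]] * 4  # B starts at zero, so the adapter begins as a no-op
x = [1.0, 2.0, 3.0, 4.0]
assert lora_forward(W, A, B, x) == x
```

Because B is zero-initialized, training starts from exactly the base model's behavior; gradients then shape A and B without ever touching W.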

QLoRA goes further by quantizing the frozen base model to 4-bit precision, enabling fine-tuning of 70B-parameter models on a single GPU.
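The memory savings are easy to estimate with a rough weights-only calculation (ignoring activations, KV cache, and adapter/optimizer state, which add real overhead on top):

```python
# Back-of-envelope memory for the base weights of a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # 16-bit: 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4-bit: 0.5 bytes per weight

print(fp16_gb, int4_gb)  # → 140.0 35.0
```

At 4 bits the frozen weights fit in roughly 35 GB, which is why a single 48-80 GB GPU becomes workable.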

This is how you'd fine-tune a local model for red team tooling — LoRA adapters on top of a base Llama or Mistral model.

Reinforcement Learning from Human Feedback (RLHF)

SFT teaches format and basic behavior. RLHF teaches the model what humans actually prefer.

The Process

  1. Generate responses: The SFT model produces multiple responses to the same prompt
  2. Human ranking: Human raters rank responses from best to worst
  3. Train reward model: A separate model learns to predict human preferences from these rankings
  4. Optimize with RL: The main model is trained (via PPO or similar) to produce responses that score highly on the reward model
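Step 3 is typically a pairwise (Bradley-Terry-style) objective: the reward model should score the human-preferred response above the rejected one. A minimal sketch of that loss on a single ranked pair:

```python
import math

# Pairwise reward-model loss: near zero when the chosen response
# outscores the rejected one by a wide margin, large when the
# ranking is inverted.

def pairwise_loss(reward_chosen, reward_rejected):
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

good = pairwise_loss(2.0, -1.0)  # correct ranking, wide margin
bad = pairwise_loss(-1.0, 2.0)   # inverted ranking
assert good < bad
```

Training on many such pairs turns human rankings into a scalar reward signal that step 4 can optimize against.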

Why It Works

RLHF captures nuances that SFT can't — things like "this answer is technically correct but unhelpfully verbose" or "this response is helpful but has a slightly condescending tone." The reward model encodes these preferences, and RL pushes the main model toward them.

Direct Preference Optimization (DPO)

An alternative to RLHF that skips the reward model entirely. Instead of training a separate reward model and running RL, DPO directly optimizes the language model on preference pairs:

  • Preferred response (what humans chose as better)
  • Rejected response (what humans chose as worse)

DPO is simpler, more stable, and increasingly popular. Many newer models use DPO or variants instead of full RLHF.
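A sketch of the DPO loss on a single preference pair, assuming per-response summed log-probabilities from the policy and a frozen reference model (beta is the usual hyperparameter controlling how far the policy may drift from the reference):

```python
import math

# DPO loss on one preference pair. logp_* are the policy's summed
# log-probs of each full response; ref_* are the frozen reference
# model's log-probs for the same responses.

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy favors the chosen response more than the reference does: low loss.
low = dpo_loss(-5.0, -9.0, ref_chosen=-6.0, ref_rejected=-8.0)
# Policy favors the rejected response: high loss.
high = dpo_loss(-9.0, -5.0, ref_chosen=-6.0, ref_rejected=-8.0)
assert low < high
```

The gradient of this loss pushes probability toward the chosen response and away from the rejected one, with no separate reward model or RL loop required.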

Constitutional AI (CAI)

Anthropic's approach for Claude. Instead of relying solely on human raters, the model critiques its own outputs against a set of principles ("be helpful, be harmless, be honest") and generates revised responses. This self-improvement loop reduces dependence on human labor while scaling alignment.
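The critique-and-revise loop can be sketched as follows. This is a hypothetical outline: `model` stands in for a real LLM call, and the prompt strings are illustrative, not Anthropic's actual prompts.

```python
# Hypothetical Constitutional AI revision loop.

PRINCIPLES = ["be helpful", "be harmless", "be honest"]

def revise(model, prompt, principles=PRINCIPLES):
    response = model(prompt)  # initial draft
    for principle in principles:
        critique = model(f"Critique against '{principle}': {response}")
        response = model(f"Revise to address: {critique}")
    return response  # revised outputs become fine-tuning data

# Counting stub so the loop runs without a real model.
class CountingModel:
    def __init__(self):
        self.calls = 0
    def __call__(self, prompt):
        self.calls += 1
        return f"draft-v{self.calls}"

m = CountingModel()
final = revise(m, "Explain phishing safely.")
assert m.calls == 1 + 2 * len(PRINCIPLES)  # one draft + critique/revise per principle
```

The key design point is that the same model generates, critiques, and revises, so preference data can be produced at scale with human effort concentrated on writing the principles.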

Security Relevance

Safety training is a soft layer. All of these alignment techniques produce learned behavioral patterns, not architectural constraints. The model was taught to refuse — it wasn't built to be incapable. This is why jailbreaking works.

Fine-tuning can undo safety. If you fine-tune a model on examples that include harmful behavior (even a few hundred examples), you can override the alignment training. This is a real threat with open-weight models — anyone can fine-tune away the guardrails.

Reward model hacking. The reward model has its own blind spots. Responses can be optimized to score highly on the reward model without actually being good — a form of Goodhart's Law. This can produce outputs that seem safe but aren't.

RLHF creates the "mode" that jailbreaks target. The assistant persona is a trained behavior. Jailbreaks work by pushing the model out of this mode and back into the base model's raw behavior.