# Self-Attention & Transformers

## Self-Attention in Plain Terms
For every token, the model asks: "Which other tokens in this sequence should I pay attention to right now?"
It scores every token against every other token. High score = high relevance. The result is a new representation of each token that incorporates context from the entire sequence.
## The Q, K, V Mechanism
For each token, the model computes three vectors from its embedding:
| Vector | Role | Analogy |
|---|---|---|
| Query (Q) | "What am I looking for?" | Your search query |
| Key (K) | "What do I contain?" | The index entry |
| Value (V) | "What information do I provide?" | The actual data |
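The three vectors are produced by learned linear projections of the token embedding. A minimal NumPy sketch, where all sizes are toy values and the projection matrices are random stand-ins (in a trained model they are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 8               # toy dimensions (hypothetical)

X = rng.standard_normal((seq_len, d_model))   # one embedding per token

# Learned projection matrices (random stand-ins here; trained in practice)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X @ W_q   # "What am I looking for?"
K = X @ W_k   # "What do I contain?"
V = X @ W_v   # "What information do I provide?"
```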
### The Math
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
- Q × K^T — dot product of query with every key. Produces attention scores.
- ÷ √d_k — scale down so large dot products don't saturate the softmax, which would make gradients vanish.
- softmax — normalize scores to sum to 1 (probability distribution).
- × V — weighted sum of value vectors based on attention weights.
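The four steps above map directly onto a few lines of NumPy. A sketch of scaled dot-product attention (function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # Q × K^T, scaled by √d_k
    # softmax: each row becomes a probability distribution over tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` sums to 1: it is the attention distribution for one token over all tokens.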
### Example
For the sentence "The hacker breached the firewall":
When processing the second "the", the model computes attention scores:
| Token | Attention Weight | Why |
|---|---|---|
| the (1st) | 0.05 | Low — generic word |
| hacker | 0.10 | Some relevance |
| breached | 0.35 | High — what happened? |
| the (2nd) | 0.05 | Self — less useful |
| firewall | 0.45 | Highest — what "the" refers to |
The output representation of "the" now contains information about "firewall" and "breached" — it knows it means "the firewall."
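The final step is literally a weighted average of value vectors. Using the weights from the table and made-up 2-dimensional value vectors (the numbers are invented for illustration):

```python
import numpy as np

# Attention weights from the table: the, hacker, breached, the, firewall
weights = np.array([0.05, 0.10, 0.35, 0.05, 0.45])

# Hypothetical 2-d value vectors, one per token (invented for illustration)
V = np.array([
    [0.1, 0.1],   # the (1st)
    [0.5, 0.2],   # hacker
    [0.2, 0.9],   # breached
    [0.1, 0.1],   # the (2nd)
    [0.9, 0.3],   # firewall
])

output = weights @ V   # dominated by "firewall" and "breached"
```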
## Multi-Head Attention
A single attention computation captures one type of relationship. Multi-head attention runs several attention operations in parallel, each with different learned Q/K/V projections:
- Head 1 might learn syntactic relationships (subject-verb)
- Head 2 might learn semantic relationships (what does "it" refer to?)
- Head 3 might learn positional proximity (nearby words)
- Head N might learn long-range dependencies
The outputs of all heads are concatenated and projected back to the model dimension.
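Concretely, each head projects into a smaller subspace (`d_model / num_heads`), attends there, and the results are concatenated and projected back. A sketch with random stand-in weights (learned in a real model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads        # each head works in a smaller subspace
    heads = []
    for _ in range(num_heads):
        # per-head projections (random stand-ins; learned in a real model)
        W_q = rng.standard_normal((d_model, d_head))
        W_k = rng.standard_normal((d_model, d_head))
        W_v = rng.standard_normal((d_model, d_head))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    W_o = rng.standard_normal((d_model, d_model))   # output projection
    return np.concatenate(heads, axis=-1) @ W_o     # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
out = multi_head_attention(X, num_heads=4, rng=rng)
```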
## Causal Masking
For autoregressive models (GPT, Claude, Llama), each token can only attend to tokens before it — not after. This is enforced with a causal mask that sets future positions to negative infinity before the softmax.
This is why LLMs can generate text left to right but can't "look ahead."
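The mask itself is a small trick: set every future position's score to negative infinity, and the softmax sends those weights to exactly zero. A sketch (the all-zero `scores` matrix stands in for real Q·K^T scores):

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # stand-in for raw Q·K^T scores

# Mask out future positions: row i may only attend to columns 0..i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                  # exp(-inf) = 0 after softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Row 0 attends only to itself; the last row attends uniformly to all four positions.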
## The Full Transformer Layer
One transformer layer consists of:
- Multi-head self-attention — context mixing between tokens
- Add & layer norm — residual connection + normalization (stabilizes training)
- Feed-forward network — two dense layers applied to each token independently
- Add & layer norm — another residual connection
Modern LLMs stack dozens of these layers (roughly 80 to 120 in the largest models). Each layer refines the representation.
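The four sub-steps can be sketched in NumPy. This follows the post-norm layout described above (as in the original Transformer); many modern LLMs instead apply the norm before each sub-layer. All weights and the stand-in attention function are hypothetical:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(X, attn_fn, W1, b1, W2, b2):
    # 1-2. self-attention, then residual add & layer norm
    X = layer_norm(X + attn_fn(X))
    # 3. feed-forward: two dense layers applied per token (ReLU between)
    ff = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    # 4. second residual add & layer norm
    return layer_norm(X + ff)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
X = rng.standard_normal((5, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
# identity stands in for the attention sub-layer to keep the sketch short
out = transformer_layer(X, attn_fn=lambda x: x, W1=W1, b1=b1, W2=W2, b2=b2)
```

Stacking a model is then just applying this function repeatedly, with fresh weights per layer.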
## Security Relevance
**Attention hijacking.** Prompt injection works partly because injected instructions can dominate the attention scores. If the attacker's text contains strong trigger words, the model's attention shifts away from the developer's instructions.
**Attention sinks.** Models tend to allocate disproportionate attention to certain positions (beginning of context, special tokens). This creates exploitable patterns.
**Layer-wise behavior.** Different attacks operate at different layer depths. Surface-level jailbreaks might exploit shallow layers (pattern matching), while reasoning-based attacks target deep layers (logic and planning).