Self-Attention & Transformers

Self-Attention in Plain Terms

For every token, the model asks: "Which other tokens in this sequence should I pay attention to right now?"

It scores every token against every other token. High score = high relevance. The result is a new representation of each token that incorporates context from the entire sequence.

The Q, K, V Mechanism

For each token, the model computes three vectors from its embedding:

| Vector    | Role                              | Analogy           |
|-----------|-----------------------------------|-------------------|
| Query (Q) | "What am I looking for?"          | Your search query |
| Key (K)   | "What do I contain?"              | The index entry   |
| Value (V) | "What information do I provide?"  | The actual data   |
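In code, Q, K, and V come from multiplying each token's embedding by three learned projection matrices. A minimal NumPy sketch (the sizes and random weights here are toy values, not a real model's):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                  # toy sizes; real models use hundreds+
x = rng.normal(size=(5, d_model))    # embeddings for 5 tokens

# Learned projection matrices (random here, trained in a real model)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = x @ W_q   # "what am I looking for?"
K = x @ W_k   # "what do I contain?"
V = x @ W_v   # "what information do I provide?"
```

Every token gets all three vectors; its query is compared against every token's key, and the values are what get mixed together.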

The Math

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
  1. Q × K^T — dot product of query with every key. Produces attention scores.
  2. ÷ √d_k — scale the scores so their magnitude doesn't grow with dimension, which would saturate the softmax and shrink its gradients.
  3. softmax — normalize scores to sum to 1 (probability distribution).
  4. × V — weighted sum of value vectors based on attention weights.
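The four steps above fit in a few lines of NumPy. This is a minimal single-sequence sketch (no batching, no masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, following the four steps above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 1-2: score, then scale
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: softmax per row
    return weights @ V, weights                     # step 4: weighted sum of values

# Toy inputs: 5 tokens, d_k = 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))
out, w = attention(Q, K, V)
```

Each row of `w` is one token's attention distribution over the whole sequence; it is non-negative and sums to 1.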

Example

For the sentence "The hacker breached the firewall":

When processing the second "the", the model computes attention scores:

| Token     | Attention weight | Why                             |
|-----------|------------------|---------------------------------|
| the (1st) | 0.05             | Low — generic word              |
| hacker    | 0.10             | Some relevance                  |
| breached  | 0.35             | High — what happened?           |
| the (2nd) | 0.05             | Self — less useful              |
| firewall  | 0.45             | Highest — what "the" refers to  |

The output representation of "the" now contains information about "firewall" and "breached" — it knows it means "the firewall."
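Concretely, the new representation is just the weighted sum from step 4. Using the illustrative weights above and made-up 2-D value vectors (both hypothetical, chosen only to show the mechanics):

```python
import numpy as np

# Illustrative attention weights from the table above
weights = np.array([0.05, 0.10, 0.35, 0.05, 0.45])

# Toy value vectors for each token (invented for this example)
V = np.array([
    [0.1, 0.0],   # the (1st)
    [0.2, 0.9],   # hacker
    [0.8, 0.3],   # breached
    [0.1, 0.0],   # the (2nd)
    [0.9, 0.7],   # firewall
])

output = weights @ V   # dominated by "firewall" and "breached"
```

The result lands close to the "firewall" and "breached" value vectors, because those carry 80% of the attention mass.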

Multi-Head Attention

A single attention computation captures one type of relationship. Multi-head attention runs several attention operations in parallel, each with different learned Q/K/V projections:

  • Head 1 might learn syntactic relationships (subject-verb)
  • Head 2 might learn semantic relationships (what does "it" refer to?)
  • Head 3 might learn positional proximity (nearby words)
  • Head N might learn long-range dependencies

The outputs of all heads are concatenated and projected back to the model dimension.
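A minimal NumPy sketch of that concatenate-and-project structure (random weights stand in for learned ones; real implementations compute all heads in one batched matmul rather than a loop):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads, rng):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Each head has its own learned Q/K/V projections (random here)
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    W_o = rng.normal(size=(d_model, d_model))    # output projection
    return np.concatenate(heads, axis=-1) @ W_o  # concat, project back to d_model

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))               # 5 tokens, d_model = 8
y = multi_head_attention(x, n_heads=4, rng=rng)
```

Note that each head works in a smaller subspace (d_model / n_heads), so the total compute stays roughly the same as one full-width head.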

Causal Masking

For autoregressive models (GPT, Claude, Llama), each token can only attend to tokens before it — not after. This is enforced with a causal mask that sets future positions to negative infinity before the softmax.

This is why LLMs can generate text left to right but can't "look ahead."
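The mask itself is a tiny operation. A sketch with uniform dummy scores, showing how -inf above the diagonal becomes exactly zero attention after the softmax:

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # stand-in for raw Q·K^T scores

# Causal mask: future positions (above the diagonal) get -inf
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

# Softmax turns -inf into exactly zero attention weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Row i of `weights` spreads its mass only over positions 0..i; with uniform scores, row 1 comes out as [0.5, 0.5, 0, 0].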

The Full Transformer Layer

One transformer layer consists of:

  1. Multi-head self-attention — context mixing between tokens
  2. Add & layer norm — residual connection + normalization (stabilizes training)
  3. Feed-forward network — two dense layers applied to each token independently
  4. Add & layer norm — another residual connection

Large modern LLMs stack dozens to over a hundred of these layers (GPT-3, for example, uses 96). Each layer refines the representation.
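The four sub-steps can be sketched as one function. This is a simplified single-head, bias-free version with random weights, just to make the data flow concrete (real layers use multi-head attention, learned parameters, and often pre-norm instead of the post-norm shown here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(x, rng):
    seq_len, d = x.shape
    # 1. self-attention (single head here for brevity)
    W_q, W_k, W_v = (rng.normal(size=(d, d), scale=0.1) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V
    # 2. add & layer norm (residual connection)
    x = layer_norm(x + attn)
    # 3. feed-forward: two dense layers applied per token
    W1 = rng.normal(size=(d, 4 * d), scale=0.1)
    W2 = rng.normal(size=(4 * d, d), scale=0.1)
    ff = np.maximum(0, x @ W1) @ W2   # ReLU between the two layers
    # 4. add & layer norm again
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # 5 tokens, d_model = 8
y = transformer_layer(x, rng)
```

The input and output shapes match, which is what makes the layers stackable: the output of one layer feeds directly into the next.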

Security Relevance

Attention hijacking. Prompt injection works partly because injected instructions can dominate the attention scores. If the attacker's text contains strong trigger words, the model's attention shifts away from the developer's instructions.

Attention sinks. Models tend to allocate disproportionate attention to certain positions (beginning of context, special tokens). This creates exploitable patterns.

Layer-wise behavior. Different attacks operate at different layer depths. Surface-level jailbreaks might exploit shallow layers (pattern matching), while reasoning-based attacks target deep layers (logic and planning).