Embeddings & Positional Encoding
Embeddings
After tokenization, each token ID is converted into a dense vector — a list of numbers (typically 4,096 to 12,288 dimensions for large models). This is done via a lookup in the embedding matrix, a massive table learned during training.
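The lookup itself is just row indexing into a matrix. A minimal NumPy sketch (the matrix here is random and the sizes are toy values, not real model parameters):

```python
import numpy as np

# Toy sizes for illustration; real models use vocabularies of tens of
# thousands of tokens and thousands of embedding dimensions.
vocab_size, d_model = 1000, 8

# The embedding matrix is a learned lookup table: one row per token ID.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))

# Converting token IDs to dense vectors is plain row indexing.
token_ids = [42, 7, 42]
vectors = embedding_matrix[token_ids]  # shape: (3, 8)

# The same token ID always maps to the same vector.
assert np.array_equal(vectors[0], vectors[2])
```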
Why Vectors?
A token ID like 4523 is arbitrary — it tells the model nothing about meaning. The embedding vector encodes semantic relationships:
- Similar meanings → similar vectors. "Hacker" and "attacker" are close in embedding space.
- Different meanings → distant vectors. "Hacker" and "banana" are far apart.
- Relationships are directional. The vector from "king" to "queen" is roughly the same as "man" to "woman."
Embedding Arithmetic
This isn't a party trick — it's literal vector math:
embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
embedding("Paris") - embedding("France") + embedding("Germany") ≈ embedding("Berlin")
The model learns these relationships automatically from the statistical patterns in training data.
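The arithmetic can be sketched directly. The three-dimensional vectors below are hand-built to make the analogy visible; learned embeddings have thousands of dimensions and are nowhere near this clean:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: standard measure of closeness in embedding space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors chosen so that royalty and gender occupy separate dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.8, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

# king - man + woman lands nearest to queen.
result = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(result, emb[w]))
print(nearest)  # queen
```

With real embeddings the result rarely equals the target vector exactly; the analogy holds in the sense that "queen" is the nearest neighbor of the computed point.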
Dimensions
| Model | Embedding Dimensions |
|---|---|
| GPT-2 | 768 |
| GPT-3 | 12,288 |
| Llama 2 7B | 4,096 |
| Llama 2 70B | 8,192 |
| Claude (estimated) | 8,192+ |
More dimensions give the model more capacity to represent nuances of meaning, at the cost of more memory and compute.
Positional Encoding
Embeddings alone have no concept of word order. "Dog bites man" and "man bites dog" produce the same set of embedding vectors — just in a different order. The model needs to know where each token sits in the sequence.
How It Works
Each position in the sequence (0, 1, 2, ...) gets its own vector, which is added to the token embedding. The combined vector now encodes both what the token is and where it is.
Methods
Sinusoidal (original transformer): Uses sine and cosine functions at different frequencies. Position 0 gets one pattern, position 1 gets another, etc. Fixed — not learned.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
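The two formulas above translate directly into code. A minimal NumPy sketch that fills even dimensions with sines and odd dimensions with cosines:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Fixed (non-learned) sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1, column
    i = np.arange(d_model // 2)[None, :]     # dimension-pair index, row
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)             # odd dims:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
# Sanity check: at position 0, sin(0) = 0 and cos(0) = 1.
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```

Each position gets a unique pattern across frequencies, and because the functions are fixed, encodings for positions beyond the training length can still be computed.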
Learned positional embeddings: A trainable embedding matrix for positions, just like the token embeddings. Used by GPT-2 and GPT-3, among others.
RoPE (Rotary Position Embedding): Used by Llama, Mistral, and many recent models. Encodes position as a rotation in embedding space. Enables better generalization to longer sequences than seen during training.
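RoPE's "rotation in embedding space" can be sketched in a few lines. This is a simplified illustration, not a production implementation (real models apply it to query and key vectors inside attention): consecutive dimension pairs are rotated by position-dependent angles, which makes dot products depend only on the *relative* distance between two positions.

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate consecutive dimension pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)  # one frequency per pair
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]                     # split into (even, odd) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Key property: shifting both positions by the same offset leaves the
# dot product unchanged, because rotations compose by angle difference.
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 13) @ rope(k, 17)
assert np.isclose(s1, s2)
```

That relative-position property is one reason RoPE generalizes better to sequence lengths beyond those seen during training.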
Security Relevance
Embedding similarity enables transfer attacks. If two inputs have similar embeddings, they may trigger similar model behavior — even if the surface text looks different.
Positional attacks. Instructions placed at the beginning or end of the context window tend to carry more weight than instructions buried in the middle (the "lost in the middle" phenomenon). Attackers exploit this by front-loading injected instructions.
Embedding inversion. Given a model's embeddings (e.g., from a vector database), it's possible to approximately reconstruct the original text — a privacy risk for RAG systems storing sensitive documents.