Tokenization
What It Is
Tokenization converts raw text into a sequence of integer IDs that the model can process. Neural networks can't read — they only understand numbers. The tokenizer is the translation layer.
How BPE (Byte-Pair Encoding) Works
Most modern LLMs use Byte-Pair Encoding or a variant (SentencePiece, tiktoken). The algorithm:
- Start with individual characters (or, in byte-level BPE, raw bytes) as the initial vocabulary
- Count every adjacent pair of tokens across the entire corpus
- Merge the most frequent pair into a single new token
- Repeat until vocabulary reaches target size (typically 32K–100K tokens)
The result: common words become single tokens, rare words get split into subword pieces.
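The merge loop above can be sketched in a few dozen lines of plain Python. This is a minimal, illustrative trainer, not any production implementation; the corpus string and the function name `bpe_train` are made up for the example.

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int):
    """Toy BPE trainer: learn `num_merges` merge rules from a tiny corpus."""
    # Represent each word as a tuple of single-character tokens.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] = freq
        words = new_words
    return merges

merges = bpe_train("low lower lowest low low", 3)
print(merges)  # first merges capture the frequent 'lo' / 'low' prefix
```

Because "low" is the most frequent substring in this toy corpus, the first merges build it up character by character, which is exactly how common words end up as single tokens.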
Examples
| Input Text | Tokens | Token Count |
|---|---|---|
| the cat sat | [the] [cat] [sat] | 3 |
| cybersecurity | [cyber] [security] | 2 |
| defenestration | [def] [en] [est] [ration] | 4 |
| こんにちは | [こん] [にち] [は] | 3 |
| SELECT * FROM | [SELECT] [ *] [ FROM] | 3 |
Key Properties
Tokens are not words. They're subword units. Whitespace, punctuation, and even partial words can be individual tokens.
Common words are cheap. "the", "and", "is" are single tokens. Rare or technical words cost more tokens.
Non-English text is expensive. Most tokenizer vocabularies are trained on English-heavy corpora, so other languages and scripts require more tokens per character.
Code tokenizes differently than prose. Variable names, operators, and indentation patterns all affect token counts.
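One reason non-English text costs more is visible with nothing but the standard library: byte-level tokenizers start from UTF-8 bytes, and non-ASCII scripts use several bytes per character before any merges apply. Exact token counts depend on the specific tokenizer; this sketch only shows the byte-level floor.

```python
# Compare character count vs. UTF-8 byte count for different scripts.
# A byte-level BPE tokenizer starts from these bytes, so more bytes per
# character means more raw material that merges must compress.
for text in ["hello", "こんにちは", "مرحبا"]:
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars -> {len(raw)} UTF-8 bytes")
```

Five Japanese characters occupy 15 bytes where five ASCII characters occupy 5, so even before vocabulary effects, the tokenizer has three times as much input to cover.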
Tokenizer Differences by Model
| Model Family | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / ChatGPT | tiktoken (cl100k_base) | ~100K |
| Claude | SentencePiece (custom) | ~100K |
| Llama 2 | SentencePiece (BPE) | 32K |
| Llama 3 | tiktoken-style BPE | 128K |
| Mistral | SentencePiece (BPE) | 32K |
Security Relevance
Token-level manipulation. Adversarial attacks can exploit tokenization boundaries. Two strings that look similar to humans may tokenize completely differently, and vice versa.
Context window limits. Every model has a maximum context window measured in tokens. Stuffing the context with padding tokens can push legitimate instructions out of the window.
Token smuggling. Some jailbreak techniques encode malicious instructions at the token level — using Unicode characters, zero-width spaces, or homoglyphs that tokenize into different sequences than expected.
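The homoglyph and zero-width-space tricks can be seen at the byte level with plain Python. The strings below render identically (or nearly so) to a human reviewer, yet they are different byte sequences, so any byte-level tokenizer necessarily produces different token IDs for each. The example strings are illustrative, not drawn from a real attack.

```python
plain = "admin"
spoofed = "\u0430dmin"  # Cyrillic 'а' (U+0430) in place of Latin 'a'
zwsp = "ad\u200bmin"    # zero-width space (U+200B) inserted mid-word

for s in (plain, spoofed, zwsp):
    # .hex() exposes the UTF-8 bytes a byte-level tokenizer would see.
    print(f"{s!r} -> bytes {s.encode('utf-8').hex()}")

# Visually similar, but not equal as strings or as byte sequences.
print(plain == spoofed, plain == zwsp)  # False False
```

A filter that string-matches "admin" passes both spoofed variants, while the model receives token sequences it may or may not treat like the original word; that gap is what token smuggling exploits.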
Prompt injection via tokenization. If a system prompt uses tokens that the model treats differently than user input tokens, an attacker might exploit this asymmetry.
Hands-On
Check how text tokenizes using OpenAI's tokenizer tool:
https://platform.openai.com/tokenizer
Or programmatically with Python:
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("The hacker breached the firewall")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)}")

# Decode each token to see the splits
for t in tokens:
    print(f"  {t} → '{enc.decode([t])}'")
```