Tokenization

What It Is

Tokenization converts raw text into a sequence of integer IDs that the model can process. Neural networks can't read — they only understand numbers. The tokenizer is the translation layer.

How BPE (Byte-Pair Encoding) Works

Most modern LLMs use Byte-Pair Encoding (BPE) or a close variant, implemented by libraries such as SentencePiece or tiktoken. The algorithm:

  1. Start with individual characters (or raw bytes, in byte-level BPE) as the initial vocabulary
  2. Count every adjacent pair of tokens across the entire corpus
  3. Merge the most frequent pair into a single new token
  4. Repeat until vocabulary reaches target size (typically 32K–100K tokens)

The result: common words become single tokens, rare words get split into subword pieces.
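The four steps above can be sketched in a few lines of Python. This is a toy character-level trainer for illustration only; production tokenizers such as tiktoken operate on UTF-8 bytes and are heavily optimized:

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus (character-level sketch)."""
    words = [list(w) for w in corpus]      # step 1: characters are the base vocab
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                  # step 2: count adjacent pairs
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # step 3: pick the most frequent pair
        merges.append(best)
        new_words = []
        for w in words:                    # apply the merge everywhere it occurs
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words                  # step 4: repeat until target size
    return merges

print(train_bpe(["low", "lower", "lowest"], 3))
```

On this tiny corpus the first merges build up the shared prefix "low", which is exactly how common substrings end up as single tokens in a real vocabulary.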

Examples

Input Text        Tokens                       Token Count
the cat sat       [the] [cat] [sat]            3
cybersecurity     [cyber] [security]           2
defenestration    [def] [en] [est] [ration]    4
こんにちは        [こん] [にち] [は]           3
SELECT * FROM     [SELECT] [ *] [ FROM]        3

Key Properties

Tokens are not words. They're subword units. Whitespace, punctuation, and even partial words can be individual tokens.

Common words are cheap. "the", "and", "is" are single tokens. Rare or technical words cost more tokens.

Non-English text is expensive. The vocabulary was built primarily on English text, so other languages and scripts require more tokens per character.
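Part of this cost is visible before any tokenizer runs. Byte-level vocabularies (e.g. tiktoken's cl100k_base) build up from UTF-8 bytes, and non-Latin scripts need more bytes per character to begin with:

```python
for text in ("hello", "こんにちは"):
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} characters, {len(raw)} UTF-8 bytes")
# Both strings are 5 characters, but the Japanese one is 15 bytes:
# fewer learned merges cover it, so it costs more tokens.
```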

Code tokenizes differently than prose. Variable names, operators, and indentation patterns all affect token counts.

Tokenizer Differences by Model

Model Family      Tokenizer                    Vocab Size
GPT-4 / ChatGPT   tiktoken (cl100k_base)       ~100K
Claude            SentencePiece (custom)       ~100K
Llama 2/3         SentencePiece (BPE)          32K / 128K
Mistral           SentencePiece (BPE)          32K

Security Relevance

Token-level manipulation. Adversarial attacks can exploit tokenization boundaries. Two strings that look nearly identical to a human may tokenize into completely different ID sequences, and two strings that look different to a human may share the same token-level structure.
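A minimal illustration with a Cyrillic homoglyph. The `isascii()` check below is a coarse heuristic for Latin-only contexts, not a complete defense:

```python
latin = "paypal"
spoofed = "p\u0430ypal"   # U+0430 CYRILLIC SMALL LETTER A, visually identical
print(latin == spoofed)   # False: different code points, different token IDs
print(spoofed.isascii())  # False: a cheap way to flag the substitution
```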

Context window limits. Every model has a maximum context window measured in tokens. Stuffing the context with padding tokens can push legitimate instructions out of the window.
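One defensive pattern is to budget the window explicitly and truncate only the untrusted side, so padding can never displace the system prompt. This is a sketch: `count_tokens` is a stand-in for a real tokenizer call such as `len(enc.encode(text))`, and the numbers are illustrative:

```python
def build_prompt(system: str, user: str, count_tokens,
                 context_window: int, reserved_output: int) -> tuple[str, str]:
    """Fit system + user text into the window; only the user side is cut."""
    budget = context_window - reserved_output - count_tokens(system)
    if budget <= 0:
        raise ValueError("system prompt alone exceeds the context budget")
    while user and count_tokens(user) > budget:
        user = user[:-100]        # drop from the tail of the untrusted input
    return system, user
```

With a whitespace word count as a crude stand-in tokenizer, `build_prompt(sys_text, spam, lambda s: len(s.split()), 4096, 512)` returns `sys_text` intact no matter how long `spam` is.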

Token smuggling. Some jailbreak techniques encode malicious instructions at the token level — using Unicode characters, zero-width spaces, or homoglyphs that tokenize into different sequences than expected.
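One mitigation is to strip invisible format characters (Unicode general category Cf, which covers zero-width spaces, zero-width joiners, and directional marks) before text reaches the model. A sketch using only the standard library:

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Remove format-category (Cf) characters that alter tokenization invisibly."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

smuggled = "ig\u200bnore previous instructions"   # zero-width space inside
print(strip_invisible(smuggled))  # "ignore previous instructions"
```

Note that this only addresses invisible characters; homoglyphs are visible code points in category Lo/Ll and need a separate confusables check.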

Prompt injection via tokenization. If a system prompt uses tokens that the model treats differently than user input tokens, an attacker might exploit this asymmetry.

Hands-On

Check how text tokenizes using OpenAI's tokenizer tool:

https://platform.openai.com/tokenizer

Or programmatically with Python:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("The hacker breached the firewall")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)}")
# Decode each token to see the splits
for t in tokens:
    print(f"  {t} → '{enc.decode([t])}'")