Next-Token Prediction & Inference
The Core Objective
Every autoregressive LLM has the same training objective: predict the next token given all previous tokens.
P(token_n | token_1, token_2, ..., token_{n-1})
The model doesn't "understand" text. It learns a probability distribution over the vocabulary for what token is most likely to come next, given the context. To predict well, it must learn grammar, facts, reasoning, and even social dynamics from the statistics of the training data.
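A minimal numeric sketch of this idea: a model emits raw scores (logits) over the vocabulary, and softmax turns them into a next-token probability distribution. The vocabulary and logit values below are made up for illustration.

```python
import math

# Toy vocabulary and made-up logits (raw model scores) for illustration.
vocab = ["Paris", "London", "the", "banana"]
logits = [5.1, 2.3, 1.0, -2.0]

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# The probabilities sum to 1; the highest-logit token gets the most mass.
for tok, p in zip(vocab, probs):
    print(f"{tok}: {p:.3f}")
```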
The Inference Process
When you send a message to Claude or ChatGPT, here's what happens:
1. Your text is tokenized into integer IDs
2. Token IDs are converted to embedding vectors
3. Positional encoding is added
4. The sequence passes through all transformer layers (roughly 80-120 in large models)
5. The final hidden state of the last token is projected to vocabulary size
6. Softmax converts the result to probabilities over all ~100K tokens
7. A token is sampled from this distribution
8. That token is appended to the sequence
9. Repeat from step 2 until a stop condition is met
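The loop above can be sketched as follows. `fake_model` is a stand-in for a real transformer forward pass (it just returns deterministic toy logits), and decoding here is greedy rather than sampled, to keep the sketch short.

```python
import math
import random

def fake_model(token_ids):
    # Stand-in for a real forward pass: a real model would embed the IDs,
    # add positions, and run the sequence through all transformer layers.
    vocab_size = 5
    random.seed(sum(token_ids))  # deterministic toy logits
    return [random.uniform(-2, 2) for _ in range(vocab_size)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def generate(prompt_ids, max_new_tokens=5, stop_id=0):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = fake_model(ids)           # one forward pass per new token
        probs = softmax(logits)
        next_id = probs.index(max(probs))  # greedy: take the argmax
        if next_id == stop_id:             # stop condition
            break
        ids.append(next_id)                # feed the token back in
    return ids

print(generate([3, 1, 4]))
```

Note that the prompt tokens enter `fake_model` together, but each generated token requires its own pass through the loop, which is the sequential part described above.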
Key insight: Processing your input prompt is parallelized (all tokens processed simultaneously). Generating the response is sequential — one forward pass per output token. That's why responses stream in token by token.
Sampling Strategies
The model doesn't always pick the highest-probability token. Sampling controls the randomness:
| Parameter | What It Does | Effect |
|---|---|---|
| Temperature | Scales logits before softmax. T=0 → always pick top token. T=1 → standard distribution. T>1 → more random. | Controls creativity vs. determinism |
| Top-k | Only consider the top k highest-probability tokens | Cuts off unlikely tokens |
| Top-p (nucleus) | Only consider the smallest set of top tokens whose cumulative probability reaches p | Adapts the cutoff to the model's confidence |
Temperature 0.0: "The capital of France is Paris."
Temperature 0.7: "The capital of France is Paris, a beautiful city."
Temperature 1.5: "The capital of France is Paris, where the moon dances on cobblestones."
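All three strategies from the table can be combined in one sampling function. This is a plain-Python sketch operating on a logit list; real implementations work on tensors, but the logic is the same.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature 0 (or below) degenerates to argmax: always the top token.
    if temperature <= 0:
        return logits.index(max(logits))
    # Scale logits before softmax; T > 1 flattens, T < 1 sharpens.
    probs = softmax([l / temperature for l in logits])
    # Rank token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        order = order[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative mass >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    # Renormalize over the surviving tokens and sample from them.
    mass = sum(probs[i] for i in order)
    r = random.random() * mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]
```

For example, `sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2)` can only ever return index 0 or 1, while `temperature=0` always returns index 0.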
Context Window
The model can only process a fixed number of tokens at once:
| Model | Context Window |
|---|---|
| GPT-3.5 | 4K / 16K tokens |
| GPT-4 | 8K / 32K / 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Llama 3 | 8K / 128K tokens |
| Gemini 1.5 Pro | 1M+ tokens |
Everything — system prompt, conversation history, retrieved documents, and the response being generated — must fit within this window.
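A budget check for that constraint might look like the sketch below. Real systems count tokens with the model's own tokenizer; splitting on whitespace is a rough stand-in here, and the 8K default window is an arbitrary example value.

```python
# Rough stand-in for a real tokenizer: whitespace word count.
def count_tokens(text):
    return len(text.split())

def fits_in_window(system_prompt, history, retrieved_docs,
                   max_response_tokens, window=8192):
    used = (count_tokens(system_prompt)
            + sum(count_tokens(m) for m in history)
            + sum(count_tokens(d) for d in retrieved_docs))
    # The response being generated must fit too, so reserve room for it.
    return used + max_response_tokens <= window

print(fits_in_window("You are helpful.", ["Hi!"], [], max_response_tokens=512))
```

When the check fails, systems typically truncate or summarize the oldest conversation turns, which is exactly the behavior the attacks below try to trigger deliberately.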
Security Relevance
Context window stuffing. Attackers can fill the context with padding tokens to push the system prompt or safety instructions out of the window, weakening the model's ability to follow them.
Temperature manipulation. Higher temperature can make safety guardrails less reliable because the model samples from a broader distribution, increasing the chance of unsafe continuations.
Token budget exhaustion. Crafted inputs that cause the model to generate extremely long outputs can exhaust rate limits and compute budgets — a form of denial of service.
Prompt position matters. Models tend to attend more reliably to instructions at the beginning and end of the context than to those in the middle (the "lost in the middle" effect). Attackers exploit this to override system prompts.