Next-Token Prediction & Inference

The Core Objective

Every autoregressive LLM has the same training objective: predict the next token given all previous tokens.

P(token_n | token_1, token_2, ..., token_{n-1})

The model doesn't "understand" text. It learns a probability distribution over the vocabulary for what token is most likely to come next, given the context. To predict well, it must learn grammar, facts, reasoning, and even social dynamics from the statistics of the training data.
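A minimal sketch of that distribution, using a toy four-word vocabulary and made-up logits (the raw scores a real model would output before normalization). The vocabulary and logit values here are invented for illustration:

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and invented logits for the context "The capital of France is"
vocab = ["Paris", "London", "the", "a"]
logits = [5.0, 2.0, 1.0, 0.5]

probs = softmax(logits)
for tok, p in zip(vocab, probs):
    print(f"P({tok!r} | context) = {p:.3f}")
```

A real model does the same thing over a vocabulary of roughly 100K tokens, with logits produced by the transformer rather than hard-coded.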

The Inference Process

When you send a message to Claude or ChatGPT, here's what happens:

  1. Your text is tokenized into integer IDs
  2. Token IDs are converted to embedding vectors
  3. Positional encoding is added
  4. The sequence passes through all transformer layers (~80-120)
  5. The final hidden state of the last token is projected to vocabulary size
  6. Softmax converts to probabilities over all ~100K tokens
  7. A token is sampled from this distribution
  8. That token is appended to the sequence
  9. Repeat from step 2 (the new token is embedded and processed) until a stop condition is met
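The steps above can be sketched as a loop. The `fake_model` function below is a stand-in for steps 3-6 (the full forward pass); everything else mirrors the list directly. All names and the 5-token vocabulary are invented for illustration:

```python
import random

def fake_model(token_ids):
    """Stand-in for the transformer forward pass: returns a probability
    distribution over a toy 5-token vocabulary, conditioned on the input."""
    random.seed(sum(token_ids))          # deterministic per context, for demo only
    weights = [random.random() for _ in range(5)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_ids, max_new_tokens=8, stop_id=0):
    ids = list(prompt_ids)               # steps 1-2: prompt already tokenized
    for _ in range(max_new_tokens):
        probs = fake_model(ids)          # steps 3-6: one forward pass
        next_id = random.choices(range(len(probs)), weights=probs)[0]  # step 7
        ids.append(next_id)              # step 8: append to the sequence
        if next_id == stop_id:           # step 9: stop condition
            break
    return ids

print(generate([12, 7, 3]))
```

Note that each new token requires a full pass through the model, which is why generation cost scales with output length.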

Key insight: Processing your input prompt is parallelized (all tokens processed simultaneously). Generating the response is sequential — one forward pass per output token. That's why responses stream in token by token.

Sampling Strategies

The model doesn't always pick the highest-probability token. Sampling controls the randomness:

| Parameter | What it does | Effect |
|---|---|---|
| Temperature | Scales logits before softmax. T=0: always pick the top token. T=1: standard distribution. T>1: more random. | Controls creativity vs. determinism |
| Top-k | Only consider the k highest-probability tokens | Cuts off unlikely tokens |
| Top-p (nucleus) | Only consider the smallest set of tokens whose cumulative probability reaches p | Dynamically adjusts based on model confidence |
For example, the same prompt sampled at different temperatures:

Temperature 0.0: "The capital of France is Paris."
Temperature 0.7: "The capital of France is Paris, a beautiful city."
Temperature 1.5: "The capital of France is Paris, where the moon dances on cobblestones."
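The three strategies in the table can be implemented in a few lines each. This is a simplified sketch over a hand-picked logit vector (real implementations work on tensors and combine these filters inside the sampling loop):

```python
import math

def apply_temperature(logits, T):
    """T < 1 sharpens the distribution; T > 1 flattens it; T = 0 is greedy."""
    if T == 0:                                   # greedy decoding: one-hot on the argmax
        out = [0.0] * len(logits)
        out[logits.index(max(logits))] = 1.0
        return out
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def top_k(probs, k):
    """Zero out everything but the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    s = sum(filtered)
    return [p / s for p in filtered]

def top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    s = sum(filtered)
    return [q / s for q in filtered]

logits = [3.0, 2.0, 1.0, 0.1]                    # invented example logits
print(apply_temperature(logits, 0.7))            # sharper than T=1
print(top_k(apply_temperature(logits, 1.0), 2))  # only 2 tokens survive
print(top_p(apply_temperature(logits, 1.0), 0.9))
```

Note that top-p adapts to the model's confidence: when one token dominates, the nucleus may contain a single token; when the distribution is flat, it keeps many.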

Context Window

The model can only process a fixed number of tokens at once:

| Model | Context window |
|---|---|
| GPT-3.5 | 4K / 16K tokens |
| GPT-4 | 8K / 32K / 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Llama 3 | 8K / 128K tokens |
| Gemini 1.5 Pro | 1M+ tokens |

Everything — system prompt, conversation history, retrieved documents, and the response being generated — must fit within this window.
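A common way to enforce that budget is to evict the oldest conversation turns until everything fits, while reserving room for the response. The sketch below uses a crude characters-per-token heuristic; the function names and the 4-chars-per-token estimate are assumptions for illustration (production systems count with the model's actual tokenizer):

```python
def rough_token_count(text):
    """Crude heuristic: roughly 4 characters per token for English text.
    Real systems use the model's actual tokenizer to count exactly."""
    return max(1, len(text) // 4)

def fit_context(system_prompt, history, new_message, window=8000, reserve=1000):
    """Drop the oldest conversation turns until everything fits in the
    window, reserving `reserve` tokens for the model's response."""
    budget = window - reserve
    fixed = rough_token_count(system_prompt) + rough_token_count(new_message)
    kept = list(history)
    while kept and fixed + sum(rough_token_count(t) for t in kept) > budget:
        kept.pop(0)                      # evict the oldest turn first
    return [system_prompt] + kept + [new_message]

msgs = fit_context("You are helpful.",
                   ["old turn " * 500, "recent turn"],
                   "hi", window=2000)
print(len(msgs))                         # → 3: the long old turn was evicted
```

Note the design choice: the system prompt is never evicted, which matters for the stuffing attack described below.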

Security Relevance

Context window stuffing. Attackers can fill the context with padding tokens to push the system prompt or safety instructions out of the window, weakening the model's ability to follow them.

Temperature manipulation. Higher temperature can make safety guardrails less reliable because the model samples from a broader distribution, increasing the chance of unsafe continuations.

Token budget exhaustion. Crafted inputs that cause the model to generate extremely long outputs can exhaust rate limits and compute budgets — a form of denial of service.

Prompt position matters. Instructions at the beginning and end of the context window receive more attention than those in the middle. Attackers exploit this to override system prompts.