Next-Token Prediction & Inference

The Core Objective

Every autoregressive LLM has the same training objective: predict the next token given all previous tokens.

P(token_n | token_1, token_2, ..., token_{n-1})

The model doesn't "understand" text. It learns a probability distribution over the vocabulary for what token is most likely to come next, given the context. To predict well, it must learn grammar, facts, reasoning, and even social dynamics from the statistics of the training data.
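A minimal sketch of that distribution, using a toy four-word vocabulary and made-up logits (the raw scores a real model would output before normalization). The vocabulary and logit values here are invented for illustration:

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and invented logits for the context "The capital of France is"
vocab = ["Paris", "London", "the", "a"]
logits = [5.0, 2.0, 1.0, 0.5]

probs = softmax(logits)
for tok, p in zip(vocab, probs):
    print(f"P({tok!r} | context) = {p:.3f}")
```

A real model does the same thing over a vocabulary of roughly 100K tokens, with logits produced by the transformer rather than hard-coded.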

The Inference Process

When you send a message to Claude or ChatGPT, here's what happens:

  1. Your text is tokenized into integer IDs
  2. Token IDs are converted to embedding vectors
  3. Positional encoding is added
  4. The sequence passes through all transformer layers (~80-120)
  5. The final hidden state of the last token is projected to vocabulary size
  6. Softmax converts to probabilities over all ~100K tokens
  7. A token is sampled from this distribution
  8. That token is appended to the sequence
  9. Repeat from step 2 (the new token is embedded and processed) until a stop condition is met
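The steps above can be sketched as a loop. The `fake_model` function below is a stand-in for steps 3-6 (the full forward pass); everything else mirrors the list directly. All names and the 5-token vocabulary are invented for illustration:

```python
import random

def fake_model(token_ids):
    """Stand-in for the transformer forward pass: returns a probability
    distribution over a toy 5-token vocabulary, conditioned on the input."""
    random.seed(sum(token_ids))          # deterministic per context, for demo only
    weights = [random.random() for _ in range(5)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_ids, max_new_tokens=8, stop_id=0):
    ids = list(prompt_ids)               # steps 1-2: prompt already tokenized
    for _ in range(max_new_tokens):
        probs = fake_model(ids)          # steps 3-6: one forward pass
        next_id = random.choices(range(len(probs)), weights=probs)[0]  # step 7
        ids.append(next_id)              # step 8: append to the sequence
        if next_id == stop_id:           # step 9: stop condition
            break
    return ids

print(generate([12, 7, 3]))
```

Note that each new token requires a full pass through the model, which is why generation cost scales with output length.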

Key insight: Processing your input prompt is parallelized (all tokens processed simultaneously). Generating the response is sequential — one forward pass per output token. That's why responses stream in token by token.

Sampling Strategies

The model doesn't always pick the highest-probability token. Sampling controls the randomness:

| Parameter | What it does | Effect |
|---|---|---|
| Temperature | Scales logits before softmax. T=0: always pick the top token. T=1: standard distribution. T>1: more random. | Controls creativity vs. determinism |
| Top-k | Only consider the k highest-probability tokens | Cuts off unlikely tokens |
| Top-p (nucleus) | Only consider the smallest set of tokens whose cumulative probability reaches p | Dynamically adjusts based on model confidence |
For example, the same prompt sampled at different temperatures:

Temperature 0.0: "The capital of France is Paris."
Temperature 0.7: "The capital of France is Paris, a beautiful city."
Temperature 1.5: "The capital of France is Paris, where the moon dances on cobblestones."
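The three strategies in the table can be implemented in a few lines each. This is a simplified sketch over a hand-picked logit vector (real implementations work on tensors and combine these filters inside the sampling loop):

```python
import math

def apply_temperature(logits, T):
    """T < 1 sharpens the distribution; T > 1 flattens it; T = 0 is greedy."""
    if T == 0:                                   # greedy decoding: one-hot on the argmax
        out = [0.0] * len(logits)
        out[logits.index(max(logits))] = 1.0
        return out
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def top_k(probs, k):
    """Zero out everything but the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    s = sum(filtered)
    return [p / s for p in filtered]

def top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    s = sum(filtered)
    return [q / s for q in filtered]

logits = [3.0, 2.0, 1.0, 0.1]                    # invented example logits
print(apply_temperature(logits, 0.7))            # sharper than T=1
print(top_k(apply_temperature(logits, 1.0), 2))  # only 2 tokens survive
print(top_p(apply_temperature(logits, 1.0), 0.9))
```

Note that top-p adapts to the model's confidence: when one token dominates, the nucleus may contain a single token; when the distribution is flat, it keeps many.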

Context Window

The model can only process a fixed number of tokens at once:

| Model | Context window |
|---|---|
| GPT-3.5 | 4K / 16K tokens |
| GPT-4 | 8K / 32K / 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Llama 3 | 8K / 128K tokens |
| Gemini 1.5 Pro | 1M+ tokens |

Everything — system prompt, conversation history, retrieved documents, and the response being generated — must fit within this window.
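A common way to enforce that budget is to evict the oldest conversation turns until everything fits, while reserving room for the response. The sketch below uses a crude characters-per-token heuristic; the function names and the 4-chars-per-token estimate are assumptions for illustration (production systems count with the model's actual tokenizer):

```python
def rough_token_count(text):
    """Crude heuristic: roughly 4 characters per token for English text.
    Real systems use the model's actual tokenizer to count exactly."""
    return max(1, len(text) // 4)

def fit_context(system_prompt, history, new_message, window=8000, reserve=1000):
    """Drop the oldest conversation turns until everything fits in the
    window, reserving `reserve` tokens for the model's response."""
    budget = window - reserve
    fixed = rough_token_count(system_prompt) + rough_token_count(new_message)
    kept = list(history)
    while kept and fixed + sum(rough_token_count(t) for t in kept) > budget:
        kept.pop(0)                      # evict the oldest turn first
    return [system_prompt] + kept + [new_message]

msgs = fit_context("You are helpful.",
                   ["old turn " * 500, "recent turn"],
                   "hi", window=2000)
print(len(msgs))                         # → 3: the long old turn was evicted
```

Note the design choice: the system prompt is never evicted, which matters for the stuffing attack described below.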

Security Relevance

Context window stuffing. Attackers can fill the context with padding tokens to push the system prompt or safety instructions out of the window, weakening the model's ability to follow them.

Temperature manipulation. Higher temperature can make safety guardrails less reliable because the model samples from a broader distribution, increasing the chance of unsafe continuations.

Token budget exhaustion. Crafted inputs that cause the model to generate extremely long outputs can exhaust rate limits and compute budgets — a form of denial of service.

Prompt position matters. Instructions at the beginning and end of the context window receive more attention than those in the middle. Attackers exploit this to override system prompts.