Pre-Training

What It Is

Pre-training is the first and most expensive phase of building an LLM. The model learns to predict the next token on trillions of tokens of text, developing general language understanding, world knowledge, and reasoning capabilities.

The Training Objective

Causal language modeling: Given tokens 1 through n, predict token n+1.

The loss function is cross-entropy — it measures how far the model's predicted probability distribution is from the actual next token. Training minimizes this loss across the entire dataset.

Loss = -Σ log P(actual_next_token | context)
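The formula above can be sketched in a few lines. This is a minimal illustration, assuming `probs` holds the probability the model assigned to the actual next token at each position; the function name is ours, not from any library:

```python
import math

def causal_lm_loss(probs):
    """Cross-entropy over a sequence: the sum of negative log-probabilities
    the model assigned to each actual next token."""
    return -sum(math.log(p) for p in probs)

# A model that is confident in the true token gets low loss;
# an uncertain model gets high loss.
confident = causal_lm_loss([0.9, 0.95, 0.9])
uncertain = causal_lm_loss([0.1, 0.2, 0.1])
```

In practice, frameworks compute this from raw logits (for numerical stability) and average it over batches, but the quantity being minimized is the same.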

The Data

Pre-training data comes from internet scrapes, books, academic papers, code repositories, and curated datasets:

Source    | Examples                  | Contribution
Web crawl | Common Crawl, WebText     | General knowledge, language patterns
Books     | Books3, Project Gutenberg | Long-form reasoning, literary knowledge
Code      | GitHub, StackOverflow     | Programming ability, logical structure
Academic  | arXiv, PubMed, Wikipedia  | Technical knowledge, factual grounding
Curated   | Custom licensed datasets  | Quality control, domain coverage

Modern frontier models train on 1-15 trillion tokens. The data is deduplicated, filtered for quality, and sometimes weighted by domain.
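The deduplication and quality-filtering steps can be sketched with toy versions. Real pipelines use fuzzy matching (e.g. MinHash) and learned quality classifiers; this sketch uses exact hashing and a word-count heuristic purely for illustration, and both function names are our own:

```python
import hashlib

def exact_dedupe(docs):
    """Drop byte-identical duplicate documents by hashing each one."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=5):
    """Toy quality heuristic: drop very short documents."""
    return [d for d in docs if len(d.split()) >= min_words]
```

At trillion-token scale, both steps run as distributed jobs over the corpus; the principle, though, is the same as this sketch.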

The Compute

Resource      | Scale
GPUs          | 1,000 - 25,000+ (H100s or A100s)
Training time | 2-6 months
Cost          | $50M - $500M+
Power         | Equivalent of a small town

Pre-training is a massive distributed computing problem. The model weights, gradients, and data are partitioned across thousands of GPUs using parallelism strategies (data parallel, tensor parallel, pipeline parallel).
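Data parallelism, the simplest of these strategies, can be illustrated without any GPU code: each worker computes a gradient on its own shard, then all workers average their gradients so every copy of the model takes the same step. This is a pure-Python sketch of that idea for a toy one-parameter model (real systems use collectives like NCCL all-reduce; the names here are ours):

```python
def shard(dataset, n_workers):
    """Split (x, y) pairs across workers by striding (data parallelism)."""
    return [dataset[i::n_workers] for i in range(n_workers)]

def local_gradient(shard_data, w):
    """Each worker computes the mean-squared-error gradient for the
    toy model y_hat = w * x on its own shard only."""
    return sum(2 * (w * x - y) * x for x, y in shard_data) / len(shard_data)

def all_reduce_mean(grads):
    """Stand-in for the collective all-reduce that averages gradients."""
    return sum(grads) / len(grads)

data = [(1, 3), (2, 6), (3, 9), (4, 12)]   # generated by y = 3x
shards = shard(data, 2)
grads = [local_gradient(s, w=0.0) for s in shards]
averaged = all_reduce_mean(grads)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient exactly, which is why every worker can apply the same update and stay in sync.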

What Emerges

The model isn't explicitly taught grammar, facts, or reasoning. These capabilities emerge from a single objective, next-token prediction, applied at sufficient scale:

  • Grammar and syntax — emerge from statistical patterns in language
  • World knowledge — emerges from predicting factual completions
  • Reasoning — emerges from predicting logical next steps in arguments
  • Code generation — emerges from predicting the next line of code
  • Multilingual ability — emerges from training on text in many languages

Security Relevance

Data poisoning is most effective here. Corrupting pre-training data has the highest impact because it affects the model's fundamental knowledge. The sheer volume of data makes comprehensive auditing impractical.

Memorization happens during pre-training. The model memorizes unique or repeated sequences from training data — including PII, credentials, and proprietary content. This is what training data extraction attacks target.
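One common mitigation is scanning pre-training data for obvious secrets before the model can memorize them. A toy sketch, using only two illustrative patterns (a production scanner covers far more secret formats, and these regexes are simplified):

```python
import re

# Illustrative patterns only; real pipelines use much broader detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
AWS_ACCESS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")

def scan_for_secrets(text):
    """Flag strings that would be risky to memorize if left in training data."""
    return {
        "emails": EMAIL.findall(text),
        "aws_keys": AWS_ACCESS_KEY.findall(text),
    }
```

Flagged documents can then be redacted or dropped; this reduces, but does not eliminate, the extraction risk described above.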

Pre-training data shapes bias. The model inherits biases present in the training corpus. These biases affect outputs and can create liability for enterprises deploying the model.

Cost makes re-training prohibitive. You can't easily "patch" a pre-trained model. If poisoning is discovered, the fix is another multi-month, multi-million-dollar training run.