Pre-Training
What It Is
Pre-training is the first and most expensive phase of building an LLM. By learning to predict the next token across trillions of tokens of text, the model develops general language understanding, world knowledge, and reasoning capabilities.
The Training Objective
Causal language modeling: Given tokens 1 through n, predict token n+1.
The loss function is cross-entropy — it measures how far the model's predicted probability distribution is from the actual next token. Training minimizes this loss across the entire dataset.
Loss = -Σ log P(actual_next_token | context)
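The per-position loss can be sketched in a few lines. This is a toy illustration, not a model: the vocabulary and the predicted distribution below are made-up assumptions standing in for a real softmax over logits.

```python
import math

# Toy vocabulary and a hand-written next-token distribution; in a real model
# these probabilities come from a softmax over the model's logits.
predicted = {"the": 0.1, "cat": 0.2, "sat": 0.6, "mat": 0.1}

def token_loss(probs, actual):
    """Cross-entropy contribution of one position: -log P(actual | context)."""
    return -math.log(probs[actual])

# Loss is low when the model assigns high probability to the true token...
loss_good = token_loss(predicted, "sat")   # -log 0.6 ≈ 0.51
# ...and high when it does not.
loss_bad = token_loss(predicted, "mat")    # -log 0.1 ≈ 2.30
```

Training sums this quantity over every position in the dataset and nudges the weights to reduce it.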
The Data
Pre-training data comes from internet scrapes, books, academic papers, code repositories, and curated datasets:
| Source | Examples | Contribution |
|---|---|---|
| Web crawl | Common Crawl, WebText | General knowledge, language patterns |
| Books | Books3, Project Gutenberg | Long-form reasoning, literary knowledge |
| Code | GitHub, StackOverflow | Programming ability, logical structure |
| Academic | arXiv, PubMed, Wikipedia | Technical knowledge, factual grounding |
| Curated | Custom licensed datasets | Quality control, domain coverage |
Modern frontier models train on 1-15 trillion tokens. The data is deduplicated, filtered for quality, and sometimes weighted by domain.
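The deduplication step can be illustrated with an exact-match sketch: hash a normalized form of each document and keep only the first copy. Real pipelines also do fuzzy dedup (e.g. MinHash over n-grams); this minimal version, with made-up example documents, shows only the exact-match idea.

```python
import hashlib

def normalize(doc: str) -> str:
    # Collapse whitespace and lowercase so trivially-different copies match.
    return " ".join(doc.lower().split())

def deduplicate(docs):
    """Exact-match dedup: keep the first copy of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick brown FOX.",   # duplicate differing only in case/spacing
    "An entirely different page.",
]
print(deduplicate(corpus))  # keeps 2 of the 3 documents
```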
The Compute
| Resource | Scale |
|---|---|
| GPUs | 1,000 - 25,000+ (H100s or A100s) |
| Training time | 2-6 months |
| Cost | $50M - $500M+ |
| Power | Equivalent of a small town |
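Where do figures like these come from? A common back-of-the-envelope estimate for dense transformers is training FLOPs ≈ 6 × N × D (N = parameters, D = training tokens). The model size, token count, per-GPU throughput, and utilization below are illustrative assumptions, not any specific model's numbers.

```python
# Back-of-the-envelope training time using the FLOPs ≈ 6 * N * D rule of
# thumb for dense transformers. All concrete numbers here are assumptions.
params = 70e9          # 70B-parameter model
tokens = 15e12         # 15T training tokens
flops = 6 * params * tokens

gpu_flops = 1e15       # ~1 PFLOP/s peak per GPU at low precision (approx.)
utilization = 0.4      # realistic fraction of peak actually achieved
n_gpus = 10_000

seconds = flops / (gpu_flops * utilization * n_gpus)
print(f"Total: {flops:.2e} FLOPs, ~{seconds / 86400:.0f} days on the cluster")
```

Plugging in different utilization figures or cluster sizes quickly shows why these runs stretch into months.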
Pre-training is a massive distributed computing problem. The model weights, gradients, and data are partitioned across thousands of GPUs using parallelism strategies (data parallel, tensor parallel, pipeline parallel).
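The data-parallel piece of this can be sketched in miniature: each "worker" computes gradients on its own shard of the batch, then the gradients are averaged (the all-reduce step) so every replica applies the same update. The linear model and random data below are toy stand-ins for a real network and real batches.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                      # shared model weights (toy linear model)
X = rng.normal(size=(8, 3))          # global batch of 8 examples
y = X @ np.array([1.0, -2.0, 0.5])   # targets from a known weight vector

def shard_gradient(w, Xs, ys):
    """MSE gradient computed on one worker's shard of the batch."""
    err = Xs @ w - ys
    return 2 * Xs.T @ err / len(ys)

n_workers = 4
for step in range(500):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [shard_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= 0.05 * np.mean(grads, axis=0)   # "all-reduce": average, then update

print(w)  # converges toward [1.0, -2.0, 0.5]
```

Tensor and pipeline parallelism additionally split the model itself (within and across layers) when it is too large for one GPU's memory.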
What Emerges
The model isn't explicitly taught grammar, facts, or reasoning. These capabilities emerge from the objective of predicting the next token well enough at scale:
- Grammar and syntax — emerge from statistical patterns in language
- World knowledge — emerges from predicting factual completions
- Reasoning — emerges from predicting logical next steps in arguments
- Code generation — emerges from predicting the next line of code
- Multilingual ability — emerges from training on text in many languages
Security Relevance
Data poisoning is most effective here. Corrupting pre-training data has the highest impact because it affects the model's fundamental knowledge. The sheer volume of data makes comprehensive auditing impractical.
Memorization happens during pre-training. The model memorizes unique or repeated sequences from training data — including PII, credentials, and proprietary content. This is what training data extraction attacks target.
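A crude memorization signal can be computed by checking whether a model's output reproduces a long verbatim span from the training corpus. The helper below is a hypothetical sketch using whitespace tokens and n-gram set intersection; real extraction audits use proper tokenizers and indexed corpora.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def longest_verbatim_overlap(output: str, corpus: str, max_n: int = 50) -> int:
    """Length (in tokens) of the longest span of `output` that appears
    verbatim in `corpus` — a crude memorization signal."""
    out_toks, corp_toks = output.split(), corpus.split()
    best = 0
    for n in range(1, min(max_n, len(out_toks)) + 1):
        if ngrams(out_toks, n) & ngrams(corp_toks, n):
            best = n          # an n-token span matched; try a longer one
        else:
            break             # no n-gram match implies none longer either
    return best

training_doc = "api key is sk-12345 please keep it secret"
generated = "the api key is sk-12345 please contact support"
print(longest_verbatim_overlap(generated, training_doc))  # → 5
```

A long overlap containing a unique string (here, the fake key `sk-12345`) is exactly the kind of signal extraction attacks exploit.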
Pre-training data shapes bias. The model inherits biases present in the training corpus. These biases affect outputs and can create liability for enterprises deploying the model.
Cost makes re-training prohibitive. You can't easily "patch" a pre-trained model. If poisoning is discovered, the fix is another multi-month, multi-million-dollar training run.