Pre-Training
What It Is
Pre-training is the first and most expensive phase of building an LLM. By learning to predict the next token across trillions of tokens of text, the model develops general language understanding, world knowledge, and reasoning capabilities.
The Training Objective
Causal language modeling: Given tokens 1 through n, predict token n+1.
The loss function is cross-entropy — it measures how far the model's predicted probability distribution is from the actual next token. Training minimizes this loss across the entire dataset.
Loss = -Σ log P(actual_next_token | context)
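The per-position loss can be sketched in a few lines. This is a toy illustration, not a model: the vocabulary and the predicted distribution below are made-up assumptions standing in for a real softmax over logits.

```python
import math

# Toy vocabulary and a hand-written next-token distribution; in a real model
# these probabilities come from a softmax over the model's logits.
predicted = {"the": 0.1, "cat": 0.2, "sat": 0.6, "mat": 0.1}

def token_loss(probs, actual):
    """Cross-entropy contribution of one position: -log P(actual | context)."""
    return -math.log(probs[actual])

# Loss is low when the model assigns high probability to the true token...
loss_good = token_loss(predicted, "sat")   # -log 0.6 ≈ 0.51
# ...and high when it does not.
loss_bad = token_loss(predicted, "mat")    # -log 0.1 ≈ 2.30
```

Training sums this quantity over every position in the dataset and nudges the weights to reduce it.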
The Data
Pre-training data comes from internet scrapes, books, academic papers, code repositories, and curated datasets:
| Source | Examples | Contribution |
|---|---|---|
| Web crawl | Common Crawl, WebText | General knowledge, language patterns |
| Books | Books3, Project Gutenberg | Long-form reasoning, literary knowledge |
| Code | GitHub, StackOverflow | Programming ability, logical structure |
| Academic | arXiv, PubMed, Wikipedia | Technical knowledge, factual grounding |
| Curated | Custom licensed datasets | Quality control, domain coverage |
Modern frontier models train on 1-15 trillion tokens. The data is deduplicated, filtered for quality, and sometimes weighted by domain.
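The deduplication step can be illustrated with an exact-match sketch: hash a normalized form of each document and keep only the first copy. Real pipelines also do fuzzy dedup (e.g. MinHash over n-grams); this minimal version, with made-up example documents, shows only the exact-match idea.

```python
import hashlib

def normalize(doc: str) -> str:
    # Collapse whitespace and lowercase so trivially-different copies match.
    return " ".join(doc.lower().split())

def deduplicate(docs):
    """Exact-match dedup: keep the first copy of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick brown FOX.",   # duplicate differing only in case/spacing
    "An entirely different page.",
]
print(deduplicate(corpus))  # keeps 2 of the 3 documents
```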
The Compute
| Resource | Scale |
|---|---|
| GPUs | 1,000 - 25,000+ (H100s or A100s) |
| Training time | 2-6 months |
| Cost | $50M - $500M+ |
| Power | Equivalent of a small town |
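Where do figures like these come from? A common back-of-the-envelope estimate for dense transformers is training FLOPs ≈ 6 × N × D (N = parameters, D = training tokens). The model size, token count, per-GPU throughput, and utilization below are illustrative assumptions, not any specific model's numbers.

```python
# Back-of-the-envelope training time using the FLOPs ≈ 6 * N * D rule of
# thumb for dense transformers. All concrete numbers here are assumptions.
params = 70e9          # 70B-parameter model
tokens = 15e12         # 15T training tokens
flops = 6 * params * tokens

gpu_flops = 1e15       # ~1 PFLOP/s peak per GPU at low precision (approx.)
utilization = 0.4      # realistic fraction of peak actually achieved
n_gpus = 10_000

seconds = flops / (gpu_flops * utilization * n_gpus)
print(f"Total: {flops:.2e} FLOPs, ~{seconds / 86400:.0f} days on the cluster")
```

Plugging in different utilization figures or cluster sizes quickly shows why these runs stretch into months.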
Pre-training is a massive distributed computing problem. The model weights, gradients, and data are partitioned across thousands of GPUs using parallelism strategies (data parallel, tensor parallel, pipeline parallel).
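The data-parallel piece of this can be sketched in miniature: each "worker" computes gradients on its own shard of the batch, then the gradients are averaged (the all-reduce step) so every replica applies the same update. The linear model and random data below are toy stand-ins for a real network and real batches.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                      # shared model weights (toy linear model)
X = rng.normal(size=(8, 3))          # global batch of 8 examples
y = X @ np.array([1.0, -2.0, 0.5])   # targets from a known weight vector

def shard_gradient(w, Xs, ys):
    """MSE gradient computed on one worker's shard of the batch."""
    err = Xs @ w - ys
    return 2 * Xs.T @ err / len(ys)

n_workers = 4
for step in range(500):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [shard_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= 0.05 * np.mean(grads, axis=0)   # "all-reduce": average, then update

print(w)  # converges toward [1.0, -2.0, 0.5]
```

Tensor and pipeline parallelism additionally split the model itself (within and across layers) when it is too large for one GPU's memory.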
What Emerges
The model isn't explicitly taught grammar, facts, or reasoning. These capabilities emerge from the objective of predicting the next token well enough at scale:
- Grammar and syntax — emerge from statistical patterns in language
- World knowledge — emerges from predicting factual completions
- Reasoning — emerges from predicting logical next steps in arguments
- Code generation — emerges from predicting the next line of code
- Multilingual ability — emerges from training on text in many languages
Security Relevance
Data poisoning is most effective here. Corrupting pre-training data has the highest impact because it affects the model's fundamental knowledge. The sheer volume of data makes comprehensive auditing impractical.
Memorization happens during pre-training. The model memorizes unique or repeated sequences from training data — including PII, credentials, and proprietary content. This is what training data extraction attacks target.
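A crude memorization signal can be computed by checking whether a model's output reproduces a long verbatim span from the training corpus. The helper below is a hypothetical sketch using whitespace tokens and n-gram set intersection; real extraction audits use proper tokenizers and indexed corpora.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def longest_verbatim_overlap(output: str, corpus: str, max_n: int = 50) -> int:
    """Length (in tokens) of the longest span of `output` that appears
    verbatim in `corpus` — a crude memorization signal."""
    out_toks, corp_toks = output.split(), corpus.split()
    best = 0
    for n in range(1, min(max_n, len(out_toks)) + 1):
        if ngrams(out_toks, n) & ngrams(corp_toks, n):
            best = n          # an n-token span matched; try a longer one
        else:
            break             # no n-gram match implies none longer either
    return best

training_doc = "api key is sk-12345 please keep it secret"
generated = "the api key is sk-12345 please contact support"
print(longest_verbatim_overlap(generated, training_doc))  # → 5
```

A long overlap containing a unique string (here, the fake key `sk-12345`) is exactly the kind of signal extraction attacks exploit.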
Pre-training data shapes bias. The model inherits biases present in the training corpus. These biases affect outputs and can create liability for enterprises deploying the model.
Cost makes re-training prohibitive. You can't easily "patch" a pre-trained model. If poisoning is discovered, the fix is another multi-month, multi-million-dollar training run.