Model Architectures
Overview
Not all AI models share the same architecture. Understanding the differences matters for red teaming because different architectures have different vulnerability profiles.
Decoder-Only (Autoregressive)
What it is: Generates text left to right, one token at a time. Each token can only attend to previous tokens (causal masking).
Models: GPT-4, Claude, Llama, Mistral, Gemini
Used for: Chatbots, text generation, code generation, reasoning
Security profile: Susceptible to prompt injection, jailbreaking, and next-token manipulation. The autoregressive nature means early tokens disproportionately influence later generation.
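Causal masking can be illustrated with a toy attention mask. This is a minimal numpy sketch, not any specific model's implementation:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular boolean mask: position i may attend
    only to positions 0..i, never to future tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Row 0 (the first token) sees only itself; row 3 sees all four positions.
# This asymmetry is why early tokens shape everything generated after them.
```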
Encoder-Only
What it is: Processes the entire input bidirectionally (every token attends to every other token). Produces a representation of the input, not generated text.
Models: BERT, RoBERTa, DeBERTa
Used for: Classification, sentiment analysis, named entity recognition, embedding generation
Security profile: Susceptible to adversarial examples for classification evasion. Less relevant for prompt injection since they don't generate text.
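The contrast with the decoder case is the mask shape: every token sees every other token, and the output is a fixed-size representation rather than generated text. A toy sketch (mean pooling is one of several pooling strategies real models use):

```python
import numpy as np

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-only attention: every token attends to every other token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def pooled_representation(token_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool token vectors into one fixed-size vector, the kind of
    representation a classification head would consume."""
    return token_embeddings.mean(axis=0)

emb = np.ones((5, 8))             # 5 tokens, 8-dim embeddings (toy values)
vec = pooled_representation(emb)  # shape (8,) -- a representation, not text
```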
Encoder-Decoder
What it is: Encoder processes the input bidirectionally, decoder generates output autoregressively while attending to the encoder's representation.
Models: T5, BART, Flan-T5
Used for: Translation, summarization, question answering
Security profile: Hybrid vulnerabilities — the encoder side is susceptible to adversarial input perturbation, the decoder side to generation-based attacks.
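The coupling between the two halves is cross-attention: decoder queries attend over all encoder positions. A minimal single-head sketch, assuming toy dimensions:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states: np.ndarray, enc_states: np.ndarray) -> np.ndarray:
    """Each decoder position attends over ALL encoder positions
    (queries from the decoder, keys/values from the encoder)."""
    d = dec_states.shape[-1]
    scores = dec_states @ enc_states.T / np.sqrt(d)   # (T_dec, T_enc)
    weights = softmax(scores)                         # rows sum to 1
    return weights @ enc_states                       # encoder info mixed into each decoder step

rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(3, 4)), rng.normal(size=(6, 4)))
```

A perturbed encoder input therefore shifts every decoder step at once, which is why the two attack surfaces in the security profile above are linked.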
Mixture of Experts (MoE)
What it is: Instead of one massive feed-forward network, MoE uses multiple smaller "expert" networks. A routing mechanism selects which experts process each token. Only a fraction of parameters are active per forward pass.
Models: Mixtral, GPT-4 (rumored), Switch Transformer
Used for: Reducing inference cost while maintaining capacity
Security profile: Expert routing can be manipulated — adversarial inputs might trigger specific experts or avoid the expert that handles safety-relevant processing.
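Top-k routing for a single token can be sketched in a few lines. The expert functions and logits here are toy stand-ins; real routers are learned linear layers:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x: np.ndarray, router_logits: np.ndarray, experts, k: int = 2):
    """Route one token to its top-k experts and mix their outputs
    by the renormalized router probabilities."""
    top = np.argsort(router_logits)[::-1][:k]   # indices of the k best-scoring experts
    gate = softmax(router_logits[top])          # weights over the chosen experts only
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Toy experts: each just scales its input differently.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(np.array([1.0]), np.array([0.1, 0.2, 5.0, 0.3]), experts, k=2)
# Only 2 of the 4 experts ran -- the sparsity that keeps inference cheap,
# and the routing decision an adversary might try to steer.
```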
Diffusion Models
What it is: Generates output by iteratively denoising random noise. Used primarily for images, audio, and video.
Models: Stable Diffusion, DALL-E 2/3, Midjourney
Used for: Image generation, audio synthesis, video generation
Security profile: Susceptible to adversarial perturbation in the latent space, prompt injection via text encoder, and training data memorization (generating recognizable copyrighted images).
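The iterative denoising loop can be sketched as follows. Real models learn the noise predictor; here a stand-in denoiser pulls toward a fixed target so the loop is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -1.0])     # stands in for the clean image/audio sample

x = rng.normal(size=2)             # step 0: pure Gaussian noise
for t in range(50):
    predicted_noise = x - target   # placeholder for the learned noise estimate
    x = x - 0.1 * predicted_noise  # one denoising step removes a fraction of the noise
# After 50 iterations x has converged close to the target.
```

Because output emerges gradually from many small steps, small perturbations early in the trajectory (or in the latent space) can compound into large changes in the final sample.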
Multimodal Models
What it is: Combines multiple input types (text, images, audio, video) into a single model. Typically uses a vision encoder connected to an LLM backbone.
Models: GPT-4V/o, Claude 3 (vision), Gemini, LLaVA
Used for: Image understanding, document analysis, video analysis
Security profile: Cross-modal injection — hiding text instructions in images that the vision encoder reads but humans don't notice. This is a growing attack vector.
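The typical fusion pattern can be sketched as projecting vision-encoder features into the LLM's token-embedding space. The dimensions and random projection here are illustrative assumptions; real systems learn `W_proj` during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 6, 8
W_proj = rng.normal(size=(d_vision, d_llm))   # stand-in for a learned projection

def fuse(image_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Project vision features into the LLM's embedding space and prepend
    them to the text tokens. Past this point the LLM treats image 'tokens'
    and text tokens identically -- which is why instructions hidden in an
    image can behave like injected prompt text."""
    image_tokens = image_feats @ W_proj
    return np.concatenate([image_tokens, text_embeds], axis=0)

seq = fuse(rng.normal(size=(4, d_vision)), rng.normal(size=(5, d_llm)))
# seq has 9 "tokens": 4 derived from the image, 5 from the text.
```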
Model Size Reference
| Model | Parameters | Architecture |
|---|---|---|
| GPT-2 | 1.5B | Decoder-only |
| Llama 2 | 7B / 13B / 70B | Decoder-only |
| Llama 3 | 8B / 70B / 405B | Decoder-only |
| Mixtral 8x7B | 46.7B (12.9B active) | MoE Decoder-only |
| GPT-4 | ~1.8T (rumored) | MoE Decoder-only |
| BERT-large | 340M | Encoder-only |
| T5-XXL | 11B | Encoder-Decoder |