| Use Case | GPU | VRAM | Cost (approx.) |
|---|---|---|---|
| 7-8B models (Llama 3 8B, Mistral 7B) | RTX 4070 Ti | 12GB | $600-800 |
| 13B models (heavily quantized 70B) | RTX 4090 | 24GB | $1,500-2,000 |
| 70B models (full precision) | 2x A100 80GB | 160GB | Cloud rental |
| Fine-tuning (LoRA) | RTX 4090 or A100 | 24-80GB | $1,500+ or cloud |
For getting started, a single RTX 4090 handles most red team use cases.
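The VRAM figures above follow from a simple rule of thumb: weights take parameters × bytes-per-weight, plus headroom for the KV cache and activations. A rough sketch (the 1.2 overhead factor is a ballpark assumption; real usage varies with context length and batch size):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache/activations.

    The overhead factor is an assumption, not a measured value.
    """
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

# Llama 3 8B at 4-bit quantization fits comfortably in 12GB
print(f"{estimate_vram_gb(8, 4):.1f} GB")    # ~4.8 GB
# Llama 3 70B at 16-bit precision needs multi-GPU or cloud
print(f"{estimate_vram_gb(70, 16):.1f} GB")  # ~168 GB
```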
```shell
# Ollama — simplest option
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3
ollama pull mistral
```
```shell
# vLLM — production API server (OpenAI-compatible)
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B
```
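vLLM serves an OpenAI-compatible API, so any OpenAI-style client works against it. A minimal stdlib sketch, assuming the server above is running on its default port 8000 (the prompt and `max_tokens` value are placeholders):

```python
import json
import urllib.request

def build_completion_request(model: str, prompt: str,
                             max_tokens: int = 64) -> tuple[str, bytes]:
    """Build a request for vLLM's OpenAI-compatible /v1/completions endpoint.

    Assumes the default server address http://localhost:8000.
    """
    url = "http://localhost:8000/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return url, json.dumps(payload).encode()

if __name__ == "__main__":
    url, body = build_completion_request("meta-llama/Meta-Llama-3-8B", "Hello")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["text"])
```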
```shell
# llama.cpp — CPU/GPU inference, GGUF format
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
./main -m models/llama-3-8b.Q4_K_M.gguf -p "Hello"
```
```shell
# Axolotl — easiest fine-tuning framework
pip install axolotl
# Configure a LoRA fine-tune in YAML and run
```
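A LoRA fine-tune in Axolotl is driven by a single YAML file passed to its train entrypoint. A minimal sketch: the keys follow Axolotl's config schema, but the dataset path and hyperparameter values here are placeholder assumptions, not recommendations.

```yaml
# Hypothetical minimal LoRA config for Axolotl
base_model: meta-llama/Meta-Llama-3-8B
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
datasets:
  - path: ./data/examples.jsonl   # placeholder dataset
    type: alpaca
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
output_dir: ./lora-out
```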
```shell
# Hugging Face Transformers + PEFT
pip install transformers peft trl datasets
```
| Model | Why | Size |
|---|---|---|
| Llama 3 8B | Fast, capable, good baseline | ~4.5GB (Q4) |
| Mistral 7B | Strong reasoning, efficient | ~4GB (Q4) |
| Llama 3 70B | Closest to frontier-model behavior | ~40GB (Q4) |
| Mixtral 8x7B | MoE architecture, good balance | ~26GB (Q4) |
□ GPU with 24GB+ VRAM installed and drivers updated
□ CUDA toolkit installed
□ Ollama installed with Llama 3 and Mistral pulled
□ Python environment with transformers, torch, vllm
□ Garak installed for scanning
□ PyRIT installed for orchestration
□ Test target deployed (local chatbot with system prompt)
□ Logging infrastructure (save all inputs and outputs)
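For the last checklist item, logging infrastructure can start as simply as appending every exchange to a JSONL file. A minimal sketch; the file path and field names are arbitrary choices, not a required schema:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("redteam_log.jsonl")  # arbitrary location

def log_exchange(prompt: str, response: str, model: str,
                 path: Path = LOG_PATH) -> None:
    """Append one prompt/response pair as a timestamped JSON line."""
    record = {
        "ts": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_exchange("test prompt", "test response", "llama3")
```

Append-only JSONL keeps records replayable and greppable, which matters when you need to reconstruct exactly which input produced a given model behavior.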