Building a Local Lab

Hardware Requirements

| Use Case | GPU | VRAM | Cost (approx.) |
|---|---|---|---|
| 7-8B models (Llama 3 8B, Mistral 7B) | RTX 4070 Ti | 12GB | $600-800 |
| 13B models (and heavily quantized 70B) | RTX 4090 | 24GB | $1,500-2,000 |
| 70B models (full precision) | 2x A100 80GB | 160GB | Cloud rental |
| Fine-tuning (LoRA) | RTX 4090 or A100 | 24-80GB | $1,500+ or cloud |

For getting started, a single RTX 4090 handles most red team use cases.
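As a rough sanity check when sizing hardware, weight memory is approximately parameter count × bits per weight ÷ 8. This ignores KV cache and runtime overhead, which add a few more GB, so treat the result as a floor. A minimal sketch:

```sh
# Rough weight-memory estimate: params (in billions) * bits per weight / 8 = GB.
# Ignores KV cache and activation overhead; treat the result as a floor.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

weights_gb 8 4    # Llama 3 8B at 4-bit -> 4
weights_gb 70 16  # 70B at fp16 -> 140
```

The 140GB figure for a full-precision 70B model is why that row of the table lands on two 80GB A100s.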

Software Stack

Inference (Running Models)

```sh
# Ollama — simplest option
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3
ollama pull mistral
```
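Once Ollama is running (`ollama serve` listens on port 11434 by default), you can drive it programmatically, which is how most scanning harnesses hook in. A sketch against Ollama's `/api/generate` endpoint; the `build_payload` helper is illustrative, not part of Ollama:

```sh
# Hypothetical helper to assemble the request body for Ollama's REST API.
# Caveat: naive quoting — prompts containing double quotes would need escaping.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Assumes `ollama serve` is running locally on the default port 11434
curl -s http://localhost:11434/api/generate \
  -d "$(build_payload llama3 'Say hello in one word.')"
```

Setting `"stream": false` returns one JSON object per request instead of a token stream, which is easier to log.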

```sh
# vLLM — production API server (OpenAI-compatible)
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B
```

```sh
# llama.cpp — CPU/GPU inference, GGUF format
# (recent versions build with CMake; the old ./main binary is now llama-cli)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
./build/bin/llama-cli -m models/llama-3-8b.Q4_K_M.gguf -p "Hello"
```

Fine-Tuning

```sh
# Axolotl — easiest fine-tuning framework
pip install axolotl
# Configure a LoRA fine-tune in YAML, then run it with Axolotl's CLI
```
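To illustrate, a minimal Axolotl-style LoRA config might look like the sketch below. Field names follow Axolotl's config schema, but the base model, dataset path, and hyperparameters are placeholder assumptions — adjust them for your hardware and data:

```yaml
# Sketch of an Axolotl LoRA config — values are illustrative, not tuned
base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true          # QLoRA-style 4-bit base weights to fit in 24GB
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
datasets:
  - path: data/examples.jsonl   # placeholder dataset path
    type: alpaca
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/lora-run
```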

```sh
# Hugging Face Transformers + PEFT
pip install transformers peft trl datasets
```

Models to Download

| Model | Why | Size |
|---|---|---|
| Llama 3 8B | Fast, capable, good baseline | ~4.5GB (Q4) |
| Mistral 7B | Strong reasoning, efficient | ~4GB (Q4) |
| Llama 3 70B | Closest to frontier-model behavior | ~40GB (Q4) |
| Mixtral 8x7B | MoE architecture, good balance | ~26GB (Q4) |

Lab Setup Checklist

□ GPU with 24GB+ VRAM installed and drivers updated
□ CUDA toolkit installed
□ Ollama installed with Llama 3 and Mistral pulled
□ Python environment with transformers, torch, vllm
□ Garak installed for scanning
□ PyRIT installed for orchestration
□ Test target deployed (local chatbot with system prompt)
□ Logging infrastructure (save all inputs and outputs)
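For the logging item above, even a tiny append-only JSONL log is enough to start. The helper below is a minimal sketch: the `log_interaction` name and record layout are my own, and the naive printf quoting would need proper JSON escaping before logging real prompts:

```sh
# Minimal append-only JSONL logger for prompt/response pairs.
# Caveat: printf does no JSON escaping — quotes in inputs will break records.
LOGFILE="${LOGFILE:-redteam_log.jsonl}"

log_interaction() {
  printf '{"ts":"%s","prompt":"%s","response":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOGFILE"
}

log_interaction "Say hello" "Hello!"
```

One JSON object per line keeps the log greppable and easy to load later with `jq` or pandas for analysis.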