Testing & Exploitation

Test Execution Framework

Phase 1: System Prompt Extraction (30 min)

Run through the extraction techniques in order of increasing sophistication, starting with the simplest. Document the full extracted prompt verbatim.

Phase 2: Jailbreak Testing (2-4 hours)

Systematic testing against content restrictions:

  1. Identify restricted categories from the system prompt
  2. Test each category with escalating techniques
  3. Start with simple direct attempts
  4. Escalate to encoding, roleplay, and multi-turn techniques
  5. Document the technique used, the exact prompts, and the success rate
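
The success rate in step 5 is easier to compare across techniques if it is computed the same way every time. The following is a minimal sketch, assuming each attempt is recorded as a (category, technique, succeeded) tuple; the tuple layout and the function name `success_rates` are illustrative assumptions, not part of any particular tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (restricted category, technique, did it succeed?)
Attempt = Tuple[str, str, bool]


def success_rates(attempts: Iterable[Attempt]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of successful attempts per (category, technique)."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    wins: Dict[Tuple[str, str], int] = defaultdict(int)
    for category, technique, succeeded in attempts:
        key = (category, technique)
        totals[key] += 1
        if succeeded:
            wins[key] += 1
    # Every key in totals has at least one attempt, so no division by zero.
    return {key: wins[key] / totals[key] for key in totals}


if __name__ == "__main__":
    sample = [
        ("category-a", "direct", False),
        ("category-a", "direct", False),
        ("category-a", "roleplay", True),
        ("category-a", "roleplay", False),
    ]
    for (category, technique), rate in sorted(success_rates(sample).items()):
        print(f"{category:12s} {technique:10s} {rate:.0%}")
```

Reporting the rate alongside the number of attempts keeps a single lucky success from looking like a reliable bypass.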

Phase 3: Prompt Injection (2-4 hours)

Test every data input channel for injection:

Channel            Test Method
Direct user input  Type injection payloads directly
RAG documents      Upload documents containing injection
Web content        If AI browses, test with a controlled page containing injection
Tool outputs       If tools are available, test if tool output can contain injection
File uploads       Embed instructions in uploaded files (PDFs, images with EXIF data)
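
One way to tell which channel an injection came through is to give each channel its own unique marker string and then look for those markers in the model's output. The sketch below assumes that approach; the channel names, the `CANARY-` prefix, and both function names are hypothetical, and the code only generates and detects markers, leaving the test content itself to the tester.

```python
import secrets
from typing import Dict, List

# Hypothetical identifiers for the channels listed in the table above.
CHANNELS = [
    "direct_user_input",
    "rag_documents",
    "web_content",
    "tool_outputs",
    "file_uploads",
]


def make_canaries(channels: List[str]) -> Dict[str, str]:
    """Assign each channel a random marker that is unlikely to occur by chance."""
    return {channel: f"CANARY-{secrets.token_hex(8)}" for channel in channels}


def channels_triggered(response: str, canaries: Dict[str, str]) -> List[str]:
    """Return the channels whose marker appears verbatim in the model response."""
    return [channel for channel, token in canaries.items() if token in response]


if __name__ == "__main__":
    canaries = make_canaries(CHANNELS)
    # Simulated response in which only the RAG document marker surfaced.
    simulated = f"Summary of the document: {canaries['rag_documents']}"
    print(channels_triggered(simulated, canaries))
```

A marker that shows up in the output is evidence that content from that channel reached the model and influenced its response; a missing marker is not proof that the channel is safe, only that this particular test did not surface it.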

Phase 4: Impact Demonstration (1-2 hours)

Prove real-world consequences:

  • Data exfiltration: Can the model leak the system prompt, user data, or knowledge base content?
  • Unauthorized actions: Can you trigger tool calls the user didn't request?
  • Cross-user contamination: Can you affect other users' sessions?
  • Persistence: Can you modify the knowledge base or system behavior persistently?

Logging

Record everything:

  • Timestamp for each test
  • Exact input (copy-paste reproducible)
  • Model response (verbatim)
  • Success/failure classification
  • Notes on partial successes and potential escalation paths
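
The fields above map naturally onto one structured record per test. The following is a minimal sketch, assuming a JSON Lines log file with one record per line; the `TestRecord` class, its field names, and the three-way `Outcome` classification are illustrative assumptions rather than a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Outcome(str, Enum):
    """Assumed classification; 'partial' covers the partial successes noted above."""
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILURE = "failure"


def _utc_now() -> str:
    # ISO 8601 timestamp in UTC, so records from different machines sort together.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class TestRecord:
    phase: str        # e.g. "jailbreak" or "prompt_injection"
    technique: str    # short label for the technique used
    prompt: str       # exact input, copy-paste reproducible
    response: str     # model response, verbatim
    outcome: Outcome
    notes: str = ""   # partial successes, possible escalation paths
    timestamp: str = field(default_factory=_utc_now)


def to_jsonl_line(record: TestRecord) -> str:
    """Serialize one record as a single JSON line, preserving non-ASCII text."""
    return json.dumps(asdict(record), ensure_ascii=False)


def append_record(path: str, record: TestRecord) -> None:
    """Append the record to a JSON Lines file, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(to_jsonl_line(record) + "\n")
```

Storing the prompt and response as JSON strings keeps newlines and special characters intact, which matters when a test has to be replayed exactly.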