Testing & Exploitation
Test Execution Framework
Phase 1: System Prompt Extraction (30 min)
Work through extraction techniques from least to most sophisticated, and document the full extracted prompt.
Phase 2: Jailbreak Testing (2-4 hours)
Systematic testing against content restrictions:
- Identify restricted categories from the system prompt
- Test each category with escalating techniques:
  - Start with simple direct attempts
  - Escalate to encoding, roleplay, and multi-turn approaches
- Document: technique used, exact prompts, success rate
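The "success rate" item above implies repeating each attempt and counting outcomes. A minimal bookkeeping sketch follows, assuming each attempt is recorded as a `(category, technique, succeeded)` tuple; the function name and the placeholder category labels are hypothetical and not part of any particular tool.

```python
from collections import defaultdict

def success_rates(attempts):
    """Compute the success rate per (category, technique) pair.

    `attempts` is an iterable of (category, technique, succeeded) tuples.
    This layout is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for category, technique, succeeded in attempts:
        entry = counts[(category, technique)]
        if succeeded:
            entry[0] += 1
        entry[1] += 1
    return {key: successes / total for key, (successes, total) in counts.items()}

# Placeholder data: two techniques tried twice each against one category.
attempts = [
    ("category-a", "direct", False),
    ("category-a", "direct", False),
    ("category-a", "roleplay", True),
    ("category-a", "roleplay", False),
]
rates = success_rates(attempts)
```

Keying on the pair rather than the technique alone keeps results comparable across categories, since a technique that works against one restriction may fail against another.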
Phase 3: Prompt Injection (2-4 hours)
Test every data input channel for injection:
| Channel | Test Method |
|---|---|
| Direct user input | Type injection payloads directly |
| RAG documents | Upload documents containing injection payloads |
| Web content | If the AI browses, test with a controlled page containing an injection payload |
| Tool outputs | If tools are available, test whether instructions embedded in tool output are followed |
| File uploads | Embed instructions in uploaded files (PDFs, images with EXIF data) |
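When several channels are tested in the same engagement, it helps to know which one an observed injection came through. One possible approach, sketched here under assumptions, is to place a unique harmless marker in each channel's test content and check which markers show up in the model's response. The channel identifiers and function names below are hypothetical, and the test content itself is left to the tester.

```python
import secrets

# Hypothetical identifiers for the channels listed in the table above.
CHANNELS = [
    "direct_user_input",
    "rag_documents",
    "web_content",
    "tool_outputs",
    "file_uploads",
]

def make_canaries(channels):
    """Assign each channel a unique random marker string."""
    return {ch: f"CANARY-{secrets.token_hex(8)}" for ch in channels}

def triggered_channels(response_text, canaries):
    """Return the channels whose marker appears in the model response."""
    return [ch for ch, marker in canaries.items() if marker in response_text]

canaries = make_canaries(CHANNELS)

# Simulated response in which only the web-content marker surfaced.
sample_response = f"Summary of the page. {canaries['web_content']}"
hits = triggered_channels(sample_response, canaries)
```

Random markers avoid false positives from text the model might produce on its own, and a per-channel marker makes each finding attributable to a single input path.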
Phase 4: Impact Demonstration (1-2 hours)
Prove real-world consequences:
- Data exfiltration: Can the model leak the system prompt, user data, or knowledge base content?
- Unauthorized actions: Can you trigger tool calls the user didn't request?
- Cross-user contamination: Can you affect other users' sessions?
- Persistence: Can you modify the knowledge base or system behavior persistently?
Logging
Record everything:
- Timestamp for each test
- Exact input (copy-paste reproducible)
- Model response (verbatim)
- Success/failure classification
- Notes on partial successes and potential escalation paths
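The fields above map naturally onto a structured record. The following is a minimal sketch, assuming one JSON object per line (JSON Lines) as the storage format; the `LogRecord` class, its field names, and the three outcome labels are illustrative choices rather than a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

ALLOWED_OUTCOMES = {"success", "failure", "partial"}  # assumed labels

@dataclass
class LogRecord:
    timestamp: str      # ISO 8601, UTC
    phase: str          # e.g. "phase-2"
    input_text: str     # exact input, copy-paste reproducible
    response_text: str  # model response, verbatim
    outcome: str        # one of ALLOWED_OUTCOMES
    notes: str = ""     # partial successes, escalation ideas

def new_record(phase, input_text, response_text, outcome, notes=""):
    """Build a record stamped with the current UTC time."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return LogRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        phase=phase,
        input_text=input_text,
        response_text=response_text,
        outcome=outcome,
        notes=notes,
    )

def to_jsonl_line(record):
    """Serialize a record as a single JSON line, preserving non-ASCII text."""
    return json.dumps(asdict(record), ensure_ascii=False)

record = new_record(
    phase="phase-2",
    input_text="<exact test input>",
    response_text="<verbatim model response>",
    outcome="partial",
    notes="Refused at first, complied after rephrasing; retry multi-turn.",
)
line = to_jsonl_line(record)
```

Appending each `line` to a file gives an ordered log that can be replayed or filtered later. Storing input and response as untouched strings keeps each test reproducible.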