Responsible Disclosure for AI Vulnerabilities
Why AI Disclosure Is Different
Traditional vulnerability disclosure has mature processes — CVEs, CVSS scoring, coordinated disclosure timelines. AI vulnerability disclosure is still immature, and several factors make it harder:
- No CVE equivalent. There's no standardized identifier system for AI vulnerabilities. A prompt injection affecting GPT-4 typically doesn't get a CVE.
- Reproducibility is probabilistic. The same jailbreak prompt might work 60% of the time because model outputs are sampled. Most traditional vulns are deterministic: they either work or they don't.
- The "fix" is unclear. Patching a prompt injection isn't like patching a buffer overflow. It may require retraining, fine-tuning, or filter updates — and the fix may break other behavior.
- Severity is subjective. A jailbreak that produces mildly inappropriate text and one that exfiltrates user data are both "prompt injection" but have vastly different impact.
- Disclosure can become the exploit. A published jailbreak template needs no adaptation: anyone can copy-paste it and use it immediately. Traditional exploits usually require targeting and technical skill to weaponize.
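Because reproducibility is probabilistic, any report is stronger if it quantifies a measured success rate rather than asserting "it works". A minimal sketch of measuring that rate, where `attempt_exploit` stands in for whatever call exercises the bypass (the flaky lambda below is a hypothetical stand-in, not a real exploit):

```python
import random

def reproduction_rate(attempt_exploit, attempts=20):
    """Run an exploit attempt repeatedly and report the observed success rate.

    attempt_exploit: a zero-argument callable returning True on success.
    Returns (successes, attempts, rate).
    """
    successes = sum(1 for _ in range(attempts) if attempt_exploit())
    return successes, attempts, successes / attempts

# Hypothetical stand-in for a real attempt: succeeds roughly 60% of the time.
random.seed(1)
flaky_jailbreak = lambda: random.random() < 0.6

s, n, rate = reproduction_rate(flaky_jailbreak, attempts=50)
print(f"works ~{rate:.0%} of the time across {n} attempts")
```

Reporting the rate together with the attempt count (as the template later in this guide asks for) lets the vendor distinguish a one-off fluke from a reliable bypass.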
Vendor Disclosure Programs
Major AI Providers
| Provider | Program | URL | Scope |
|---|---|---|---|
| OpenAI | Bug Bounty (via Bugcrowd) | bugcrowd.com/openai | API vulnerabilities, data exposure. Jailbreaks/safety bypasses NOT in scope for bounty but can be reported. |
| Anthropic | Responsible Disclosure | anthropic.com/responsible-disclosure | Security vulnerabilities in systems and infrastructure. Safety issues reported through separate channels. |
| Google (DeepMind) | Google VRP | bughunters.google.com | AI-specific vulnerabilities in Google products. Includes model manipulation, training data extraction. |
| Meta | Bug Bounty + AI Red Team | facebook.com/whitehat | Llama model vulnerabilities, platform AI features. |
| Microsoft | MSRC + AI Red Team | msrc.microsoft.com | Copilot, Azure AI, Bing AI vulnerabilities. |
| Hugging Face | Security reporting | huggingface.co/security | Model hub vulnerabilities, malicious models, infrastructure issues. |
What's Typically In Scope
| Category | Usually In Scope | Usually Out of Scope |
|---|---|---|
| Infrastructure vulns | Yes — SSRF, auth bypass, data exposure | |
| Training data extraction | Yes — PII or sensitive data recovered | General memorization without sensitive content |
| Cross-user data leakage | Yes — accessing another user's data | |
| System prompt extraction | Varies — some treat as informational | Often out of scope for bounty |
| Jailbreaks | No (for bounty) | Report for safety team review |
| Model output quality | No | Hallucinations, factual errors |
| Bias | No (for bug bounty) | Report through responsible AI channels |
How to Report
Step 1: Classify the Finding
| Classification | Description | Urgency |
|---|---|---|
| Security vulnerability | Infrastructure exploit, data exposure, auth bypass | Report immediately via security channel |
| Safety bypass with impact | Jailbreak that enables harmful actions (tool abuse, data exfil) | Report within 24-48 hours |
| Safety bypass without impact | Jailbreak that produces restricted text only | Report at your convenience |
| Prompt injection (indirect) | Third-party content can hijack model behavior | Report within 48 hours; often higher impact than direct jailbreaks |
| Model behavior issue | Bias, hallucination, quality degradation | Report through product feedback channels |
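The classification table above maps directly to a lookup, and encoding it keeps triage decisions consistent across reporters. A sketch with illustrative key names (none of these identifiers come from any vendor program):

```python
# Illustrative mapping of the classification table to reporting urgency.
URGENCY = {
    "security_vulnerability": "report immediately via security channel",
    "safety_bypass_with_impact": "report within 24-48 hours",
    "safety_bypass_without_impact": "report at your convenience",
    "indirect_prompt_injection": "report within 48 hours",
    "model_behavior_issue": "report through product feedback channels",
}

def triage(classification: str) -> str:
    """Look up the recommended reporting urgency for a finding class."""
    try:
        return URGENCY[classification]
    except KeyError:
        raise ValueError(f"unknown classification: {classification!r}")

print(triage("indirect_prompt_injection"))  # report within 48 hours
```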
Step 2: Document the Finding
Include in your report:
## Summary
[One sentence: what the vulnerability is and why it matters]
## Affected System
[Model name, version if known, API or web interface, specific feature]
## Reproduction Steps
1. [Exact steps to reproduce]
2. [Include exact prompts — copy-paste ready]
3. [Note any required preconditions]
## Observed Behavior
[What the model did — include exact output if possible]
## Expected Behavior
[What the model should have done]
## Reproduction Rate
[Approximate percentage: "works ~70% of the time across 20 attempts"]
## Impact Assessment
[What an attacker could achieve with this vulnerability]
[Data at risk, unauthorized actions possible, affected users]
## Suggested Mitigation
[If you have ideas for how to fix it — optional but appreciated]
## Environment
[Date/time of testing, browser/API client used, account type]
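When you're testing many prompt variants, filling the template by hand gets error-prone. A sketch that renders a report from structured fields; the field names are assumptions chosen to match the sections above:

```python
# Assumed field names mirroring the report template sections above.
REPORT_TEMPLATE = """\
## Summary
{summary}

## Affected System
{affected_system}

## Reproduction Steps
{steps}

## Observed Behavior
{observed}

## Expected Behavior
{expected}

## Reproduction Rate
{rate}

## Impact Assessment
{impact}

## Suggested Mitigation
{mitigation}

## Environment
{environment}
"""

def render_report(**fields) -> str:
    """Fill the disclosure template; reproduction steps are numbered from a list."""
    fields["steps"] = "\n".join(
        f"{i}. {step}" for i, step in enumerate(fields["steps"], start=1)
    )
    return REPORT_TEMPLATE.format(**fields)
```

Keeping the prompts and outputs in structured fields also gives you the record the "document everything" guidance below asks for.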
Step 3: Submit Through the Right Channel
- Security vulnerabilities: Use the vendor's security reporting page, not public forums
- Safety issues: Use the dedicated safety reporting mechanism if available
- No response in 5 business days: Send a follow-up. If no response in 15 business days, consider escalating through CERT/CC or the AI Incident Database
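The follow-up windows above are business days, while the 90-day disclosure window in the next step is usually calendar days. A small sketch that computes all three dates from the submission date (the weekend-skipping logic is a simplification that ignores public holidays):

```python
from datetime import date, timedelta

def add_business_days(start: date, days: int) -> date:
    """Advance a date by a number of business days, skipping weekends."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday (0) through Friday (4)
            days -= 1
    return current

def disclosure_schedule(reported: date) -> dict:
    """Key dates after submitting a report: follow up at 5 business days,
    escalate at 15 business days, 90 calendar days for the disclosure window."""
    return {
        "follow_up": add_business_days(reported, 5),
        "escalate": add_business_days(reported, 15),
        "disclosure_window_ends": reported + timedelta(days=90),
    }

print(disclosure_schedule(date(2025, 1, 6)))  # report submitted on a Monday
```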
Step 4: Coordinate Disclosure
- Follow the vendor's stated disclosure timeline (typically 90 days)
- For AI vulns, consider longer timelines — fixes may require retraining
- Don't publish working jailbreak prompts before the vendor has had time to respond
- If publishing research, consider redacting the specific bypass technique while describing the vulnerability class
Disclosure Dos and Don'ts
Do:
- Report through official channels first
- Provide clear reproduction steps
- Assess and communicate real-world impact
- Give the vendor reasonable time to respond
- Document everything for your records
Don't:
- Test on production systems beyond what's needed to confirm the issue
- Access, store, or exfiltrate other users' data during testing
- Publish working exploits before coordinated disclosure
- Overstate severity — "I jailbroke ChatGPT" is different from "I extracted user data"
- Threaten the vendor or demand payment outside of formal bug bounty programs
For Organizations: Building Your Own AI Disclosure Program
If you deploy AI-powered products, you need a process for receiving AI vulnerability reports:
Minimum Requirements
- Dedicated intake channel — separate from traditional security bugs. AI reports need reviewers who understand prompt injection, not just web app vulns.
- Defined scope — clearly state what's in scope (infrastructure, data leakage, injection) and what's not (jailbreaks that only produce text, hallucinations).
- Response SLA — acknowledge receipt within 48 hours, triage within 5 business days.
- AI-specific severity framework — traditional CVSS doesn't capture AI risks well. Define your own:
| Severity | Criteria |
|---|---|
| Critical | Data exfiltration, unauthorized actions, cross-user impact |
| High | Reliable system prompt extraction that exposes credentials, persistent injection |
| Medium | System prompt extraction (no credentials exposed), unreliable jailbreak enabling tool abuse |
| Low | Jailbreak producing restricted text, information disclosure without sensitive data |
| Informational | Theoretical risk, defense recommendations |
- Remediation process — define who triages AI reports, how fixes are tested, and what "fixed" means (is a filter patch sufficient, or does this need retraining?).
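The example severity framework above can be encoded so every report is scored the same way. A sketch; the boolean flag names are assumptions, and the checks mirror the table's ordering from most to least severe:

```python
def classify_severity(*, data_exfil=False, unauthorized_actions=False,
                      cross_user=False, prompt_extraction=False,
                      reveals_credentials=False, persistent_injection=False,
                      tool_abuse=False, restricted_text_only=False):
    """Map finding attributes to the example severity framework.

    Criteria are checked from most to least severe, so a finding that
    matches several rows gets the highest applicable rating.
    """
    if data_exfil or unauthorized_actions or cross_user:
        return "Critical"
    if (prompt_extraction and reveals_credentials) or persistent_injection:
        return "High"
    if prompt_extraction or tool_abuse:
        return "Medium"
    if restricted_text_only:
        return "Low"
    return "Informational"

print(classify_severity(prompt_extraction=True))  # Medium: no credentials exposed
```

A real program would extend the flags with product-specific criteria; the point is that the mapping is explicit and testable rather than left to each triager's judgment.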
Industry Resources
- AI Incident Database (AIID): Tracks real-world AI failures and incidents — useful for understanding impact patterns
- AVID (AI Vulnerability Database): Community effort to catalog AI vulnerabilities with structured reports
- MITRE ATLAS: Use ATLAS technique IDs in your reports for standardized classification
- OWASP LLM Top 10: Reference for categorizing findings