Availability — Denial of Service & Model Reliability
AI-Specific Availability Threats
Model Denial of Service
Crafted inputs that consume excessive compute resources:
- Context window stuffing: Sending maximum-length inputs to consume GPU memory
- Reasoning loops: Prompts that trigger expensive chain-of-thought processing
- Adversarial latency: Inputs specifically designed to maximize inference time
- Batch poisoning: Flooding batch processing queues with expensive requests
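A first line of defense against context window stuffing is rejecting oversized inputs before they reach the model. A minimal sketch, assuming a rough 4-characters-per-token heuristic and a hypothetical 8,000-token budget (use your provider's real tokenizer and limits in production):

```python
MAX_INPUT_TOKENS = 8_000  # hypothetical budget; tune to your model's context window


def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Swap in your provider's actual tokenizer for production use."""
    return len(text) // 4 + 1


def guard_input(text: str) -> str:
    """Reject inputs that would stuff the context window."""
    if estimate_tokens(text) > MAX_INPUT_TOKENS:
        raise ValueError("input exceeds token budget")
    return text
```

Rejecting early is cheaper than truncating after tokenization, since the goal is to avoid spending GPU memory on hostile inputs at all.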
API Rate Limit Exhaustion
Legitimate-looking queries consume all available capacity. Unlike a traditional DDoS, each request is small on the wire but computationally expensive on the backend.
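Because each request is cheap to send but expensive to serve, rate limits work best when the "cost" charged per request reflects estimated compute (e.g., token count) rather than a flat count. A minimal token-bucket sketch illustrating this idea:

```python
import time


class TokenBucket:
    """Per-key token bucket: refills at `rate` tokens/sec up to `capacity` burst."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Charge `cost` tokens; return False if the bucket is empty."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice you would keep one bucket per user or API key and pass a `cost` proportional to the request's estimated token count, so a single expensive query drains the budget faster than many trivial ones.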
Model Drift
Performance degrades over time as the real-world data distribution shifts away from the training distribution. The model becomes less accurate without any explicit attack.
| Drift Type | Cause | Detection |
|---|---|---|
| Data drift | Input distribution changes | Statistical tests on input features |
| Concept drift | Relationship between inputs and correct outputs changes | Performance metric degradation |
| Feature drift | Specific input features shift in value or distribution | Feature-level monitoring |
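The statistical tests in the table can be as simple as comparing empirical distributions of a feature between training time and production. A self-contained sketch of the two-sample Kolmogorov-Smirnov statistic (the threshold value is an assumption you would tune on your own data):

```python
import bisect


def ks_statistic(ref, cur):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of a reference (training-time) sample and a
    current (production) sample of one numeric feature."""
    ref, cur = sorted(ref), sorted(cur)

    def ecdf(sample, x):
        return bisect.bisect_right(sample, x) / len(sample)  # P(sample <= x)

    points = set(ref) | set(cur)
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in points)


# Alert when the statistic crosses a tuned threshold, e.g. 0.1:
def drifted(ref, cur, threshold=0.1) -> bool:
    return ks_statistic(ref, cur) > threshold
```

Identical distributions score near 0 and fully disjoint ones score 1, so the statistic gives a bounded, unit-free drift signal per feature.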
Dependency Failure
Third-party model API outage. If your application depends on OpenAI, Anthropic, or another provider, their downtime is your downtime.
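The standard mitigation is ordered fallback across providers. A minimal sketch, assuming each provider is wrapped in a callable that raises on timeout or HTTP error (the callable names and `timeout` parameter are illustrative, not any specific SDK's API):

```python
def call_with_fallback(prompt, providers, timeout_s=30):
    """Try each provider in order and return the first success.

    `providers` is a list of (name, fn) pairs, primary first; each fn
    takes a prompt and a timeout and raises on failure."""
    errors = {}
    for name, fn in providers:
        try:
            return fn(prompt, timeout=timeout_s)
        except Exception as exc:  # narrow to network/timeout errors in production
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")
```

Note that fallback providers may return different output formats and quality levels, so the degraded path needs its own testing, not just the happy path.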
Compute Resource Exhaustion
GPU memory attacks, runaway inference costs, or legitimate traffic spikes that exceed provisioned capacity.
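Runaway inference costs can be bounded with a per-user spend cap enforced before each request. A minimal in-memory sketch (the pricing rate and cap are placeholder values; a real deployment would use the provider's actual pricing and a shared store rather than process-local state):

```python
import datetime
from collections import defaultdict


class CostGuard:
    """Per-user daily spend cap. Pricing and cap values are illustrative."""

    def __init__(self, daily_cap_usd: float):
        self.cap = daily_cap_usd
        self.spend = defaultdict(float)
        self.day = datetime.date.today()

    def charge(self, user: str, tokens: int, usd_per_1k: float = 0.002) -> bool:
        """Record the request's estimated cost; return False to refuse it."""
        today = datetime.date.today()
        if today != self.day:  # reset counters at the day boundary
            self.spend.clear()
            self.day = today
        cost = tokens / 1000 * usd_per_1k
        if self.spend[user] + cost > self.cap:
            return False
        self.spend[user] += cost
        return True
```

Pairing this with budget alerts catches both abuse and honest bugs (e.g., a retry loop) before the monthly invoice does.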
Controls
| Control | Purpose | Implementation |
|---|---|---|
| Rate limiting | Cap requests per user, API key, and IP | Token bucket, sliding window, per-endpoint limits |
| Input length limits | Prevent context window stuffing | Truncate or reject inputs exceeding token threshold |
| Timeout enforcement | Kill long-running inference | Hard timeout per request (e.g., 30 seconds max) |
| Circuit breakers | Automatic fallback when error rates spike | Trip at configurable error rate threshold |
| Multi-provider fallback | Reduce single-provider dependency | Route to backup model when primary is unavailable |
| Cost monitoring and alerting | Detect anomalous API spend | Budget alerts, per-user cost caps, anomaly detection |
| Load balancing | Distribute inference across endpoints | Round-robin or least-connections across GPU fleet |
| Response caching | Reduce redundant computation | Cache common query-response pairs |
| Drift monitoring | Detect performance degradation | Continuous evaluation on labeled test sets |
| Capacity planning | Ensure sufficient compute headroom | Load testing, traffic forecasting, auto-scaling |
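To make the circuit-breaker row concrete, here is a minimal sketch that trips after a configurable number of consecutive failures and fails fast until a cooldown elapses (thresholds are placeholders; production breakers typically use error *rates* over a window rather than consecutive counts):

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; after `reset_s`
    seconds it allows one trial call (half-open)."""

    def __init__(self, max_failures: int = 5, reset_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Failing fast while open is what protects the backend: callers get an immediate error (or are routed to a fallback) instead of piling more load onto a struggling model endpoint.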
SLA Considerations
When using third-party AI APIs, the availability you promise customers cannot exceed the availability your AI provider promises you. Build contracts accordingly:
- Document AI provider SLA terms
- Define degraded-service mode when AI is unavailable
- Test fallback paths regularly
- Maintain a non-AI fallback for critical workflows