Availability — Denial of Service & Model Reliability

AI-Specific Availability Threats

Model Denial of Service

Crafted inputs that consume excessive compute resources:

  • Context window stuffing: Sending maximum-length inputs to consume GPU memory
  • Reasoning loops: Prompts that trigger expensive chain-of-thought processing
  • Adversarial latency: Inputs specifically designed to maximize inference time
  • Batch poisoning: Flooding batch processing queues with expensive requests
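One cheap mitigation for context window stuffing is admission control: reject oversized prompts before they reach the GPU. A minimal sketch, using a whitespace-split word count as a stand-in for a real tokenizer (in practice you would count tokens with the model's own tokenizer), and an assumed budget `MAX_INPUT_TOKENS`:

```python
MAX_INPUT_TOKENS = 4096  # assumed budget; tune to your model's context window

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (e.g., the model's BPE vocabulary).
    return len(text.split())

def admit_request(prompt: str) -> bool:
    """Reject oversized prompts before any compute is spent on them."""
    return count_tokens(prompt) <= MAX_INPUT_TOKENS
```

Rejecting (rather than silently truncating) makes the limit visible to clients and keeps attackers from probing for the truncation boundary.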

API Rate Limit Exhaustion

Legitimate-looking queries consuming all available capacity. Unlike traditional DDoS, each request is small but computationally expensive on the backend.
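The standard defense is per-principal rate limiting. A minimal token-bucket sketch (one bucket per user, API key, or IP; `rate` and `capacity` are assumed tuning parameters):

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Because each AI request is expensive on the backend, the bucket can also be charged in proportion to estimated cost (e.g., input tokens) rather than one token per request.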

Model Drift

Performance degrades over time as the real-world data distribution shifts away from the training distribution. The model becomes less accurate without any explicit attack.

Drift Type    | Cause                                                   | Detection
Data drift    | Input distribution changes                              | Statistical tests on input features
Concept drift | Relationship between inputs and correct outputs changes | Performance metric degradation
Feature drift | Specific input features shift in value or distribution  | Feature-level monitoring
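A common statistical test for data drift is the two-sample Kolmogorov-Smirnov statistic, which compares a live feature window against a training-time reference. A self-contained sketch (production code would use a library implementation such as `scipy.stats.ks_2samp`; the threshold below is an assumed value you would calibrate on historical windows):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # Fraction of points <= x (linear scan; binary search scales better).
        return sum(1 for v in sorted_xs if v <= x) / len(sorted_xs)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

DRIFT_THRESHOLD = 0.2  # assumed alert threshold; calibrate per feature

def drifted(reference_feature, live_feature) -> bool:
    return ks_statistic(reference_feature, live_feature) > DRIFT_THRESHOLD
```

Note this catches data drift only; concept drift still requires tracking performance metrics on labeled samples, since inputs can look unchanged while their correct outputs shift.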

Dependency Failure

Third-party model API outage. If your application depends on OpenAI, Anthropic, or another provider, their downtime is your downtime.
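The corresponding control is a fallback path: when the primary provider fails, route to a backup model instead of failing the user request. A sketch with hypothetical `call_primary` / `call_backup` placeholders standing in for real provider SDK calls:

```python
class ProviderUnavailable(Exception):
    """Raised when a provider call fails (timeout, 5xx, connection error)."""

def call_primary(prompt: str) -> str:
    # Placeholder for the primary provider's API call (hypothetical).
    raise ProviderUnavailable("primary provider outage")

def call_backup(prompt: str) -> str:
    # Placeholder for a backup provider or self-hosted model (hypothetical).
    return "backup response"

def complete(prompt: str) -> str:
    """Try the primary provider; degrade to the backup rather than erroring."""
    try:
        return call_primary(prompt)
    except ProviderUnavailable:
        return call_backup(prompt)
```

Fallback models often differ in quality and prompt format, so the fallback path needs its own testing, not just the happy path.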

Compute Resource Exhaustion

GPU memory attacks, runaway inference costs, or legitimate traffic spikes that exceed provisioned capacity.
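Runaway inference cost is usually contained with per-user spend caps. A minimal sketch, assuming a daily budget and an external job that resets the counters each day (both values are illustrative):

```python
from collections import defaultdict

USER_DAILY_BUDGET_USD = 10.0  # assumed per-user cap

spend = defaultdict(float)  # user_id -> spend so far today (reset by a daily job)

def charge(user_id: str, cost_usd: float) -> bool:
    """Record inference cost; deny requests that would exceed the cap."""
    if spend[user_id] + cost_usd > USER_DAILY_BUDGET_USD:
        return False
    spend[user_id] += cost_usd
    return True
```

Checking the cap before serving the request (using an estimated cost) prevents a single oversized request from blowing past the budget.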

Controls

Control                      | Purpose                                 | Implementation
Rate limiting                | Cap requests per user, API key, and IP  | Token bucket, sliding window, per-endpoint limits
Input length limits          | Prevent context window stuffing         | Truncate or reject inputs exceeding a token threshold
Timeout enforcement          | Kill long-running inference             | Hard timeout per request (e.g., 30 seconds max)
Circuit breakers             | Automatic fallback when error rates spike | Trip at a configurable error-rate threshold
Multi-provider fallback      | Reduce single-provider dependency       | Route to a backup model when the primary is unavailable
Cost monitoring and alerting | Detect anomalous API spend              | Budget alerts, per-user cost caps, anomaly detection
Load balancing               | Distribute inference across endpoints   | Round-robin or least-connections across the GPU fleet
Response caching             | Reduce redundant computation            | Cache common query-response pairs
Drift monitoring             | Detect performance degradation          | Continuous evaluation on labeled test sets
Capacity planning            | Ensure sufficient compute headroom      | Load testing, traffic forecasting, auto-scaling
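Of the controls above, the circuit breaker is the one most often hand-rolled. A deliberately simplified sketch that trips after a number of consecutive failures (real breakers typically add a time window and a half-open probing state before fully re-closing):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; callers then skip the backend."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0  # any success closes the breaker again

    def record_failure(self) -> None:
        self.failures += 1
```

While the breaker is open, requests go straight to the fallback path (backup provider, cached response, or degraded mode) instead of piling timeouts onto a struggling backend.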

SLA Considerations

When using third-party AI APIs, the availability you promise customers can't exceed what your AI provider promises you, and each additional serial dependency lowers the achievable ceiling further. Build contracts accordingly:

  • Document AI provider SLA terms
  • Define degraded-service mode when AI is unavailable
  • Test fallback paths regularly
  • Maintain a non-AI fallback for critical workflows
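The arithmetic behind the ceiling: for serially dependent components with (assumed) independent failures, best-case availability is the product of the individual availabilities. For example, a stack targeting 99.9% on top of a provider offering 99.5% can promise at most about 99.4%:

```python
def composite_availability(*availabilities: float) -> float:
    """Best-case availability of serial dependencies, assuming independent failures."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Your stack at 99.9% depending on a provider at 99.5%:
print(round(composite_availability(0.999, 0.995), 4))  # → 0.994
```

This is why a degraded-service mode or non-AI fallback matters: it decouples your customer-facing SLA from any single provider's.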