Availability — Denial of Service & Model Reliability

AI-Specific Availability Threats

Model Denial of Service

Crafted inputs that consume excessive compute resources:

  • Context window stuffing: Sending maximum-length inputs to consume GPU memory
  • Reasoning loops: Prompts that trigger expensive chain-of-thought processing
  • Adversarial latency: Inputs specifically designed to maximize inference time
  • Batch poisoning: Flooding batch processing queues with expensive requests
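One cheap mitigation for context window stuffing is admission control: reject oversized prompts before they reach the GPU. A minimal sketch, using a whitespace-split word count as a stand-in for a real tokenizer (in practice you would count tokens with the model's own tokenizer), and an assumed budget `MAX_INPUT_TOKENS`:

```python
MAX_INPUT_TOKENS = 4096  # assumed budget; tune to your model's context window

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (e.g., the model's BPE vocabulary).
    return len(text.split())

def admit_request(prompt: str) -> bool:
    """Reject oversized prompts before any compute is spent on them."""
    return count_tokens(prompt) <= MAX_INPUT_TOKENS
```

Rejecting (rather than silently truncating) makes the limit visible to clients and keeps attackers from probing for the truncation boundary.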

API Rate Limit Exhaustion

Legitimate-looking queries consuming all available capacity. Unlike traditional DDoS, each request is small but computationally expensive on the backend.
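The standard defense is per-principal rate limiting. A minimal token-bucket sketch (one bucket per user, API key, or IP; `rate` and `capacity` are assumed tuning parameters):

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Because each AI request is expensive on the backend, the bucket can also be charged in proportion to estimated cost (e.g., input tokens) rather than one token per request.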

Model Drift

Performance degrades over time as the real-world data distribution shifts away from the training distribution. The model becomes less accurate without any explicit attack.

Drift Type    | Cause                                                   | Detection
Data drift    | Input distribution changes                              | Statistical tests on input features
Concept drift | Relationship between inputs and correct outputs changes | Performance metric degradation
Feature drift | Specific input features shift in value or distribution  | Feature-level monitoring
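A common statistical test for data drift is the two-sample Kolmogorov-Smirnov statistic, which compares a live feature window against a training-time reference. A self-contained sketch (production code would use a library implementation such as `scipy.stats.ks_2samp`; the threshold below is an assumed value you would calibrate on historical windows):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # Fraction of points <= x (linear scan; binary search scales better).
        return sum(1 for v in sorted_xs if v <= x) / len(sorted_xs)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

DRIFT_THRESHOLD = 0.2  # assumed alert threshold; calibrate per feature

def drifted(reference_feature, live_feature) -> bool:
    return ks_statistic(reference_feature, live_feature) > DRIFT_THRESHOLD
```

Note this catches data drift only; concept drift still requires tracking performance metrics on labeled samples, since inputs can look unchanged while their correct outputs shift.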

Dependency Failure

Third-party model API outage. If your application depends on OpenAI, Anthropic, or another provider, their downtime is your downtime.
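The corresponding control is a fallback path: when the primary provider fails, route to a backup model instead of failing the user request. A sketch with hypothetical `call_primary` / `call_backup` placeholders standing in for real provider SDK calls:

```python
class ProviderUnavailable(Exception):
    """Raised when a provider call fails (timeout, 5xx, connection error)."""

def call_primary(prompt: str) -> str:
    # Placeholder for the primary provider's API call (hypothetical).
    raise ProviderUnavailable("primary provider outage")

def call_backup(prompt: str) -> str:
    # Placeholder for a backup provider or self-hosted model (hypothetical).
    return "backup response"

def complete(prompt: str) -> str:
    """Try the primary provider; degrade to the backup rather than erroring."""
    try:
        return call_primary(prompt)
    except ProviderUnavailable:
        return call_backup(prompt)
```

Fallback models often differ in quality and prompt format, so the fallback path needs its own testing, not just the happy path.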

Compute Resource Exhaustion

GPU memory attacks, runaway inference costs, or legitimate traffic spikes that exceed provisioned capacity.
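Runaway inference cost is usually contained with per-user spend caps. A minimal sketch, assuming a daily budget and an external job that resets the counters each day (both values are illustrative):

```python
from collections import defaultdict

USER_DAILY_BUDGET_USD = 10.0  # assumed per-user cap

spend = defaultdict(float)  # user_id -> spend so far today (reset by a daily job)

def charge(user_id: str, cost_usd: float) -> bool:
    """Record inference cost; deny requests that would exceed the cap."""
    if spend[user_id] + cost_usd > USER_DAILY_BUDGET_USD:
        return False
    spend[user_id] += cost_usd
    return True
```

Checking the cap before serving the request (using an estimated cost) prevents a single oversized request from blowing past the budget.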

Controls

Control                      | Purpose                                 | Implementation
Rate limiting                | Cap requests per user, API key, and IP  | Token bucket, sliding window, per-endpoint limits
Input length limits          | Prevent context window stuffing         | Truncate or reject inputs exceeding a token threshold
Timeout enforcement          | Kill long-running inference             | Hard timeout per request (e.g., 30 seconds max)
Circuit breakers             | Automatic fallback when error rates spike | Trip at a configurable error-rate threshold
Multi-provider fallback      | Reduce single-provider dependency       | Route to a backup model when the primary is unavailable
Cost monitoring and alerting | Detect anomalous API spend              | Budget alerts, per-user cost caps, anomaly detection
Load balancing               | Distribute inference across endpoints   | Round-robin or least-connections across the GPU fleet
Response caching             | Reduce redundant computation            | Cache common query-response pairs
Drift monitoring             | Detect performance degradation          | Continuous evaluation on labeled test sets
Capacity planning            | Ensure sufficient compute headroom      | Load testing, traffic forecasting, auto-scaling
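Of the controls above, the circuit breaker is the one most often hand-rolled. A deliberately simplified sketch that trips after a number of consecutive failures (real breakers typically add a time window and a half-open probing state before fully re-closing):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; callers then skip the backend."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0  # any success closes the breaker again

    def record_failure(self) -> None:
        self.failures += 1
```

While the breaker is open, requests go straight to the fallback path (backup provider, cached response, or degraded mode) instead of piling timeouts onto a struggling backend.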

SLA Considerations

When using third-party AI APIs, the availability you promise customers can't exceed what your AI provider promises you, and each additional serial dependency lowers the achievable ceiling further. Build contracts accordingly:

  • Document AI provider SLA terms
  • Define degraded-service mode when AI is unavailable
  • Test fallback paths regularly
  • Maintain a non-AI fallback for critical workflows
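The arithmetic behind the ceiling: for serially dependent components with (assumed) independent failures, best-case availability is the product of the individual availabilities. For example, a stack targeting 99.9% on top of a provider offering 99.5% can promise at most about 99.4%:

```python
def composite_availability(*availabilities: float) -> float:
    """Best-case availability of serial dependencies, assuming independent failures."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Your stack at 99.9% depending on a provider at 99.5%:
print(round(composite_availability(0.999, 0.995), 4))  # → 0.994
```

This is why a degraded-service mode or non-AI fallback matters: it decouples your customer-facing SLA from any single provider's.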