# Model Monitoring & Drift Detection

## What to Monitor
| Category | Metrics | Why |
|---|---|---|
| Performance | Accuracy, latency, error rate, throughput | Detect degradation before users notice |
| Data drift | Input feature distributions, token distributions | World changes → model gets stale |
| Output drift | Response length distribution, sentiment, refusal rate | Model behavior shifting over time |
| Safety | Toxicity rate, PII in outputs, jailbreak success rate | Safety guardrails weakening |
| Cost | Tokens per request, GPU utilization, API spend | Budget anomalies often signal abuse or extraction |
| Operational | Uptime, queue depth, timeout rate | Infrastructure health |
## Drift Detection Methods
**Statistical tests:** Compare current input/output distributions against a reference baseline using the Kolmogorov–Smirnov (KS) test, the Population Stability Index (PSI), or Jensen–Shannon divergence.
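As one concrete illustration, PSI can be computed from binned frequencies of the reference and current samples. The binning scheme and the drift thresholds in the comment are conventional choices, not prescribed by this document:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a
    current sample. Common rule of thumb: < 0.1 stable, 0.1-0.25
    moderate drift, > 0.25 significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # floor at a small epsilon so empty bins don't blow up the log
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions yield a PSI of zero; a shifted current sample drives it well above the 0.25 "significant drift" level.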
**Performance benchmarks:** Run a fixed evaluation set on a schedule. If accuracy drops below a threshold, trigger an alert.
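A minimal sketch of a scheduled benchmark run, assuming a `predict` callable, an eval set of `(input, expected)` pairs, and an illustrative accuracy threshold:

```python
def benchmark_accuracy(predict, eval_set):
    """Fraction of fixed eval-set examples the model answers exactly."""
    correct = sum(1 for x, y in eval_set if predict(x) == y)
    return correct / len(eval_set)

def should_alert(accuracy, threshold=0.85):
    # threshold is an illustrative choice, tune per model/baseline
    return accuracy < threshold
```

In production this would be driven by a scheduler (cron, Airflow, etc.) and the result pushed to your alerting system.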
**Canary queries:** Periodically send known-answer queries and verify the responses. This functions as a health check for model quality.
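A canary check can be as simple as a pass rate over known-answer prompts. Here `query_model` is a placeholder for whatever client calls the deployed model, and the canary pairs are illustrative:

```python
# Hypothetical known-answer canary set; substitute domain-relevant queries.
CANARIES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def run_canaries(query_model, canaries=CANARIES):
    """Return the fraction of canary queries whose expected answer
    appears in the model's response (case-insensitive substring)."""
    passed = sum(
        1 for prompt, expected in canaries
        if expected.lower() in query_model(prompt).lower()
    )
    return passed / len(canaries)
```

A pass rate below 1.0 on queries the model previously answered correctly is a strong signal of regression.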
**Human evaluation sampling:** Randomly sample a percentage of production outputs for human review, and track quality scores over time.
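The sampling step itself is straightforward; a sketch with a seedable RNG so the sample is reproducible (function name and default rate are illustrative):

```python
import random

def sample_for_review(records, rate=0.01, seed=None):
    """Randomly select roughly `rate` (e.g. 0.01 = 1%) of production
    records for human review; pass a seed for reproducibility."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]
```

The sampled records would then be routed to a labeling queue, with reviewer scores aggregated into a per-week quality trend.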
## Alerting Thresholds
| Condition | Action |
|---|---|
| Accuracy drops >5% from baseline | Alert — investigate |
| Latency p99 exceeds 2x normal | Alert — check GPU health |
| PII detection rate spikes | Critical alert — potential data leakage |
| Refusal rate drops significantly | Alert — safety guardrails may be degraded |
| API cost exceeds daily budget by 2x | Alert — possible extraction or abuse |
| Error rate exceeds 5% | Alert — infrastructure issue |
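Several of the threshold rules above can be sketched as a single check over a metrics snapshot. The field names and baseline structure are assumptions for illustration, not a fixed schema:

```python
def check_alerts(m, baseline):
    """Evaluate a metrics snapshot `m` against baseline values and
    return the list of triggered alert messages."""
    alerts = []
    if m["accuracy"] < baseline["accuracy"] * 0.95:  # >5% drop
        alerts.append("accuracy dropped >5% from baseline")
    if m["latency_p99"] > baseline["latency_p99"] * 2:
        alerts.append("p99 latency exceeds 2x normal")
    if m["daily_cost"] > baseline["daily_budget"] * 2:
        alerts.append("API cost exceeds daily budget by 2x")
    if m["error_rate"] > 0.05:
        alerts.append("error rate exceeds 5%")
    return alerts
```

Rate-based conditions like the PII spike or refusal-rate drop need a rolling baseline rather than a fixed one, so they are omitted from this sketch.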
## Tools
| Tool | Purpose |
|---|---|
| Evidently AI | Open-source ML monitoring, drift detection |
| Arize | ML observability platform |
| WhyLabs | Data and model monitoring |
| Fiddler AI | Model performance management |
| Custom Prometheus/Grafana | Build your own with standard observability stack |