Traditional APM Cannot Track AI Errors. Here is What We Built Instead.
Why Sentry and Datadog fail for AI-specific errors like hallucinations, context overflows, and model degradation. Architecture of an AI-native error tracking system.
Transactional Team
Jan 28, 2026
The Error That Was Not an Error
Consider a scenario many teams have encountered: users complain that an AI assistant is giving confidently wrong answers about pricing. Every request returns HTTP 200. Every response is valid JSON. Latency is normal. Token counts are reasonable.
Sentry has zero alerts. Datadog dashboards are green. From the perspective of traditional application monitoring, nothing is wrong.
The problem is that the LLM provider has silently updated model weights. The model's behavior shifts just enough that carefully tuned prompts start producing subtly incorrect outputs. Not errors. Not exceptions. Just wrong answers with high confidence.
This is the fundamental problem with AI error tracking: the most dangerous failures do not throw exceptions.
AI Error Types Traditional APM Misses
Hallucinations: Wrong answers, HTTP 200
Context Overflow: Silent truncation
Model Drift: Gradual degradation
Cost Anomalies: 100x expected spend
Quality Failures: Format/behavioral issues
Why Traditional APM Fails for AI
Application Performance Monitoring tools like Sentry, Datadog, and New Relic are built around a simple model: your code either succeeds or throws an exception. They track stack traces, error rates, and response codes.
AI applications break this model in several ways:
Hallucinations Are Not Exceptions
When a model generates a plausible-sounding but factually incorrect response, no error is thrown. The HTTP response is 200. The JSON is valid. The tokens were counted. By every traditional metric, the request succeeded.
Context Window Overflows Are Silent
When a conversation exceeds the model's context window, most providers silently truncate the oldest messages. Your application never receives an error. It just gets a response based on incomplete context. The user notices when the AI "forgets" something from earlier in the conversation.
Model Degradation Is Gradual
Models change over time. Provider updates, fine-tuning drift, and prompt sensitivity mean that a perfectly working system can gradually degrade without any single point of failure. By the time it is noticeable, weeks of suboptimal responses have been served.
Rate Limits Are Not Bugs
A 429 response from an LLM provider is not a bug in your code. It is a resource constraint. Traditional error tracking treats all 4xx/5xx responses the same way. But a rate limit requires a different response (backoff and retry) than a 400 (fix the request) or a 500 (failover to another provider).
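This distinction can be made explicit in code. A minimal sketch of status-aware recovery routing (the action names and the exact code ranges are illustrative choices, not a specific provider's contract):

```typescript
// Route LLM provider responses to different recovery strategies.
// Traditional APM lumps all 4xx/5xx together; here each class gets its own action.
type RecoveryAction = "retry_with_backoff" | "fix_request" | "failover" | "none";

function recoveryFor(status: number): RecoveryAction {
  if (status === 429) return "retry_with_backoff"; // rate limit: not a bug, back off
  if (status >= 500) return "failover";            // provider fault: try another provider
  if (status >= 400) return "fix_request";         // client error: the request itself is wrong
  return "none";                                   // success: nothing to recover from
}
```

Note the 429 check must come before the generic 4xx branch, since a rate limit is a 4xx status that needs the opposite treatment of a malformed request.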
Cost Overruns Are Not Errors
A request that costs $0.50 instead of the expected $0.005 succeeds by every traditional metric. But it is absolutely a problem. Traditional APM has no concept of cost-per-request as an error signal.
What AI Error Tracking Needs
An AI-native error tracking system needs to track five categories of failure that traditional tools miss.
1. Response Quality Metrics
Instead of just checking whether a response was returned, we evaluate the quality of the response:
```typescript
interface QualityAssessment {
  // Structural quality
  formatCompliance: boolean;    // Does it match expected format?
  lengthInRange: boolean;       // Within expected length bounds?
  languageCorrect: boolean;     // In the expected language?

  // Content quality (for tool-using agents)
  toolCallsValid: boolean;      // Are tool calls well-formed?
  toolParamsValid: boolean;     // Are parameters within allowed ranges?

  // Behavioral quality
  refusalDetected: boolean;     // Did the model refuse the request?
  repetitionDetected: boolean;  // Excessive repetition in output?
  truncationDetected: boolean;  // Response cut off mid-sentence?
}
```
Each response is scored. Scores below a threshold trigger an alert and log the response for review.
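One way to turn those checks into a score is a simple weighted aggregate. A sketch, with the equal weighting, the 25-point penalty per detected problem, and the 70-point review threshold all being illustrative tuning choices:

```typescript
// Repeats the QualityAssessment fields so the sketch is self-contained.
interface QualityAssessment {
  formatCompliance: boolean;
  lengthInRange: boolean;
  languageCorrect: boolean;
  toolCallsValid: boolean;
  toolParamsValid: boolean;
  refusalDetected: boolean;
  repetitionDetected: boolean;
  truncationDetected: boolean;
}

// Score a response 0-100: positive checks add points, detected problems subtract.
function qualityScore(a: QualityAssessment): number {
  const positives = [a.formatCompliance, a.lengthInRange, a.languageCorrect,
                     a.toolCallsValid, a.toolParamsValid];
  const negatives = [a.refusalDetected, a.repetitionDetected, a.truncationDetected];
  const passed = positives.filter(Boolean).length;
  const flagged = negatives.filter(Boolean).length;
  return Math.max(0, Math.round((passed / positives.length) * 100) - flagged * 25);
}

// Below the threshold, log the full response for human review.
const needsReview = (a: QualityAssessment): boolean => qualityScore(a) < 70;
```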
2. Semantic Drift Detection
Tracking how model outputs change over time for identical inputs is essential. Every week, run a set of canonical prompts through each model and compare the outputs against baseline responses:
When drift exceeds the threshold, we alert the team and provide a side-by-side diff of the baseline and current responses. This type of monitoring can catch model weight update incidents within 24 hours instead of the weeks it typically takes for user complaints to surface them.
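The comparison itself can be sketched as cosine similarity between embeddings of the baseline and current responses. This assumes embeddings are computed elsewhere and passed in as vectors; the 0.85 threshold is an illustrative starting point, and any semantic-similarity measure would work:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

interface DriftResult {
  prompt: string;
  similarity: number;  // 1.0 = identical direction, lower = more drift
  drifted: boolean;
}

// Compare this week's response embedding against the stored baseline.
function checkDrift(prompt: string, baseline: number[], current: number[],
                    threshold = 0.85): DriftResult {
  const similarity = cosineSimilarity(baseline, current);
  return { prompt, similarity, drifted: similarity < threshold };
}
```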
3. Context Window Monitoring
We track context window utilization and detect when conversations are approaching or exceeding limits:
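A minimal sketch of such a check. The model names, context limits, and the 4-characters-per-token heuristic are illustrative; a real system should use the provider's tokenizer for accurate counts:

```typescript
// Illustrative model names and context limits.
const CONTEXT_LIMITS: Record<string, number> = {
  "model-small": 8_192,
  "model-large": 128_000,
};

// Rough token estimate (~4 chars per token) summed across the conversation.
function contextUtilization(model: string, messages: string[]): number {
  const estTokens = messages.reduce((sum, m) => sum + Math.ceil(m.length / 4), 0);
  return estTokens / (CONTEXT_LIMITS[model] ?? 8_192);
}

// Above 85% utilization, summarize or prune before the provider truncates silently.
const nearLimit = (model: string, messages: string[]): boolean =>
  contextUtilization(model, messages) > 0.85;
```

Running this check before every request means truncation becomes a warning you act on, rather than a silent behavior you discover from user complaints.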
4. Provider Health Scoring
Maintaining a real-time health score for each provider based on error rates, latency, and response quality is critical:
```typescript
interface ProviderHealth {
  provider: string;
  healthScore: number;    // 0-100
  errorRate: number;      // Last 15 minutes
  p95Latency: number;     // Last 15 minutes
  rateLimitRate: number;  // 429s as percentage of requests
  qualityScore: number;   // Average response quality
  status: "healthy" | "degraded" | "down";
}
```
This feeds into failover decisions. If a provider's health score drops below 70, traffic can be automatically routed to backup providers before users notice degradation.
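The routing decision can be sketched as picking the healthiest eligible provider. This uses a trimmed copy of the ProviderHealth fields so the example is self-contained; the threshold of 70 comes from the text above, and the tie-breaking by score is an illustrative choice:

```typescript
// Trimmed subset of the ProviderHealth fields this sketch needs.
interface ProviderHealthSummary {
  provider: string;
  healthScore: number;  // 0-100
  status: "healthy" | "degraded" | "down";
}

// Route to the highest-scoring provider that clears the failover threshold.
// Returns null when no provider is eligible (circuit-break / shed load).
function pickProvider(candidates: ProviderHealthSummary[]): string | null {
  const eligible = candidates
    .filter(p => p.healthScore >= 70 && p.status !== "down")
    .sort((a, b) => b.healthScore - a.healthScore);
  return eligible.length > 0 ? eligible[0].provider : null;
}
```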
The Error Taxonomy
AI errors can be classified into a taxonomy that maps to specific response actions:
| Error Class | Examples | Response |
| --- | --- | --- |
| Hard failures | 500, timeout, malformed response | Retry, then failover |
| Rate limits | 429, quota exceeded | Backoff, queue, or failover |
| Content failures | Content filter triggered, refusal | Log, adjust prompt, or flag |
| Quality failures | Hallucination, drift, wrong format | Alert, review, retrain |
| Cost failures | Budget exceeded, anomalous cost | Throttle, alert, investigate |
| Context failures | Truncation, lost context | Summarize, prune, alert |
Each class has different alerting thresholds, response automation, and escalation paths.
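A classifier over raw request signals can feed this taxonomy. A sketch, where the field names, the 10x cost multiplier, and the quality threshold of 70 are illustrative assumptions rather than fixed rules:

```typescript
type ErrorClass = "hard_failure" | "rate_limit" | "content_failure"
                | "quality_failure" | "cost_failure" | "context_failure";

// Illustrative bundle of signals collected per request.
interface RequestOutcome {
  status: number;          // HTTP status from the provider
  contentFiltered: boolean;
  truncated: boolean;
  costUsd: number;
  expectedCostUsd: number;
  qualityScore: number;    // 0-100, from the quality checks above
}

// Map signals to the taxonomy; order matters (hard signals before soft ones).
// Returns null when the request falls into no error class.
function classify(o: RequestOutcome): ErrorClass | null {
  if (o.status === 429) return "rate_limit";
  if (o.status >= 500) return "hard_failure";
  if (o.contentFiltered) return "content_failure";
  if (o.truncated) return "context_failure";
  if (o.costUsd > o.expectedCostUsd * 10) return "cost_failure";
  if (o.qualityScore < 70) return "quality_failure";
  return null;
}
```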
Alert Design
AI error alerts are different from traditional alerting. A single hallucination is not an incident. A pattern of hallucinations is.
Hard failure rate > 5% over 5 minutes → Page on-call
Rate limit rate > 20% over 10 minutes → Alert team channel
Quality score drop > 15% over 1 hour → Alert AI team
Cost anomaly > 10x expected → Alert + auto-throttle
Context utilization > 85% → Warning to team
Model drift detected → Alert AI team + block deploy
Using sliding windows and requiring sustained issues before alerting is best practice. A single bad response does not page anyone. Five minutes of degraded quality does.
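A sliding-window gate along those lines might look like this. The window length, minimum sample count, and the "every sample breached" rule are illustrative; production systems often use rate-over-window instead:

```typescript
// Fire only when a breach is sustained across a sliding time window,
// so a single bad sample never pages anyone.
class SlidingWindowAlert {
  private samples: { at: number; breached: boolean }[] = [];

  constructor(private windowMs: number, private minSamples: number) {}

  // Record one observation; returns true when the alert should fire.
  record(breached: boolean, now: number = Date.now()): boolean {
    this.samples.push({ at: now, breached });
    // Drop samples that have aged out of the window.
    this.samples = this.samples.filter(s => now - s.at <= this.windowMs);
    return this.samples.length >= this.minSamples &&
           this.samples.every(s => s.breached);
  }
}
```

With a five-minute window and a minimum of three samples, one degraded response is absorbed silently, while a sustained run of them trips the alert.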
Dashboards
An effective AI error tracking dashboard shows:
Health Overview: Traffic light status for each provider and model. Green/yellow/red based on composite health score.
Quality Timeline: Response quality score over time, overlaid with deployment markers. This makes it immediately visible when a deploy or model update causes quality regression.
Error Breakdown: Errors categorized by the taxonomy above, with drill-down to individual requests. Each error includes the full prompt, response, and quality assessment.
Cost Monitor: Real-time spend versus budget, with anomaly markers. Projected monthly spend based on current trajectory.
Lessons Learned
Traditional APM is still necessary. AI error tracking supplements, not replaces, your existing monitoring. You still need to track HTTP errors, server health, and application exceptions.
Baseline quality before you can detect degradation. You need a week of quality data before drift detection is meaningful. Run canonical prompts from day one.
False positive tuning takes time. Initial quality scoring systems often flag 15% or more of legitimate responses as low-quality. It typically takes several iterations to get the false positive rate below 1%.
Users find quality issues before automated systems. Always have a feedback mechanism (thumbs up/down, report wrong answer) that feeds into your quality metrics. Automated detection catches systematic issues; user feedback catches individual failures.
Key Takeaway
The most dangerous AI failures are the ones your monitoring cannot see. If your error tracking only alerts on HTTP errors and exceptions, you are missing hallucinations, context overflow, model drift, and cost anomalies -- the failures that actually impact users.
Build monitoring that understands what AI errors look like. They are not stack traces. They are wrong answers with 200 status codes.
See our approach in practice on our Error Tracking page.