Engineering
10 min read

Traditional APM Cannot Track AI Errors. Here is What We Built Instead.

Why Sentry and Datadog fail for AI-specific errors like hallucinations, context overflows, and model degradation. Architecture of an AI-native error tracking system.

Transactional Team
Jan 28, 2026

The Error That Was Not an Error

Consider a scenario many teams have encountered: users complain that an AI assistant is giving confidently wrong answers about pricing. Every request returns HTTP 200. Every response is valid JSON. Latency is normal. Token counts are reasonable.

Sentry has zero alerts. Datadog dashboards are green. From the perspective of traditional application monitoring, nothing is wrong.

The problem is that the LLM provider has silently updated model weights. The model's behavior shifts just enough that carefully tuned prompts start producing subtly incorrect outputs. Not errors. Not exceptions. Just wrong answers with high confidence.

This is the fundamental problem with AI error tracking: the most dangerous failures do not throw exceptions.

AI Error Types Traditional APM Misses

Hallucinations: Wrong answers, HTTP 200
Context Overflow: Silent truncation
Model Drift: Gradual degradation
Cost Anomalies: 100x expected spend
Quality Failures: Format/behavioral issues

Why Traditional APM Fails for AI

Application Performance Monitoring tools like Sentry, Datadog, and New Relic are built around a simple model: your code either succeeds or throws an exception. They track stack traces, error rates, and response codes.

AI applications break this model in several ways:

Hallucinations Are Not Exceptions

When a model generates a plausible-sounding but factually incorrect response, no error is thrown. The HTTP response is 200. The JSON is valid. The tokens were counted. By every traditional metric, the request succeeded.

Context Window Overflows Are Silent

When a conversation exceeds the model's context window, most providers silently truncate the oldest messages. Your application never receives an error. It just gets a response based on incomplete context. The user notices when the AI "forgets" something from earlier in the conversation.

Model Degradation Is Gradual

Models change over time. Provider updates, fine-tuning drift, and prompt sensitivity mean that a perfectly working system can gradually degrade without any single point of failure. By the time it is noticeable, weeks of suboptimal responses have been served.

Rate Limits Are Not Bugs

A 429 response from an LLM provider is not a bug in your code. It is a resource constraint. Traditional error tracking treats all 4xx/5xx responses the same way. But a rate limit requires a different response (backoff and retry) than a 400 (fix the request) or a 500 (failover to another provider).
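
In code, that distinction might look something like the sketch below; the action names and classification rules are illustrative, not part of any provider SDK.

type ProviderErrorAction = "backoff_and_retry" | "fix_request" | "failover" | "rethrow";
 
// Sketch: map provider status codes to different handling strategies.
// 429s are resource constraints, other 4xx are our bugs, 5xx are the provider's.
function classifyProviderError(status: number): ProviderErrorAction {
  if (status === 429) return "backoff_and_retry";
  if (status >= 500) return "failover";
  if (status >= 400) return "fix_request";
  return "rethrow";
}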

Cost Overruns Are Not Errors

A request that costs $0.50 instead of the expected $0.005 succeeds by every traditional metric. But it is absolutely a problem. Traditional APM has no concept of cost-per-request as an error signal.

What AI Error Tracking Needs

An AI-native error tracking system needs to track five categories of failure that traditional tools miss.

1. Response Quality Metrics

Instead of just checking whether a response was returned, we evaluate the quality of the response:

interface QualityAssessment {
  // Structural quality
  formatCompliance: boolean;      // Does it match expected format?
  lengthInRange: boolean;         // Within expected length bounds?
  languageCorrect: boolean;       // In the expected language?
 
  // Content quality (for tool-using agents)
  toolCallsValid: boolean;       // Are tool calls well-formed?
  toolParamsValid: boolean;      // Are parameters within allowed ranges?
 
  // Behavioral quality
  refusalDetected: boolean;      // Did the model refuse the request?
  repetitionDetected: boolean;   // Excessive repetition in output?
  truncationDetected: boolean;   // Response cut off mid-sentence?
}

Each response is scored. Scores below a threshold trigger an alert and log the response for review.
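
One way to do that scoring is a weighted sum over the boolean checks above; a minimal sketch, with illustrative weights and threshold:

// Sketch: collapse the checks in QualityAssessment into a single 0-1 score.
// The weights and the 0.8 threshold are illustrative, not tuned values.
function scoreQuality(q: QualityAssessment): number {
  const checks: Array<[boolean, number]> = [
    [q.formatCompliance, 0.25],
    [q.lengthInRange, 0.1],
    [q.languageCorrect, 0.1],
    [q.toolCallsValid, 0.15],
    [q.toolParamsValid, 0.15],
    [!q.refusalDetected, 0.1],
    [!q.repetitionDetected, 0.05],
    [!q.truncationDetected, 0.1],
  ];
  return checks.reduce((score, [passed, weight]) => score + (passed ? weight : 0), 0);
}
 
const QUALITY_ALERT_THRESHOLD = 0.8;  // scores below this are flagged for review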

2. Semantic Drift Detection

Tracking how model outputs change over time for identical inputs is essential. Every week, run a set of canonical prompts through each model and compare the outputs against baseline responses:

interface DriftCheck {
  promptId: string;
  baselineResponse: string;
  currentResponse: string;
  similarityScore: number;       // Embedding similarity
  structuralDiff: string[];      // Key structural differences
  detectedAt: Date;
}
 
async function checkModelDrift(
  model: string,
  canonicalPrompts: CanonicalPrompt[]
): Promise<DriftCheck[]> {
  const results: DriftCheck[] = [];
 
  for (const prompt of canonicalPrompts) {
    const currentResponse = await callModel(model, prompt.messages);
    const similarity = await computeSimilarity(
      prompt.baselineResponse,
      currentResponse
    );
 
    if (similarity < prompt.driftThreshold) {
      results.push({
        promptId: prompt.id,
        baselineResponse: prompt.baselineResponse,
        currentResponse,
        similarityScore: similarity,
        structuralDiff: diffResponses(prompt.baselineResponse, currentResponse),
        detectedAt: new Date(),
      });
    }
  }
 
  return results;
}

When drift exceeds the threshold, we alert the team and provide a side-by-side diff of the baseline and current responses. This type of monitoring can catch model weight update incidents within 24 hours instead of the weeks it typically takes for user complaints to surface them.
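
The similarity measure itself can be as simple as cosine similarity between embeddings of the two responses. One possible shape for computeSimilarity, assuming an embedText helper that wraps whatever embedding model you already use:

declare function embedText(text: string): Promise<number[]>;  // assumed embedding helper
 
// Sketch: cosine similarity between baseline and current response embeddings.
async function computeSimilarity(baseline: string, current: string): Promise<number> {
  const [a, b] = await Promise.all([embedText(baseline), embedText(current)]);
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (normA * normB);
}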

3. Context Window Monitoring

We track context window utilization and detect when conversations are approaching or exceeding limits:

interface ContextWindowMetrics {
  modelMaxTokens: number;
  promptTokens: number;
  utilizationPercent: number;
  estimatedTruncation: boolean;
  truncatedMessageCount: number;
}
 
function assessContextWindow(
  model: string,
  messages: Message[],
  promptTokens: number
): ContextWindowMetrics {
  const maxTokens = getModelContextWindow(model);
  const utilization = promptTokens / maxTokens;
 
  return {
    modelMaxTokens: maxTokens,
    promptTokens,
    utilizationPercent: utilization * 100,
    estimatedTruncation: utilization > 0.95,
    truncatedMessageCount: utilization > 1.0
      ? estimateTruncatedMessages(messages, maxTokens)
      : 0,
  };
}

We alert when context utilization exceeds 85%, giving teams time to implement summarization or message pruning before silent truncation occurs.
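
Acting on that warning can be as simple as folding the oldest turns into a summary before the next call. A rough sketch, reusing the types above and an assumed summarizeMessages helper:

declare function summarizeMessages(messages: Message[]): Promise<Message>;  // assumed helper
 
// Sketch: when utilization crosses the warning threshold, summarize the oldest
// turns into one message and keep the most recent turns verbatim.
async function pruneIfNeeded(
  messages: Message[],
  metrics: ContextWindowMetrics,
  keepRecent = 10
): Promise<Message[]> {
  if (metrics.utilizationPercent < 85 || messages.length <= keepRecent) return messages;
  const summary = await summarizeMessages(messages.slice(0, -keepRecent));
  return [summary, ...messages.slice(-keepRecent)];
}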

4. Cost Anomaly Detection

Every request has an expected cost based on the model and typical token counts. We flag requests that deviate significantly:

interface CostAnomaly {
  requestId: string;
  expectedCost: number;
  actualCost: number;
  deviationFactor: number;
  cause: "excessive_prompt" | "excessive_completion" | "wrong_model" | "retry_storm";
}
 
function detectCostAnomaly(trace: LLMTrace, baseline: CostBaseline): CostAnomaly | null {
  const deviation = trace.totalCostUsd / baseline.expectedCost;
 
  if (deviation > 5) {  // More than 5x expected cost
    return {
      requestId: trace.traceId,
      expectedCost: baseline.expectedCost,
      actualCost: trace.totalCostUsd,
      deviationFactor: deviation,
      cause: diagnoseCause(trace, baseline),
    };
  }
 
  return null;
}
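
How diagnoseCause attributes the overrun depends on what your traces record. A rough sketch, where the retryCount, model, and promptTokens fields on LLMTrace and the expectedModel and expectedPromptTokens fields on CostBaseline are assumptions for illustration:

// Sketch: attribute a cost overrun to its most likely cause.
// Fields beyond those shown earlier (retryCount, expectedModel, ...) are assumed.
function diagnoseCause(trace: LLMTrace, baseline: CostBaseline): CostAnomaly["cause"] {
  if (trace.retryCount > 3) return "retry_storm";
  if (trace.model !== baseline.expectedModel) return "wrong_model";
  if (trace.promptTokens > baseline.expectedPromptTokens * 5) return "excessive_prompt";
  return "excessive_completion";
}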

5. Provider Health Tracking

Maintaining a real-time health score for each provider based on error rates, latency, and response quality is critical:

interface ProviderHealth {
  provider: string;
  healthScore: number;          // 0-100
  errorRate: number;            // Last 15 minutes
  p95Latency: number;           // Last 15 minutes
  rateLimitRate: number;        // 429s as percentage of requests
  qualityScore: number;         // Average response quality
  status: "healthy" | "degraded" | "down";
}

This feeds into failover decisions. If a provider's health score drops below 70, traffic can be automatically routed to backup providers before users notice degradation.
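
The composite score can be a weighted blend of those signals; a minimal sketch, assuming errorRate and rateLimitRate are fractions, p95Latency is in milliseconds, and qualityScore is on a 0-100 scale (the weights and the 70-point cutoff are illustrative):

// Sketch: blend error rate, rate limits, latency, and quality into a 0-100 score.
function computeHealthScore(h: Omit<ProviderHealth, "healthScore" | "status">): number {
  const errorPenalty = h.errorRate * 200;                           // 10% errors costs 20 points
  const rateLimitPenalty = h.rateLimitRate * 100;                   // 10% 429s costs 10 points
  const latencyPenalty = Math.max(0, (h.p95Latency - 2000) / 100);  // penalize p95 above 2s
  const qualityPenalty = (100 - h.qualityScore) * 0.5;
  return Math.max(0, 100 - errorPenalty - rateLimitPenalty - latencyPenalty - qualityPenalty);
}
 
function shouldFailover(health: ProviderHealth): boolean {
  return health.status === "down" || health.healthScore < 70;
}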

The Error Taxonomy

AI errors can be classified into a taxonomy that maps to specific response actions:

Error Class      | Examples                           | Response
Hard failures    | 500, timeout, malformed response   | Retry, then failover
Rate limits      | 429, quota exceeded                | Backoff, queue, or failover
Content failures | Content filter triggered, refusal  | Log, adjust prompt, or flag
Quality failures | Hallucination, drift, wrong format | Alert, review, retrain
Cost failures    | Budget exceeded, anomalous cost    | Throttle, alert, investigate
Context failures | Truncation, lost context           | Summarize, prune, alert

Each class has different alerting thresholds, response automation, and escalation paths.
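
In code, that mapping can be a simple lookup from error class to handling policy; a sketch in which the thresholds and escalation targets are illustrative defaults rather than exact production values:

type ErrorClass =
  | "hard_failure"
  | "rate_limit"
  | "content_failure"
  | "quality_failure"
  | "cost_failure"
  | "context_failure";
 
interface ResponsePolicy {
  autoAction: "retry_then_failover" | "backoff" | "log" | "alert" | "throttle" | "summarize";
  alertThreshold: number;                              // failure rate that triggers an alert
  escalateTo: "on_call" | "team_channel" | "ai_team";
}
 
// Sketch: per-class policies; thresholds and targets here are illustrative.
const POLICIES: Record<ErrorClass, ResponsePolicy> = {
  hard_failure:    { autoAction: "retry_then_failover", alertThreshold: 0.05, escalateTo: "on_call" },
  rate_limit:      { autoAction: "backoff",             alertThreshold: 0.20, escalateTo: "team_channel" },
  content_failure: { autoAction: "log",                 alertThreshold: 0.10, escalateTo: "ai_team" },
  quality_failure: { autoAction: "alert",               alertThreshold: 0.15, escalateTo: "ai_team" },
  cost_failure:    { autoAction: "throttle",            alertThreshold: 0.01, escalateTo: "team_channel" },
  context_failure: { autoAction: "summarize",           alertThreshold: 0.10, escalateTo: "team_channel" },
};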

Alert Design

AI error alerts are different from traditional alerting. A single hallucination is not an incident. A pattern of hallucinations is.

Hard failure rate > 5% over 5 minutes    → Page on-call
Rate limit rate > 20% over 10 minutes    → Alert team channel
Quality score drop > 15% over 1 hour     → Alert AI team
Cost anomaly > 10x expected              → Alert + auto-throttle
Context utilization > 85%                → Warning to team
Model drift detected                     → Alert AI team + block deploy

Using sliding windows and requiring sustained issues before alerting is best practice. A single bad response does not page anyone. Five minutes of degraded quality does.
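
A minimal sliding-window check might look like the sketch below, where the event store and window parameters are illustrative:

interface ErrorEvent {
  timestamp: number;   // ms since epoch
  isFailure: boolean;
}
 
// Sketch: only fire when the failure rate over the whole window exceeds the
// threshold, so a single bad response never pages anyone.
function shouldAlert(
  events: ErrorEvent[],
  windowMs: number,
  threshold: number,
  now = Date.now()
): boolean {
  const recent = events.filter((e) => now - e.timestamp <= windowMs);
  if (recent.length === 0) return false;
  const failureRate = recent.filter((e) => e.isFailure).length / recent.length;
  return failureRate > threshold;
}
 
// Example: hard failure rate > 5% over 5 minutes pages on-call.
// shouldAlert(hardFailureEvents, 5 * 60 * 1000, 0.05)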

Dashboards

An effective AI error tracking dashboard shows:

Health Overview: Traffic light status for each provider and model. Green/yellow/red based on composite health score.

Quality Timeline: Response quality score over time, overlaid with deployment markers. This makes it immediately visible when a deploy or model update causes quality regression.

Error Breakdown: Errors categorized by the taxonomy above, with drill-down to individual requests. Each error includes the full prompt, response, and quality assessment.

Cost Monitor: Real-time spend versus budget, with anomaly markers. Projected monthly spend based on current trajectory.

Lessons Learned

Traditional APM is still necessary. AI error tracking supplements, not replaces, your existing monitoring. You still need to track HTTP errors, server health, and application exceptions.

Baseline quality before you can detect degradation. You need a week of quality data before drift detection is meaningful. Run canonical prompts from day one.

False positive tuning takes time. Initial quality scoring systems often flag 15% or more of legitimate responses as low-quality. It typically takes several iterations to get the false positive rate below 1%.

Users find quality issues before automated systems. Always have a feedback mechanism (thumbs up/down, report wrong answer) that feeds into your quality metrics. Automated detection catches systematic issues; user feedback catches individual failures.

Key Takeaway

The most dangerous AI failures are the ones your monitoring cannot see. If your error tracking only alerts on HTTP errors and exceptions, you are missing hallucinations, context overflow, model drift, and cost anomalies -- the failures that actually impact users.

Build monitoring that understands what AI errors look like. They are not stack traces. They are wrong answers with 200 status codes.

See our approach to AI error tracking on our Error Tracking page.

Written by

Transactional Team

Tags:
error-tracking
ai
monitoring
