Engineering
10 min read

Traditional APM Cannot Track AI Errors. Here is What We Built Instead.

Why Sentry and Datadog fail for AI-specific errors like hallucinations, context overflows, and model degradation. Architecture of an AI-native error tracking system.

Transactional Team
Jan 28, 2026

The Error That Was Not an Error

Consider a scenario many teams have encountered: users complain that an AI assistant is giving confidently wrong answers about pricing. Every request returns HTTP 200. Every response is valid JSON. Latency is normal. Token counts are reasonable.

Sentry has zero alerts. Datadog dashboards are green. From the perspective of traditional application monitoring, nothing is wrong.

The problem is that the LLM provider has silently updated model weights. The model's behavior shifts just enough that carefully tuned prompts start producing subtly incorrect outputs. Not errors. Not exceptions. Just wrong answers with high confidence.

This is the fundamental problem with AI error tracking: the most dangerous failures do not throw exceptions.

AI Error Types Traditional APM Misses

Hallucinations: Wrong answers, HTTP 200
Context Overflow: Silent truncation
Model Drift: Gradual degradation
Cost Anomalies: 100x expected spend
Quality Failures: Format/behavioral issues

Why Traditional APM Fails for AI

Application Performance Monitoring tools like Sentry, Datadog, and New Relic are built around a simple model: your code either succeeds or throws an exception. They track stack traces, error rates, and response codes.

AI applications break this model in several ways:

Hallucinations Are Not Exceptions

When a model generates a plausible-sounding but factually incorrect response, no error is thrown. The HTTP response is 200. The JSON is valid. The tokens were counted. By every traditional metric, the request succeeded.

Context Window Overflows Are Silent

When a conversation exceeds the model's context window, most providers silently truncate the oldest messages. Your application never receives an error. It just gets a response based on incomplete context. The user notices when the AI "forgets" something from earlier in the conversation.

Model Degradation Is Gradual

Models change over time. Provider updates, fine-tuning drift, and prompt sensitivity mean that a perfectly working system can gradually degrade without any single point of failure. By the time it is noticeable, weeks of suboptimal responses have been served.

Rate Limits Are Not Bugs

A 429 response from an LLM provider is not a bug in your code. It is a resource constraint. Traditional error tracking treats all 4xx/5xx responses the same way. But a rate limit requires a different response (backoff and retry) than a 400 (fix the request) or a 500 (failover to another provider).
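
In code, that distinction might look something like the sketch below; the action names and classification rules are illustrative, not part of any provider SDK.

type ProviderErrorAction = "backoff_and_retry" | "fix_request" | "failover" | "rethrow";
 
// Sketch: map provider status codes to different handling strategies.
// 429s are resource constraints, other 4xx are our bugs, 5xx are the provider's.
function classifyProviderError(status: number): ProviderErrorAction {
  if (status === 429) return "backoff_and_retry";
  if (status >= 500) return "failover";
  if (status >= 400) return "fix_request";
  return "rethrow";
}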

Cost Overruns Are Not Errors

A request that costs $0.50 instead of the expected $0.005 succeeds by every traditional metric. But it is absolutely a problem. Traditional APM has no concept of cost-per-request as an error signal.

What AI Error Tracking Needs

An AI-native error tracking system needs to track five categories of failure that traditional tools miss.

1. Response Quality Metrics

Instead of just checking whether a response was returned, we evaluate the quality of the response:

interface QualityAssessment {
  // Structural quality
  formatCompliance: boolean;      // Does it match expected format?
  lengthInRange: boolean;         // Within expected length bounds?
  languageCorrect: boolean;       // In the expected language?
 
  // Content quality (for tool-using agents)
  toolCallsValid: boolean;       // Are tool calls well-formed?
  toolParamsValid: boolean;      // Are parameters within allowed ranges?
 
  // Behavioral quality
  refusalDetected: boolean;      // Did the model refuse the request?
  repetitionDetected: boolean;   // Excessive repetition in output?
  truncationDetected: boolean;   // Response cut off mid-sentence?
}

Each response is scored. Scores below a threshold trigger an alert and log the response for review.
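
One way to do that scoring is a weighted sum over the boolean checks above; a minimal sketch, with illustrative weights and threshold:

// Sketch: collapse the checks in QualityAssessment into a single 0-1 score.
// The weights and the 0.8 threshold are illustrative, not tuned values.
function scoreQuality(q: QualityAssessment): number {
  const checks: Array<[boolean, number]> = [
    [q.formatCompliance, 0.25],
    [q.lengthInRange, 0.1],
    [q.languageCorrect, 0.1],
    [q.toolCallsValid, 0.15],
    [q.toolParamsValid, 0.15],
    [!q.refusalDetected, 0.1],
    [!q.repetitionDetected, 0.05],
    [!q.truncationDetected, 0.1],
  ];
  return checks.reduce((score, [passed, weight]) => score + (passed ? weight : 0), 0);
}
 
const QUALITY_ALERT_THRESHOLD = 0.8;  // scores below this are flagged for review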

2. Semantic Drift Detection

Tracking how model outputs change over time for identical inputs is essential. Every week, run a set of canonical prompts through each model and compare the outputs against baseline responses:

interface DriftCheck {
  promptId: string;
  baselineResponse: string;
  currentResponse: string;
  similarityScore: number;       // Embedding similarity
  structuralDiff: string[];      // Key structural differences
  detectedAt: Date;
}
 
async function checkModelDrift(
  model: string,
  canonicalPrompts: CanonicalPrompt[]
): Promise<DriftCheck[]> {
  const results: DriftCheck[] = [];
 
  for (const prompt of canonicalPrompts) {
    const currentResponse = await callModel(model, prompt.messages);
    const similarity = await computeSimilarity(
      prompt.baselineResponse,
      currentResponse
    );
 
    if (similarity < prompt.driftThreshold) {
      results.push({
        promptId: prompt.id,
        baselineResponse: prompt.baselineResponse,
        currentResponse,
        similarityScore: similarity,
        structuralDiff: diffResponses(prompt.baselineResponse, currentResponse),
        detectedAt: new Date(),
      });
    }
  }
 
  return results;
}

When drift exceeds the threshold, we alert the team and provide a side-by-side diff of the baseline and current responses. This type of monitoring can catch model weight update incidents within 24 hours instead of the weeks it typically takes for user complaints to surface them.
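
The similarity measure itself can be as simple as cosine similarity between embeddings of the two responses. One possible shape for computeSimilarity, assuming an embedText helper that wraps whatever embedding model you already use:

declare function embedText(text: string): Promise<number[]>;  // assumed embedding helper
 
// Sketch: cosine similarity between baseline and current response embeddings.
async function computeSimilarity(baseline: string, current: string): Promise<number> {
  const [a, b] = await Promise.all([embedText(baseline), embedText(current)]);
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (normA * normB);
}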

3. Context Window Monitoring

We track context window utilization and detect when conversations are approaching or exceeding limits:

interface ContextWindowMetrics {
  modelMaxTokens: number;
  promptTokens: number;
  utilizationPercent: number;
  estimatedTruncation: boolean;
  truncatedMessageCount: number;
}
 
function assessContextWindow(
  model: string,
  messages: Message[],
  promptTokens: number
): ContextWindowMetrics {
  const maxTokens = getModelContextWindow(model);
  const utilization = promptTokens / maxTokens;
 
  return {
    modelMaxTokens: maxTokens,
    promptTokens,
    utilizationPercent: utilization * 100,
    estimatedTruncation: utilization > 0.95,
    truncatedMessageCount: utilization > 1.0
      ? estimateTruncatedMessages(messages, maxTokens)
      : 0,
  };
}

We alert when context utilization exceeds 85%, giving teams time to implement summarization or message pruning before silent truncation occurs.
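
Acting on that warning can be as simple as folding the oldest turns into a summary before the next call. A rough sketch, reusing the types above and an assumed summarizeMessages helper:

declare function summarizeMessages(messages: Message[]): Promise<Message>;  // assumed helper
 
// Sketch: when utilization crosses the warning threshold, summarize the oldest
// turns into one message and keep the most recent turns verbatim.
async function pruneIfNeeded(
  messages: Message[],
  metrics: ContextWindowMetrics,
  keepRecent = 10
): Promise<Message[]> {
  if (metrics.utilizationPercent < 85 || messages.length <= keepRecent) return messages;
  const summary = await summarizeMessages(messages.slice(0, -keepRecent));
  return [summary, ...messages.slice(-keepRecent)];
}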

4. Cost Anomaly Detection

Every request has an expected cost based on the model and typical token counts. We flag requests that deviate significantly:

interface CostAnomaly {
  requestId: string;
  expectedCost: number;
  actualCost: number;
  deviationFactor: number;
  cause: "excessive_prompt" | "excessive_completion" | "wrong_model" | "retry_storm";
}
 
function detectCostAnomaly(trace: LLMTrace, baseline: CostBaseline): CostAnomaly | null {
  const deviation = trace.totalCostUsd / baseline.expectedCost;
 
  if (deviation > 5) {  // More than 5x expected cost
    return {
      requestId: trace.traceId,
      expectedCost: baseline.expectedCost,
      actualCost: trace.totalCostUsd,
      deviationFactor: deviation,
      cause: diagnoseCause(trace, baseline),
    };
  }
 
  return null;
}
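
How diagnoseCause attributes the overrun depends on what your traces record. A rough sketch, where the retryCount, model, and promptTokens fields on LLMTrace and the expectedModel and expectedPromptTokens fields on CostBaseline are assumptions for illustration:

// Sketch: attribute a cost overrun to its most likely cause.
// Fields beyond those shown earlier (retryCount, expectedModel, ...) are assumed.
function diagnoseCause(trace: LLMTrace, baseline: CostBaseline): CostAnomaly["cause"] {
  if (trace.retryCount > 3) return "retry_storm";
  if (trace.model !== baseline.expectedModel) return "wrong_model";
  if (trace.promptTokens > baseline.expectedPromptTokens * 5) return "excessive_prompt";
  return "excessive_completion";
}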

5. Provider Health Tracking

Maintaining a real-time health score for each provider based on error rates, latency, and response quality is critical:

interface ProviderHealth {
  provider: string;
  healthScore: number;          // 0-100
  errorRate: number;            // Last 15 minutes
  p95Latency: number;           // Last 15 minutes
  rateLimitRate: number;        // 429s as percentage of requests
  qualityScore: number;         // Average response quality
  status: "healthy" | "degraded" | "down";
}

This feeds into failover decisions. If a provider's health score drops below 70, traffic can be automatically routed to backup providers before users notice degradation.
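
The composite score can be a weighted blend of those signals; a minimal sketch, assuming errorRate and rateLimitRate are fractions, p95Latency is in milliseconds, and qualityScore is on a 0-100 scale (the weights and the 70-point cutoff are illustrative):

// Sketch: blend error rate, rate limits, latency, and quality into a 0-100 score.
function computeHealthScore(h: Omit<ProviderHealth, "healthScore" | "status">): number {
  const errorPenalty = h.errorRate * 200;                           // 10% errors costs 20 points
  const rateLimitPenalty = h.rateLimitRate * 100;                   // 10% 429s costs 10 points
  const latencyPenalty = Math.max(0, (h.p95Latency - 2000) / 100);  // penalize p95 above 2s
  const qualityPenalty = (100 - h.qualityScore) * 0.5;
  return Math.max(0, 100 - errorPenalty - rateLimitPenalty - latencyPenalty - qualityPenalty);
}
 
function shouldFailover(health: ProviderHealth): boolean {
  return health.status === "down" || health.healthScore < 70;
}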

The Error Taxonomy

AI errors can be classified into a taxonomy that maps to specific response actions:

Error Class      | Examples                           | Response
Hard failures    | 500, timeout, malformed response   | Retry, then failover
Rate limits      | 429, quota exceeded                | Backoff, queue, or failover
Content failures | Content filter triggered, refusal  | Log, adjust prompt, or flag
Quality failures | Hallucination, drift, wrong format | Alert, review, retrain
Cost failures    | Budget exceeded, anomalous cost    | Throttle, alert, investigate
Context failures | Truncation, lost context           | Summarize, prune, alert

Each class has different alerting thresholds, response automation, and escalation paths.
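
In code, that mapping can be a simple lookup from error class to handling policy; a sketch in which the thresholds and escalation targets are illustrative defaults rather than exact production values:

type ErrorClass =
  | "hard_failure"
  | "rate_limit"
  | "content_failure"
  | "quality_failure"
  | "cost_failure"
  | "context_failure";
 
interface ResponsePolicy {
  autoAction: "retry_then_failover" | "backoff" | "log" | "alert" | "throttle" | "summarize";
  alertThreshold: number;                              // failure rate that triggers an alert
  escalateTo: "on_call" | "team_channel" | "ai_team";
}
 
// Sketch: per-class policies; thresholds and targets here are illustrative.
const POLICIES: Record<ErrorClass, ResponsePolicy> = {
  hard_failure:    { autoAction: "retry_then_failover", alertThreshold: 0.05, escalateTo: "on_call" },
  rate_limit:      { autoAction: "backoff",             alertThreshold: 0.20, escalateTo: "team_channel" },
  content_failure: { autoAction: "log",                 alertThreshold: 0.10, escalateTo: "ai_team" },
  quality_failure: { autoAction: "alert",               alertThreshold: 0.15, escalateTo: "ai_team" },
  cost_failure:    { autoAction: "throttle",            alertThreshold: 0.01, escalateTo: "team_channel" },
  context_failure: { autoAction: "summarize",           alertThreshold: 0.10, escalateTo: "team_channel" },
};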

Alert Design

AI error alerts are different from traditional alerting. A single hallucination is not an incident. A pattern of hallucinations is.

Hard failure rate > 5% over 5 minutes    → Page on-call
Rate limit rate > 20% over 10 minutes    → Alert team channel
Quality score drop > 15% over 1 hour     → Alert AI team
Cost anomaly > 10x expected              → Alert + auto-throttle
Context utilization > 85%                → Warning to team
Model drift detected                     → Alert AI team + block deploy

Using sliding windows and requiring sustained issues before alerting is best practice. A single bad response does not page anyone. Five minutes of degraded quality does.
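
A minimal sliding-window check might look like the sketch below, where the event store and window parameters are illustrative:

interface ErrorEvent {
  timestamp: number;   // ms since epoch
  isFailure: boolean;
}
 
// Sketch: only fire when the failure rate over the whole window exceeds the
// threshold, so a single bad response never pages anyone.
function shouldAlert(
  events: ErrorEvent[],
  windowMs: number,
  threshold: number,
  now = Date.now()
): boolean {
  const recent = events.filter((e) => now - e.timestamp <= windowMs);
  if (recent.length === 0) return false;
  const failureRate = recent.filter((e) => e.isFailure).length / recent.length;
  return failureRate > threshold;
}
 
// Example: hard failure rate > 5% over 5 minutes pages on-call.
// shouldAlert(hardFailureEvents, 5 * 60 * 1000, 0.05)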

Dashboards

An effective AI error tracking dashboard shows:

Health Overview: Traffic light status for each provider and model. Green/yellow/red based on composite health score.

Quality Timeline: Response quality score over time, overlaid with deployment markers. This makes it immediately visible when a deploy or model update causes quality regression.

Error Breakdown: Errors categorized by the taxonomy above, with drill-down to individual requests. Each error includes the full prompt, response, and quality assessment.

Cost Monitor: Real-time spend versus budget, with anomaly markers. Projected monthly spend based on current trajectory.

Lessons Learned

Traditional APM is still necessary. AI error tracking supplements, not replaces, your existing monitoring. You still need to track HTTP errors, server health, and application exceptions.

Baseline quality before you can detect degradation. You need a week of quality data before drift detection is meaningful. Run canonical prompts from day one.

False positive tuning takes time. Initial quality scoring systems often flag 15% or more of legitimate responses as low-quality. It typically takes several iterations to get the false positive rate below 1%.

Users find quality issues before automated systems. Always have a feedback mechanism (thumbs up/down, report wrong answer) that feeds into your quality metrics. Automated detection catches systematic issues; user feedback catches individual failures.

Key Takeaway

The most dangerous AI failures are the ones your monitoring cannot see. If your error tracking only alerts on HTTP errors and exceptions, you are missing hallucinations, context overflow, model drift, and cost anomalies -- the failures that actually impact users.

Build monitoring that understands what AI errors look like. They are not stack traces. They are wrong answers with 200 status codes.

See our approach to AI error tracking on our Error Tracking page.

Written by

Transactional Team

Tags:
error-tracking
ai
monitoring
