Your AI Agent Will Crash in Production. Plan for It.
Common AI agent failure modes and how to handle them: tool execution failures, context window overflow, infinite loops, and hallucinated function calls. Production-ready error patterns with code.
Transactional Team
Mar 3, 2026
10 min read
Consider a common production scenario: an AI agent works perfectly in testing, then crashes within hours of deployment.
The failure is simple: the agent calls a tool that returns a 50KB JSON response. The agent includes that response in its next prompt, which pushes the conversation past the context window limit. The API returns a 400 error. The agent's error handler retries the same request. Same error. Infinite loop. Dead.
AI agents fail in ways traditional software does not. The failure modes are novel, the debugging is harder, and the consequences range from wasted money to incorrect actions taken on behalf of users. Here is how to plan for it.
What You Will Learn
The five most common AI agent failure modes
Error boundaries that prevent cascading failures
Graceful degradation patterns
Logging and alerting for agent-specific issues
Production-ready code for each pattern
AI Agent Failure Modes in Production
Tool Execution Failures: 38%
Context Window Overflow: 24%
Infinite Loops: 18%
Hallucinated Calls: 12%
Cascading Failures: 8%
Failure Mode 1: Tool Execution Failures
The agent decides to call a tool. The tool fails. Now what?
Most agent frameworks just pass the error back to the LLM and hope it figures out what to do. Sometimes it does. Often it retries the exact same failing call, or hallucinates a workaround that makes things worse.
```typescript
// tool-executor.ts

interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

class SafeToolExecutor {
  private maxToolRetries = 2;

  async executeTool(
    toolName: string,
    args: Record<string, unknown>
  ): Promise<ToolResult> {
    for (let attempt = 0; attempt <= this.maxToolRetries; attempt++) {
      try {
        const result = await this.runTool(toolName, args);

        // Validate the result is not too large for context
        const resultSize = JSON.stringify(result).length;
        if (resultSize > 10_000) {
          return {
            success: true,
            data: this.truncateResult(result, 10_000),
          };
        }

        return { success: true, data: result };
      } catch (error) {
        const err = error as Error;

        // Non-retryable errors
        if (this.isNonRetryable(err)) {
          return {
            success: false,
            error: `Tool "${toolName}" failed: ${err.message}. Do not retry this tool.`,
          };
        }

        // Last attempt
        if (attempt === this.maxToolRetries) {
          return {
            success: false,
            error: `Tool "${toolName}" failed after ${attempt + 1} attempts: ${err.message}. Try a different approach.`,
          };
        }

        // Wait before retry
        await new Promise((r) => setTimeout(r, 1000 * (attempt + 1)));
      }
    }

    return { success: false, error: 'Unexpected execution path' };
  }

  private isNonRetryable(error: Error): boolean {
    return (
      error.message.includes('not found') ||
      error.message.includes('permission denied') ||
      error.message.includes('invalid argument') ||
      error.message.includes('401') ||
      error.message.includes('403') ||
      error.message.includes('404')
    );
  }

  private truncateResult(result: unknown, maxLength: number): unknown {
    const json = JSON.stringify(result);
    if (json.length <= maxLength) return result;

    // For arrays, return first N items
    if (Array.isArray(result)) {
      const truncated = [];
      let currentLength = 2; // []
      for (const item of result) {
        const itemJson = JSON.stringify(item);
        if (currentLength + itemJson.length + 1 > maxLength) break;
        truncated.push(item);
        currentLength += itemJson.length + 1; // +1 for comma
      }
      return truncated;
    }

    // For objects, return a summary
    return {
      _truncated: true,
      _originalSize: json.length,
      _preview: json.substring(0, maxLength),
    };
  }

  private async runTool(
    name: string,
    args: Record<string, unknown>
  ): Promise<unknown> {
    // Your actual tool execution logic
    throw new Error('Not implemented');
  }
}
```
The key decisions: truncate large results before they hit the context window, distinguish retryable from non-retryable errors, and give the LLM clear instructions about what happened.
Failure Mode 2: Context Window Overflow
Every LLM has a context limit. As conversations grow, tool results accumulate, and the context fills up. When you exceed the limit, the API rejects the request.
```typescript
// context-manager.ts

interface Message {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  tokens?: number;
}

class ContextManager {
  private maxTokens: number;
  private reservedForResponse: number;
  private messages: Message[] = [];

  constructor(
    maxContextTokens: number,
    reservedForResponse: number = 4096
  ) {
    this.maxTokens = maxContextTokens;
    this.reservedForResponse = reservedForResponse;
  }

  addMessage(message: Message): void {
    message.tokens = this.estimateTokens(message.content);
    this.messages.push(message);
    this.pruneIfNeeded();
  }

  getMessages(): Message[] {
    return this.messages;
  }

  private pruneIfNeeded(): void {
    const totalTokens = this.getTotalTokens();
    const available = this.maxTokens - this.reservedForResponse;
    if (totalTokens <= available) return;

    // Strategy: keep system messages + last N messages,
    // and summarize removed messages
    const systemMessages = this.messages.filter((m) => m.role === 'system');
    const otherMessages = this.messages.filter((m) => m.role !== 'system');

    // Remove oldest non-system messages until under limit
    let currentTokens = systemMessages.reduce(
      (sum, m) => sum + (m.tokens ?? 0),
      0
    );
    const kept: Message[] = [];

    // Work backwards from most recent
    for (let i = otherMessages.length - 1; i >= 0; i--) {
      const msg = otherMessages[i];
      if (currentTokens + (msg.tokens ?? 0) > available) break;
      kept.unshift(msg);
      currentTokens += msg.tokens ?? 0;
    }

    // Add a summary of dropped messages
    const droppedCount = otherMessages.length - kept.length;
    if (droppedCount > 0) {
      const summaryMsg: Message = {
        role: 'system',
        content: `[${droppedCount} earlier messages were pruned from context to stay within limits. The conversation continues from the remaining messages.]`,
      };
      summaryMsg.tokens = this.estimateTokens(summaryMsg.content);
      this.messages = [...systemMessages, summaryMsg, ...kept];
    } else {
      this.messages = [...systemMessages, ...kept];
    }
  }

  private getTotalTokens(): number {
    return this.messages.reduce((sum, m) => sum + (m.tokens ?? 0), 0);
  }

  private estimateTokens(text: string): number {
    // Rough estimate: 1 token per 4 characters
    return Math.ceil(text.length / 4);
  }
}
```
Failure Mode 3: Infinite Loops
The agent gets stuck in a loop: call tool, get result, decide to call the same tool again with the same arguments. This burns tokens and money.
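The simplest defense is to fingerprint each tool call and stop when the agent repeats itself. A minimal sketch of such a detector (the `LoopDetector` class name and its thresholds are illustrative, not from any particular framework):

```typescript
// loop-detector.ts

class LoopDetector {
  private history: string[] = [];

  constructor(
    private maxRepeats = 3, // identical calls allowed before stopping
    private maxTotalCalls = 25 // hard cap on tool calls per task
  ) {}

  // Returns an error string if the agent should stop, null otherwise.
  check(toolName: string, args: Record<string, unknown>): string | null {
    // Fingerprint the call: same tool + same arguments = same signature
    const signature = `${toolName}:${JSON.stringify(args)}`;
    this.history.push(signature);

    if (this.history.length > this.maxTotalCalls) {
      return `Exceeded ${this.maxTotalCalls} tool calls for this task. Stopping.`;
    }

    const repeats = this.history.filter((s) => s === signature).length;
    if (repeats >= this.maxRepeats) {
      return `Tool "${toolName}" was called ${repeats} times with identical arguments. Stopping to avoid a loop.`;
    }

    return null;
  }
}
```

When `check` returns a message, feed it to the LLM as a final observation or abort the run outright; either way the loop is broken before it burns the budget.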
Failure Mode 4: Hallucinated Function Calls
The agent calls a tool that does not exist, or passes arguments in the wrong format. This happens more often than you would think, especially with complex tool schemas.
```typescript
// tool-validator.ts

interface ToolSchema {
  name: string;
  parameters: Record<
    string,
    { type: string; required?: boolean; enum?: string[] }
  >;
}

class ToolValidator {
  private schemas: Map<string, ToolSchema>;

  constructor(schemas: ToolSchema[]) {
    this.schemas = new Map(schemas.map((s) => [s.name, s]));
  }

  validate(
    toolName: string,
    args: Record<string, unknown>
  ): { valid: boolean; error?: string } {
    const schema = this.schemas.get(toolName);

    if (!schema) {
      const available = Array.from(this.schemas.keys()).join(', ');
      return {
        valid: false,
        error: `Tool "${toolName}" does not exist. Available tools: ${available}`,
      };
    }

    // Check required parameters
    for (const [paramName, paramSchema] of Object.entries(schema.parameters)) {
      if (paramSchema.required && !(paramName in args)) {
        return {
          valid: false,
          error: `Missing required parameter "${paramName}" for tool "${toolName}"`,
        };
      }

      if (paramName in args) {
        const value = args[paramName];

        // Type checking
        if (paramSchema.type === 'string' && typeof value !== 'string') {
          return {
            valid: false,
            error: `Parameter "${paramName}" must be a string, got ${typeof value}`,
          };
        }

        // Enum checking
        if (paramSchema.enum && !paramSchema.enum.includes(value as string)) {
          return {
            valid: false,
            error: `Parameter "${paramName}" must be one of: ${paramSchema.enum.join(', ')}`,
          };
        }
      }
    }

    return { valid: true };
  }
}
```
Failure Mode 5: Cascading Failures
One failure triggers another. The agent fails to read a file, so it guesses the contents. The guess is wrong, so it writes incorrect data. The incorrect data causes downstream errors.
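A blunt but effective guard against this is a consecutive-failure budget: after N tool failures in a row, halt the run instead of letting the agent improvise on bad data. A minimal sketch (the `FailureBudget` name and threshold are illustrative):

```typescript
// failure-budget.ts

class FailureBudget {
  private consecutiveFailures = 0;

  constructor(private maxConsecutive = 3) {}

  // Record the outcome of each tool call. Returns true while the
  // run may continue, false once the budget is exhausted.
  record(success: boolean): boolean {
    if (success) {
      this.consecutiveFailures = 0; // any success resets the streak
      return true;
    }
    this.consecutiveFailures++;
    return this.consecutiveFailures < this.maxConsecutive;
  }
}
```

When `record` returns false, stop the agent and surface a degraded response; three failed guesses in a row is a strong signal the agent is compounding an error rather than recovering from it.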
Error Boundaries
Wrap agent execution in error boundaries that limit the blast radius:
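The graceful-degradation handler later in this article calls a `runAgentSafely` wrapper; a minimal sketch of the contract it needs is below. The `Agent` interface here is a stand-in for whatever your framework provides, and the timeout-via-`Promise.race` approach is one way to enforce a hard deadline, not the only one:

```typescript
// agent-boundary.ts

interface Agent {
  run(message: string): Promise<string>;
}

interface AgentRunResult {
  success: boolean;
  response?: string;
  error?: string;
}

interface SafetyOptions {
  maxDurationMs: number;
  onError?: (err: Error) => void;
}

async function runAgentSafely(
  agent: Agent,
  message: string,
  opts: SafetyOptions
): Promise<AgentRunResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Hard timeout: the run loses the race if it exceeds maxDurationMs
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Agent run timed out after ${opts.maxDurationMs}ms`)),
      opts.maxDurationMs
    );
  });

  try {
    const response = await Promise.race([agent.run(message), timeout]);
    return { success: true, response };
  } catch (error) {
    const err = error as Error;
    opts.onError?.(err); // report, but never rethrow past the boundary
    return { success: false, error: err.message };
  } finally {
    clearTimeout(timer); // do not leave the timer pending after the race
  }
}
```

The important property is that nothing escapes: every outcome, including a timeout, becomes a structured result the caller can branch on.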
Graceful Degradation
When the agent fails, do not show users a blank error screen. Fall back to something useful:
```typescript
async function handleUserRequest(
  userId: string,
  message: string
): Promise<string> {
  const result = await runAgentSafely(agent, message, {
    maxDurationMs: 30_000,
    onError: (err) => logAgentError(userId, err),
  });

  if (result.success) {
    return result.response!;
  }

  // Graceful degradation tiers
  if (result.error?.includes('timed out')) {
    return "I'm taking longer than expected to process this. I've saved your request and will follow up shortly.";
  }

  if (result.error?.includes('loop')) {
    return "I got stuck trying to solve this. Let me connect you with a human who can help.";
  }

  // Generic fallback
  return "I wasn't able to complete this request. Here are some things you can try, or I can connect you with support.";
}
```
The Takeaway
AI agents fail in fundamentally different ways than traditional software. Tool failures, context overflow, infinite loops, hallucinated calls, and cascading errors are not edge cases. They are guaranteed to happen in production.
Build the error boundaries from day one: limit tool calls, detect loops, manage context size, validate tool calls, set hard timeouts, and degrade gracefully. The code above handles the five most common failure modes.
If you want agent observability without building the infrastructure, Transactional's Error Tracking provides AI-specific error categorization, loop detection alerts, and cost anomaly monitoring. But the patterns above are the foundation regardless of your tooling.