An Enterprise Team Was Shipping Hallucinations to Users. Traces Showed Them Where.
How an enterprise company with AI-powered customer support reduced hallucination rates from 8% to 0.3% and cut AI issue MTTR from days to minutes using LLM observability and trace-level analysis.
Transactional Team
Mar 4, 2026
8 min read
8% of the Time, the Bot Was Lying
An enterprise software company -- let us call them Nexera -- deployed AI-powered customer support agents in late 2025. The following is a composite scenario based on common patterns we have observed, not a specific customer engagement. The agents handled tier-1 support queries: billing questions, feature explanations, account management, troubleshooting guides. Standard stuff.
The agents were good. Customer satisfaction scores went up. Ticket resolution time dropped 60%. The support team loved it because they could focus on complex issues instead of answering the same questions repeatedly.
Then QA did a manual audit.
They sampled 500 agent responses and checked each one against the source documentation. 42 responses were wrong. Not vaguely wrong. Specifically, factually wrong. The agent told a customer their plan included a feature it did not. It quoted a pricing tier that did not exist. It described a configuration option that had been deprecated six months ago.
8.4% hallucination rate. And every one of those responses was delivered with complete confidence. No hedging. No disclaimers. Just incorrect information presented as fact.
The problem was not that the agent was hallucinating. All LLMs hallucinate. The problem was that Nexera had no way to detect which responses were wrong without manually reading every single one.
Nexera AI Observability: Before vs After

| Metric | Before Observability | After Observability |
| --- | --- | --- |
| Hallucination rate | 8.4% | 0.3% |
| Issue MTTR | 3-5 days | 12 minutes |
| Quality detection | Manual audit | Automated |
| Customer trust score | 72 | 91 |
The Invisible Failure Mode
Nexera's existing monitoring was comprehensive by traditional standards. They had:
Datadog for infrastructure and APM
PagerDuty for alerting
Custom dashboards for agent response times and resolution rates
CSAT surveys after every interaction
All of it showed green. Response times were fast. Resolution rates were high. CSAT scores were good. The 8% of customers who received wrong answers either did not notice, did not complain, or their low ratings got averaged out in the aggregate metrics.
The engineering team had no automated way to answer a basic question: "Is this response correct?"
They could tell you the p99 latency of the agent's API call. They could not tell you if the answer was right.
Setting Up LLM Observability
Nexera connected their support agents to our platform in phases.
Phase 1: Trace Everything (Week 1)
First, every LLM call in the support agent pipeline was instrumented. Nexera's agent architecture had three LLM steps per conversation:
Intent classification: Determine what the customer is asking about
Context retrieval + response generation: RAG query against their knowledge base, then generate a response
Response review: A second LLM call that checks the response for policy compliance
Within the first day, Nexera had complete visibility into every LLM call: what went in, what came out, how long it took, what it cost, and which knowledge base documents were used as context.
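The instrumentation pattern can be sketched as a decorator that wraps each LLM step and records its inputs, output, and latency as a trace span. This is a minimal illustration, not the platform's actual SDK; the `Span` fields, `traced` decorator, and the stand-in step functions are all hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class Span:
    """One recorded LLM step: what went in, what came out, how long it took."""
    name: str
    inputs: dict
    output: object
    latency_ms: float


TRACE: list[Span] = []


def traced(name):
    """Decorator that appends a Span for each call to the wrapped LLM step."""
    def wrap(fn):
        def inner(**kwargs):
            start = time.perf_counter()
            result = fn(**kwargs)
            TRACE.append(Span(
                name=name,
                inputs=kwargs,
                output=result,
                latency_ms=(time.perf_counter() - start) * 1000,
            ))
            return result
        return inner
    return wrap


@traced("intent_classification")
def classify_intent(message):
    return "billing"  # stand-in for a real LLM call


@traced("response_generation")
def generate_response(message, intent):
    return f"[{intent}] answer"  # stand-in for RAG retrieval + generation


msg = "How many seats does my plan include?"
intent = classify_intent(message=msg)
reply = generate_response(message=msg, intent=intent)
```

After the two calls, `TRACE` holds one span per step, which is exactly the shape of data a trace viewer renders.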
Phase 2: Quality Scoring (Week 2)
Raw traces are useful for debugging individual issues, but Nexera needed automated quality assessment across thousands of daily conversations.
Three quality scoring dimensions were configured:
Groundedness: Does the response only contain information present in the retrieved documents? This is the hallucination detector. The scorer compares claims in the response against the retrieved context and flags any claims that are not supported.
Relevance: Does the response actually address the customer's question? A grounded response is useless if it answers the wrong question.
Format compliance: Does the response follow Nexera's style guidelines? No jargon, no speculation, proper formatting, required disclaimers for certain topics.
Quality scores are computed asynchronously. They do not add latency to the customer conversation. Scores are attached to traces within seconds and available in the dashboard.
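To make the groundedness idea concrete, here is a deliberately naive lexical version: score a response by the fraction of its sentences whose content words all appear in the retrieved context. Production scorers use an LLM judge to compare claims, not word overlap; this sketch only illustrates the shape of the check.

```python
def groundedness(response: str, context_docs: list[str]) -> float:
    """Toy groundedness score: fraction of response sentences whose
    content words (longer than 3 chars) all appear in the retrieved
    context. Real scorers use an LLM judge over extracted claims."""
    context = " ".join(context_docs).lower()
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    if not sentences:
        return 1.0
    supported = 0
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if words and all(w in context for w in words):
            supported += 1
    return supported / len(sentences)


docs = ["Your plan includes up to 50 users. Billing runs monthly."]
ok = groundedness("Your plan includes up to 50 users.", docs)
bad = groundedness("Your plan includes unlimited storage.", docs)
```

A supported claim scores 1.0 and a fabricated one scores 0.0; an LLM-based scorer produces the same kind of per-response signal, just far more robustly.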
With groundedness scoring running on every response, patterns emerged immediately.
The dashboard showed that 7.9% of responses had a groundedness score below 0.85. This closely tracked the 8.4% figure from the manual audit, validating that the automated scoring was catching real hallucinations.
But the data revealed something the manual audit could not: hallucinations were not uniformly distributed.
By topic: Billing questions had a 2.1% hallucination rate. Feature questions had 5.3%. Troubleshooting questions had 18.7%. The troubleshooting category was dragging up the overall average.
By prompt section: Hallucinations spiked when the retrieved context contained multiple conflicting knowledge base articles. The agent would blend information from different articles, creating responses that were not fully supported by any single source.
By time: Hallucination rate was higher on Mondays. After investigation, this correlated with knowledge base updates that the docs team published on Friday afternoons. The embeddings for updated articles were regenerated over the weekend, but there was a 12-hour window where the retrieval system returned stale content.
None of these patterns would have been visible without trace-level observability.
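Once scores are attached to traces, the topic breakdown above is a simple aggregation. A minimal sketch, assuming each scored trace is a dict with a `topic` label and a `groundedness` score (the field names are hypothetical):

```python
from collections import defaultdict


def rate_by_topic(traces: list[dict], threshold: float = 0.85) -> dict:
    """Fraction of traces per topic whose groundedness falls below
    the threshold -- i.e., the per-topic hallucination rate."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for t in traces:
        totals[t["topic"]] += 1
        if t["groundedness"] < threshold:
            flagged[t["topic"]] += 1
    return {topic: flagged[topic] / totals[topic] for topic in totals}


traces = [
    {"topic": "billing", "groundedness": 0.95},
    {"topic": "billing", "groundedness": 0.70},
    {"topic": "troubleshooting", "groundedness": 0.60},
]
rates = rate_by_topic(traces)
```

The same grouping by prompt section or by day of week is what surfaced the conflicting-context and Monday-spike patterns.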
The Fixes
Armed with specific data about what was going wrong and where, Nexera made targeted fixes.
Fix 1: Context Window Deduplication
When multiple knowledge base articles contained overlapping information, the RAG system retrieved all of them. The agent saw conflicting or redundant context and sometimes blended the information incorrectly.
Nexera added a deduplication step that checks retrieved documents for semantic overlap and keeps only the most relevant, non-contradictory set. This alone dropped the troubleshooting hallucination rate from 18.7% to 6.2%.
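A deduplication pass of this kind can be sketched with a simple overlap filter: walk the retrieved documents in relevance order and drop any document that overlaps too heavily with one already kept. This toy version uses token-set Jaccard similarity; a real system would compare embedding cosine similarity and also detect contradictions, which plain overlap cannot.

```python
def dedupe_context(docs: list[str], max_overlap: float = 0.8) -> list[str]:
    """Keep retrieved docs in relevance order, dropping any doc whose
    token-set Jaccard overlap with an already-kept doc exceeds
    max_overlap. Embedding cosine similarity is the realistic choice."""
    kept: list[str] = []
    for doc in docs:  # docs assumed sorted by retrieval relevance
        tokens = set(doc.lower().split())
        overlaps = (
            len(tokens & set(k.lower().split())) / len(tokens | set(k.lower().split()))
            for k in kept
        )
        if all(o <= max_overlap for o in overlaps):
            kept.append(doc)
    return kept


docs = [
    "reset your password from the settings page",
    "reset your password from the settings page today",  # near-duplicate
    "contact support to change your billing plan",
]
kept = dedupe_context(docs)
```

The near-duplicate article is dropped, so the model sees one authoritative version of each fact instead of blending several.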
Fix 2: Source Attribution
The response generation prompt was updated to require explicit source attribution. Instead of generating a free-form answer, the agent now cites which knowledge base article supports each claim.
Before: "Your plan includes up to 50 users."
After: "Your plan includes up to 50 users (ref: pricing-enterprise-v3, section 2.1)."
This forced the model to ground every claim in a specific source. If it could not find a source, it was instructed to say "I don't have specific information about that" instead of guessing. Hallucination rate for feature questions dropped from 5.3% to 0.8%.
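The prompt change can be sketched as a template that demands a citation after every claim and prescribes the refusal phrase. The exact wording and the `(ref: ...)` format here are illustrative, not Nexera's actual prompt.

```python
# Hypothetical attribution prompt template; the citation format and
# refusal phrase mirror the behavior described above.
ATTRIBUTION_PROMPT = """Answer the customer's question using ONLY the \
retrieved articles below. After every factual claim, cite the supporting \
article as (ref: <article-id>, <section>). If no article supports a claim, \
say "I don't have specific information about that" instead of guessing.

Articles:
{context}

Question: {question}
"""

prompt = ATTRIBUTION_PROMPT.format(
    context="pricing-enterprise-v3, section 2.1: Plans include up to 50 users.",
    question="How many users does my plan include?",
)
```

Because every claim must point at a source, unsupported claims become structurally visible, both to the model and to the groundedness scorer downstream.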
Fix 3: Stale Content Guard
They added a check that compares the timestamp of retrieved documents against the current embedding index version. If a document was updated after the embeddings were last generated, the agent flags the response as potentially using stale information and either defers to a human agent or adds a disclaimer.
This eliminated the Monday hallucination spike entirely.
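The guard itself reduces to a timestamp comparison between each retrieved document and the embedding index build. A minimal sketch, with hypothetical field names and example dates:

```python
from datetime import datetime


def is_stale(doc_updated_at: datetime, index_built_at: datetime) -> bool:
    """A doc edited after the embedding index was last built may be
    retrieved with outdated content; flag it for disclaimer or handoff."""
    return doc_updated_at > index_built_at


index_built_at = datetime(2026, 1, 10, 2, 0)    # weekend re-embed job
friday_doc = datetime(2026, 1, 9, 17, 30)       # published before the rebuild
monday_doc = datetime(2026, 1, 12, 9, 0)        # edited after the rebuild
```

Any response built on a flagged document gets a disclaimer or is deferred to a human, closing the 12-hour stale-retrieval window.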
Fix 4: Confidence Gating
For responses with a groundedness score below 0.90, instead of sending them directly to the customer, the system routes them to a human review queue. A support agent reviews the response, edits if needed, and sends it manually.
This acts as a safety net. The 0.3% of responses that still hallucinate never reach the customer unreviewed.
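The gating logic is a small routing function sitting between the agent and the customer. A sketch, assuming the 0.90 threshold described above and a two-destination router (the destination names are hypothetical):

```python
def route(response: str, groundedness: float, threshold: float = 0.90):
    """Send high-confidence responses directly; queue low-confidence
    ones for a human agent to review, edit, and send manually."""
    if groundedness < threshold:
        return ("human_review", response)
    return ("send", response)
```

Tuning the threshold trades review-queue volume against the chance of an unreviewed hallucination reaching a customer; Nexera's 0.90 setting routed 1.2% of responses to humans.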
The Results
Measured over a 90-day period after all fixes were deployed:
| Metric | Before | After |
| --- | --- | --- |
| Overall hallucination rate | 8.4% | 0.3% |
| Billing hallucination rate | 2.1% | 0.1% |
| Feature hallucination rate | 5.3% | 0.2% |
| Troubleshooting hallucination rate | 18.7% | 0.8% |
| MTTR for AI quality issues | 3-5 days | 12 minutes |
| Responses routed to human review | 0% | 1.2% |
| Customer trust score (internal) | 72 | 91 |
The MTTR improvement is worth highlighting. Before observability, the process for identifying an AI quality issue was:
Customer complains (hours to days after the interaction)
Support team escalates to engineering (hours)
Engineering searches logs for the conversation (hours)
Engineering manually reviews the response (minutes)
Engineering tries to reproduce the issue (hours to days)
After observability:
Quality alert fires in Slack with the trace link (seconds)
Engineer clicks through to the full trace (seconds)
Engineer sees the exact prompt, context, response, and quality scores (seconds)
Engineer identifies the root cause from the trace data (minutes)
From days to minutes. Not because the engineers work faster, but because they no longer spend 95% of their time finding and reproducing the problem.
What Nexera Learned
The engineering lead at Nexera shared three lessons that apply to anyone running AI agents in production:
Your aggregate metrics are hiding your worst failures. A 92% CSAT score looks good. It hides the fact that 8% of your responses are wrong. Traditional monitoring averages away the failures that matter most.
Hallucinations have patterns. They are not random. They cluster by topic, by context quality, by time of day. Once you can see the patterns, the fixes are often straightforward.
You need a safety net, not just prevention. No amount of prompt engineering will reduce hallucinations to zero. The question is whether wrong responses reach your customers or get caught first. Confidence gating with human review is that safety net.
The Takeaway
AI agents are good enough to deploy in production. They are not good enough to deploy without observability.
The difference between a successful AI deployment and a liability is not the model you use or the prompts you write. It is whether you can see what the model is actually saying to your users and catch the failures before they compound.
Nexera's agents are now handling 40% more conversations than before, with better accuracy and higher customer trust. Not because they switched to a better model. Because they can see what the model is doing.
Explore LLM Observability to add trace-level quality monitoring to your AI agents.