Engineering
10 min read

We Were Flying Blind on LLM Costs Until We Started Tracing Every Token

How we built token-level tracing to gain visibility into LLM costs, latency, and performance across providers. Architecture of the observability pipeline and the cost surprises we caught.

Transactional Team
Jan 30, 2026

$47,000 in One Month. A Common Wake-Up Call.

Consider a scenario that plays out at many companies: a team expects roughly $8,000 per month in LLM costs based on request volume. Then the invoice arrives: $47,000.

The culprit is often a single feature. A content summarization pipeline shipped weeks prior, using GPT-4 where GPT-4o-mini would have been sufficient, sending full documents instead of truncated inputs, and retrying failed requests without exponential backoff -- meaning rate-limited requests compound into dozens of redundant calls.

None of this is visible in traditional monitoring. Standard APM tracks HTTP status codes and response times. It has no concept of tokens, model selection, or per-request cost. Running a production AI system without token-level tracing is the observability equivalent of checking the electricity bill once a month.

The ROI of Token-Level Tracing

$47K -- Monthly Spend Before Tracing
$9.2K -- Monthly Spend After Tracing
190x -- Return on Tracing Investment
$200/mo -- Tracing Pipeline Cost

What Traditional APM Misses

Standard application monitoring tools were built for request-response architectures. They track latency, error rates, throughput. For LLM applications, these metrics are necessary but nowhere near sufficient.

Here is what you actually need to know:

Metric | Why It Matters
Tokens per request (prompt + completion) | Directly determines cost
Cost per request in USD | Aggregates to actual spend
Latency per provider per model | Performance comparison
Time to first token (TTFT) | User-perceived responsiveness
Cache hit rate | Cost savings validation
Failover frequency | Provider reliability
Error rate by type | Rate limits vs. content filters vs. model errors
Token efficiency | Are you sending too many tokens for the output you need?

Architecture of the Tracing Pipeline

Our observability system captures every LLM request at the gateway level. Here is the high-level flow:

Application Request
       |
       v
   AI Gateway
       |
  ┌────┴────┐
  │ Request │──── Capture: model, provider, prompt tokens,
  │  Phase  │     timestamp, request hash, user context
  └────┬────┘
       |
       v
   LLM Provider
       |
  ┌────┴─────┐
  │ Response │──── Capture: completion tokens, latency,
  │  Phase   │     TTFT, finish reason, status
  └────┬─────┘
       |
       v
  Cost Calculator ──── Model-specific pricing lookup
       |
       v
  Structured Log ──── Written to analytics pipeline
       |
       v
  ┌──────────┬──────────┐
  │ Real-time│  Batch   │
  │ Dashboard│ Analytics│
  └──────────┴──────────┘
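
The capture logic at the gateway is thin. The sketch below is illustrative rather than our production code: the request and response types and the forwardToProvider and writeTrace helpers are stand-ins. It shows where the request-phase and response-phase fields from the diagram get populated before the record flows on to the cost calculator.

// Illustrative only: minimal shapes and helpers standing in for the real gateway internals.
interface GatewayRequest { organizationId: string; projectId: string; apiKeyId: string; model: string; body: unknown; }
interface ProviderResult { provider: string; model: string; statusCode: number; finishReason: string; promptTokens: number; completionTokens: number; }
declare function forwardToProvider(req: GatewayRequest): Promise<ProviderResult>;
declare function writeTrace(record: Record<string, unknown>): void;

async function handleCompletion(req: GatewayRequest): Promise<ProviderResult> {
  // Request phase: everything known before the provider call
  const traceId = crypto.randomUUID();
  const requestedAt = new Date();
  const started = performance.now();

  const res = await forwardToProvider(req);   // may fail over to another provider

  // Response phase: token usage, latency, outcome
  writeTrace({
    traceId,
    organizationId: req.organizationId,
    projectId: req.projectId,
    apiKeyId: req.apiKeyId,
    requestedModel: req.model,
    provider: res.provider,
    resolvedModel: res.model,
    promptTokens: res.promptTokens,
    completionTokens: res.completionTokens,
    totalTokens: res.promptTokens + res.completionTokens,
    latencyMs: performance.now() - started,
    statusCode: res.statusCode,
    finishReason: res.finishReason,
    requestedAt,
    respondedAt: new Date(),
  });

  return res;
}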

The Trace Record

Every LLM interaction produces a trace record with this structure:

interface LLMTrace {
  // Identity
  traceId: string;
  spanId: string;
  parentSpanId?: string;     // For chained LLM calls
 
  // Request context
  organizationId: string;
  projectId: string;
  apiKeyId: string;
  userId?: string;
 
  // Model details
  provider: string;
  model: string;
  requestedModel: string;    // What the app asked for
  resolvedModel: string;     // What actually served it (after failover)
 
  // Token accounting
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cachedTokens: number;      // Tokens served from cache
 
  // Cost
  promptCostUsd: number;
  completionCostUsd: number;
  totalCostUsd: number;
 
  // Performance
  latencyMs: number;
  timeToFirstTokenMs: number;
  tokensPerSecond: number;
 
  // Outcome
  statusCode: number;
  finishReason: string;
  cacheHit: boolean;
  failoverAttempts: number;
  error?: {
    code: string;
    message: string;
    retryable: boolean;
  };
 
  // Timestamps
  requestedAt: Date;
  respondedAt: Date;
}

Cost Calculation

We maintain a pricing table for every model we support. Prices are stored as cost-per-million-tokens and updated weekly:

const modelPricing: Record<string, { promptPerMillion: number; completionPerMillion: number }> = {
  "gpt-4o":           { promptPerMillion: 2.50,  completionPerMillion: 10.00 },
  "gpt-4o-mini":      { promptPerMillion: 0.15,  completionPerMillion: 0.60  },
  "claude-sonnet-4-20250514": { promptPerMillion: 3.00,  completionPerMillion: 15.00 },
  "claude-haiku-4-5-20251001":  { promptPerMillion: 0.80,  completionPerMillion: 4.00  },
  "gemini-2.0-flash": { promptPerMillion: 0.10,  completionPerMillion: 0.40  },
  // ...
};
 
function calculateCost(model: string, promptTokens: number, completionTokens: number): number {
  const pricing = modelPricing[model];
  if (!pricing) return 0;
  return (promptTokens / 1_000_000) * pricing.promptPerMillion
       + (completionTokens / 1_000_000) * pricing.completionPerMillion;
}

The cost is calculated at request time and attached to the trace. This gives us real-time cost visibility rather than waiting for the provider's billing dashboard to update.

Common Cost Surprises Token Tracing Reveals

With full tracing in production, patterns that were previously invisible become obvious.

The Retry Amplifier

A common pattern: retry logic that resends the full request on any 429 (rate limit) response. Under load, a single user request can generate 5-8 LLM calls. The retries often fail too -- hitting the same rate limit. This single pattern can account for 30% or more of monthly spend.
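
The fix is bounded retries with exponential backoff and jitter, so a 429 does not fan out into a burst of redundant calls. A minimal sketch, with illustrative attempt limits and delay caps:

// Bounded retry with exponential backoff and full jitter; limits are illustrative.
async function callWithBackoff<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number }).status;
      const retryable = status === 429 || status === 503;
      if (!retryable || attempt === maxAttempts - 1) throw err;
      // 1s, 2s, 4s ... capped at 30s, with jitter so retries do not synchronize
      const delayMs = Math.random() * Math.min(30_000, 1_000 * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("unreachable");
}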

The System Prompt Tax

Many services send 2,000+ token system prompts with every request. These prompts are often static -- the same for every call. Extracting common system prompts and using prompt caching (supported by Anthropic and OpenAI) can cut prompt token costs by 40% or more.
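
OpenAI applies prefix caching automatically for sufficiently long prompts, while Anthropic's API takes an explicit cache_control marker on the static block. A sketch of the latter with the Anthropic TypeScript SDK; the prompt text and model are illustrative, and exact details may vary by SDK version:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// The large static system prompt is sent as a cacheable block; on cache hits
// its tokens are billed at a reduced rate, and only the short per-request
// message is billed at the full prompt price.
const STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme. Follow these policies: ...";

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: "Summarize this ticket: ..." }],
});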

The Model Mismatch

A classification pipeline using GPT-4o to categorize support tickets into 8 categories is a common example. The task is simple pattern matching. Switching to GPT-4o-mini typically results in less than 1% accuracy loss while reducing cost by 94%.
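
One way to keep this from happening silently is to route by task type instead of hardcoding a model at each call site. A hypothetical routing table; the task names and defaults are illustrative:

// Hypothetical task-to-model routing; cheap models by default, expensive ones opt-in.
const modelByTask: Record<string, string> = {
  classification: "gpt-4o-mini",   // simple pattern matching
  extraction:     "gpt-4o-mini",
  summarization:  "gpt-4o-mini",
  reasoning:      "gpt-4o",        // reserve the expensive model for genuinely hard tasks
};

function resolveModel(task: string, override?: string): string {
  return override ?? modelByTask[task] ?? "gpt-4o-mini";
}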

The Embedding Leak

Embedding pipelines that re-embed documents that have not changed are a frequent source of waste. Without request-level tracing, this is invisible -- the embeddings endpoint does not return tokens in the same way as chat completions. Adding content hashing can cut embedding costs by 60% or more.
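
A sketch of the content-hash check, assuming an embeddingStore keyed by document ID and an embed helper that calls the provider; both are stand-ins:

import { createHash } from "node:crypto";

// Stand-ins for the real storage layer and embeddings client
declare const embeddingStore: {
  get(docId: string): Promise<{ contentHash: string; vector: number[] } | undefined>;
  set(docId: string, value: { contentHash: string; vector: number[] }): Promise<void>;
};
declare function embed(content: string): Promise<number[]>;

async function embedIfChanged(docId: string, content: string): Promise<number[]> {
  const contentHash = createHash("sha256").update(content).digest("hex");

  // Skip the embeddings call entirely when the content has not changed
  const existing = await embeddingStore.get(docId);
  if (existing && existing.contentHash === contentHash) return existing.vector;

  const vector = await embed(content);
  await embeddingStore.set(docId, { contentHash, vector });
  return vector;
}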

Real-Time Dashboards

The tracing data feeds into dashboards that show:

Cost Dashboard: Spend per hour, per model, per API key, per project. We set budget alerts at 80% of expected daily spend.

Latency Dashboard: P50/P95/P99 latency per provider and model. Time to first token distribution. We use this to make failover threshold decisions.

Cache Dashboard: Hit rate over time, estimated savings, cache size. This validates that our caching investment is paying off.

Error Dashboard: Error rate by type (rate limit, context length, content filter, timeout). Provider health at a glance.

Alerts That Matter

We run alerting on computed metrics, not raw values:

- Cost per hour exceeds 2x the trailing 7-day average → alert
- Cache hit rate drops below 50% of trailing average → alert
- P95 latency for any model exceeds 30 seconds → alert
- Error rate for any provider exceeds 5% over 15 minutes → alert
- Any single API key exceeds daily budget → alert + auto-throttle

The auto-throttle on budget exceeded is critical. Without it, a runaway process can burn through an entire month's budget in hours.
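
A sketch of the two checks that matter most here, evaluated on aggregated trace data every few minutes. The thresholds mirror the rules above; the spend inputs are assumed to come from the tracing pipeline:

// Illustrative alert checks; thresholds match the alerting rules above.
function costSpikeAlert(currentHourUsd: number, trailing7dHourlyAvgUsd: number, lastAlertAt?: Date): boolean {
  const inCooldown = lastAlertAt !== undefined &&
    Date.now() - lastAlertAt.getTime() < 15 * 60 * 1000;   // 15-minute cooldown (see hysteresis note below)
  return !inCooldown && currentHourUsd > 2 * trailing7dHourlyAvgUsd;
}

function shouldAutoThrottle(keySpendTodayUsd: number, keyDailyBudgetUsd: number): boolean {
  // Hard stop: once a key exceeds its daily budget, the gateway rejects further requests
  return keySpendTodayUsd >= keyDailyBudgetUsd;
}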

Implementation Lessons

Trace at the gateway, not the application. If you instrument individual services, you will miss calls, get inconsistent schemas, and have blind spots. The gateway sees everything.

Store raw token counts, calculate costs on read. Pricing changes. If you store pre-calculated costs and a provider retroactively adjusts pricing, your historical data is wrong. Store tokens and model identifiers; calculate costs at query time using a versioned pricing table.
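
A sketch of what calculating on read looks like with a versioned pricing table; the structure is illustrative, and the key point is that the pricing version in effect at request time is selected by timestamp:

interface PricingVersion {
  effectiveFrom: Date;
  prices: Record<string, { promptPerMillion: number; completionPerMillion: number }>;
}

// versions sorted newest-first; pick the one in effect when the request was made
function costAtQueryTime(
  versions: PricingVersion[],
  model: string,
  promptTokens: number,
  completionTokens: number,
  requestedAt: Date,
): number {
  const version = versions.find((v) => v.effectiveFrom <= requestedAt);
  const pricing = version?.prices[model];
  if (!pricing) return 0;
  return (promptTokens / 1_000_000) * pricing.promptPerMillion
       + (completionTokens / 1_000_000) * pricing.completionPerMillion;
}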

Separate streaming latency from total latency. A streaming request that delivers its first token in 200ms but takes 30 seconds to finish can still feel fast to the user; judged on total latency alone, it would be flagged as slow.
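
Capturing the two separately is straightforward when the gateway relays the stream. A sketch; the chunk shape is illustrative:

// Measures time to first token and total latency while relaying a streamed response.
async function measureStream(
  stream: AsyncIterable<{ text: string }>,
): Promise<{ timeToFirstTokenMs: number; latencyMs: number; output: string }> {
  const started = performance.now();
  let firstTokenAt: number | null = null;
  let output = "";

  for await (const chunk of stream) {
    if (firstTokenAt === null) firstTokenAt = performance.now();  // user-perceived responsiveness
    output += chunk.text;
  }

  const finished = performance.now();
  return {
    timeToFirstTokenMs: (firstTokenAt ?? finished) - started,
    latencyMs: finished - started,
    output,
  };
}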

Budget alerts need hysteresis. A spike that triggers an alert, recovers, and re-triggers creates alert fatigue. We use a 15-minute cooldown window.

The ROI of Tracing

In the scenario above, deploying tracing and acting on the findings can bring monthly LLM spend from $47,000 back to around $9,200. A typical breakdown:

  • Model right-sizing: ~$18,000 saved
  • Fixing retry amplification: ~$12,000 saved
  • Prompt caching: ~$5,000 saved
  • Embedding deduplication: ~$2,800 saved

A tracing pipeline typically costs about $200/month in compute and storage. That represents a 190x return.

Key Takeaway

If you are running LLM workloads in production without token-level tracing, you are guaranteed to be overspending. The question is not whether you have waste -- it is how much.

The fix is not complicated. Capture every request, count every token, calculate every cost, and surface it in real time. The patterns will reveal themselves.

See how we built this into LLM Observability.


Tags:
observability
llm
cost
