Engineering
10 min read

We Were Flying Blind on LLM Costs Until We Started Tracing Every Token

How we built token-level tracing to gain visibility into LLM costs, latency, and performance across providers. Architecture of the observability pipeline and the cost surprises we caught.

Transactional Team
Jan 30, 2026

$47,000 in One Month. A Common Wake-Up Call.

Consider a scenario that plays out at many companies: a team expects roughly $8,000 per month in LLM costs based on request volume. Then the invoice arrives: $47,000.

The culprit is often a single feature. A content summarization pipeline shipped weeks prior, using GPT-4 where GPT-4o-mini would have been sufficient, sending full documents instead of truncated inputs, and retrying failed requests without exponential backoff -- meaning rate-limited requests compound into dozens of redundant calls.

None of this is visible in traditional monitoring. Standard APM tracks HTTP status codes and response times. It has no concept of tokens, model selection, or per-request cost. Running a production AI system without token-level tracing is the observability equivalent of checking the electricity bill once a month.

The ROI of Token-Level Tracing

$47K -- Monthly Spend Before Tracing
$9.2K -- Monthly Spend After Tracing
190x -- Return on Tracing Investment
$200/mo -- Tracing Pipeline Cost

What Traditional APM Misses

Standard application monitoring tools were built for request-response architectures. They track latency, error rates, throughput. For LLM applications, these metrics are necessary but nowhere near sufficient.

Here is what you actually need to know:

Metric | Why It Matters
Tokens per request (prompt + completion) | Directly determines cost
Cost per request in USD | Aggregates to actual spend
Latency per provider per model | Performance comparison
Time to first token (TTFT) | User-perceived responsiveness
Cache hit rate | Cost savings validation
Failover frequency | Provider reliability
Error rate by type | Rate limits vs. content filters vs. model errors
Token efficiency | Are you sending too many tokens for the output you need?

Architecture of the Tracing Pipeline

Our observability system captures every LLM request at the gateway level. Here is the high-level flow:

Application Request
       |
       v
   AI Gateway
       |
  ┌────┴────┐
  │ Request │──── Capture: model, provider, prompt tokens,
  │  Phase  │     timestamp, request hash, user context
  └────┬────┘
       |
       v
   LLM Provider
       |
  ┌────┴─────┐
  │ Response │──── Capture: completion tokens, latency,
  │  Phase   │     TTFT, finish reason, status
  └────┬─────┘
       |
       v
  Cost Calculator ──── Model-specific pricing lookup
       |
       v
  Structured Log ──── Written to analytics pipeline
       |
       v
  ┌──────────┬──────────┐
  │ Real-time│  Batch   │
  │ Dashboard│ Analytics│
  └──────────┴──────────┘
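
The capture logic at the gateway is thin. The sketch below is illustrative rather than our production code: the request and response types and the forwardToProvider and writeTrace helpers are stand-ins. It shows where the request-phase and response-phase fields from the diagram get populated before the record flows on to the cost calculator.

// Illustrative only: minimal shapes and helpers standing in for the real gateway internals.
interface GatewayRequest { organizationId: string; projectId: string; apiKeyId: string; model: string; body: unknown; }
interface ProviderResult { provider: string; model: string; statusCode: number; finishReason: string; promptTokens: number; completionTokens: number; }
declare function forwardToProvider(req: GatewayRequest): Promise<ProviderResult>;
declare function writeTrace(record: Record<string, unknown>): void;

async function handleCompletion(req: GatewayRequest): Promise<ProviderResult> {
  // Request phase: everything known before the provider call
  const traceId = crypto.randomUUID();
  const requestedAt = new Date();
  const started = performance.now();

  const res = await forwardToProvider(req);   // may fail over to another provider

  // Response phase: token usage, latency, outcome
  writeTrace({
    traceId,
    organizationId: req.organizationId,
    projectId: req.projectId,
    apiKeyId: req.apiKeyId,
    requestedModel: req.model,
    provider: res.provider,
    resolvedModel: res.model,
    promptTokens: res.promptTokens,
    completionTokens: res.completionTokens,
    totalTokens: res.promptTokens + res.completionTokens,
    latencyMs: performance.now() - started,
    statusCode: res.statusCode,
    finishReason: res.finishReason,
    requestedAt,
    respondedAt: new Date(),
  });

  return res;
}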

The Trace Record

Every LLM interaction produces a trace record with this structure:

interface LLMTrace {
  // Identity
  traceId: string;
  spanId: string;
  parentSpanId?: string;     // For chained LLM calls
 
  // Request context
  organizationId: string;
  projectId: string;
  apiKeyId: string;
  userId?: string;
 
  // Model details
  provider: string;
  model: string;
  requestedModel: string;    // What the app asked for
  resolvedModel: string;     // What actually served it (after failover)
 
  // Token accounting
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cachedTokens: number;      // Tokens served from cache
 
  // Cost
  promptCostUsd: number;
  completionCostUsd: number;
  totalCostUsd: number;
 
  // Performance
  latencyMs: number;
  timeToFirstTokenMs: number;
  tokensPerSecond: number;
 
  // Outcome
  statusCode: number;
  finishReason: string;
  cacheHit: boolean;
  failoverAttempts: number;
  error?: {
    code: string;
    message: string;
    retryable: boolean;
  };
 
  // Timestamps
  requestedAt: Date;
  respondedAt: Date;
}

Cost Calculation

We maintain a pricing table for every model we support. Prices are stored as cost-per-million-tokens and updated weekly:

const modelPricing: Record<string, { promptPerMillion: number; completionPerMillion: number }> = {
  "gpt-4o":           { promptPerMillion: 2.50,  completionPerMillion: 10.00 },
  "gpt-4o-mini":      { promptPerMillion: 0.15,  completionPerMillion: 0.60  },
  "claude-sonnet-4-20250514": { promptPerMillion: 3.00,  completionPerMillion: 15.00 },
  "claude-haiku-4-5-20251001":  { promptPerMillion: 0.80,  completionPerMillion: 4.00  },
  "gemini-2.0-flash": { promptPerMillion: 0.10,  completionPerMillion: 0.40  },
  // ...
};
 
function calculateCost(model: string, promptTokens: number, completionTokens: number): number {
  const pricing = modelPricing[model];
  if (!pricing) return 0;
  return (promptTokens / 1_000_000) * pricing.promptPerMillion
       + (completionTokens / 1_000_000) * pricing.completionPerMillion;
}

The cost is calculated at request time and attached to the trace. This gives us real-time cost visibility rather than waiting for the provider's billing dashboard to update.

Common Cost Surprises Token Tracing Reveals

With full tracing in production, patterns that were previously invisible become obvious.

The Retry Amplifier

A common pattern: retry logic that resends the full request on any 429 (rate limit) response. Under load, a single user request can generate 5-8 LLM calls. The retries often fail too -- hitting the same rate limit. This single pattern can account for 30% or more of monthly spend.
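
The fix is bounded retries with exponential backoff and jitter, so a 429 does not fan out into a burst of redundant calls. A minimal sketch, with illustrative attempt limits and delay caps:

// Bounded retry with exponential backoff and full jitter; limits are illustrative.
async function callWithBackoff<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number }).status;
      const retryable = status === 429 || status === 503;
      if (!retryable || attempt === maxAttempts - 1) throw err;
      // 1s, 2s, 4s ... capped at 30s, with jitter so retries do not synchronize
      const delayMs = Math.random() * Math.min(30_000, 1_000 * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("unreachable");
}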

The System Prompt Tax

Many services send 2,000+ token system prompts with every request. These prompts are often static -- the same for every call. Extracting common system prompts and using prompt caching (supported by Anthropic and OpenAI) can cut prompt token costs by 40% or more.
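
OpenAI applies prefix caching automatically for sufficiently long prompts, while Anthropic's API takes an explicit cache_control marker on the static block. A sketch of the latter with the Anthropic TypeScript SDK; the prompt text and model are illustrative, and exact details may vary by SDK version:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// The large static system prompt is sent as a cacheable block; on cache hits
// its tokens are billed at a reduced rate, and only the short per-request
// message is billed at the full prompt price.
const STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme. Follow these policies: ...";

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: "Summarize this ticket: ..." }],
});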

The Model Mismatch

A classification pipeline using GPT-4o to categorize support tickets into 8 categories is a common example. The task is simple pattern matching. Switching to GPT-4o-mini typically results in less than 1% accuracy loss while reducing cost by 94%.
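
One way to keep this from happening silently is to route by task type instead of hardcoding a model at each call site. A hypothetical routing table; the task names and defaults are illustrative:

// Hypothetical task-to-model routing; cheap models by default, expensive ones opt-in.
const modelByTask: Record<string, string> = {
  classification: "gpt-4o-mini",   // simple pattern matching
  extraction:     "gpt-4o-mini",
  summarization:  "gpt-4o-mini",
  reasoning:      "gpt-4o",        // reserve the expensive model for genuinely hard tasks
};

function resolveModel(task: string, override?: string): string {
  return override ?? modelByTask[task] ?? "gpt-4o-mini";
}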

The Embedding Leak

Embedding pipelines that re-embed documents that have not changed are a frequent source of waste. Without request-level tracing, this is invisible -- the embeddings endpoint does not return tokens in the same way as chat completions. Adding content hashing can cut embedding costs by 60% or more.
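
A sketch of the content-hash check, assuming an embeddingStore keyed by document ID and an embed helper that calls the provider; both are stand-ins:

import { createHash } from "node:crypto";

// Stand-ins for the real storage layer and embeddings client
declare const embeddingStore: {
  get(docId: string): Promise<{ contentHash: string; vector: number[] } | undefined>;
  set(docId: string, value: { contentHash: string; vector: number[] }): Promise<void>;
};
declare function embed(content: string): Promise<number[]>;

async function embedIfChanged(docId: string, content: string): Promise<number[]> {
  const contentHash = createHash("sha256").update(content).digest("hex");

  // Skip the embeddings call entirely when the content has not changed
  const existing = await embeddingStore.get(docId);
  if (existing && existing.contentHash === contentHash) return existing.vector;

  const vector = await embed(content);
  await embeddingStore.set(docId, { contentHash, vector });
  return vector;
}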

Real-Time Dashboards

The tracing data feeds into dashboards that show:

Cost Dashboard: Spend per hour, per model, per API key, per project. We set budget alerts at 80% of expected daily spend.

Latency Dashboard: P50/P95/P99 latency per provider and model. Time to first token distribution. We use this to make failover threshold decisions.

Cache Dashboard: Hit rate over time, estimated savings, cache size. This validates that our caching investment is paying off.

Error Dashboard: Error rate by type (rate limit, context length, content filter, timeout). Provider health at a glance.

Alerts That Matter

We run alerting on computed metrics, not raw values:

- Cost per hour exceeds 2x the trailing 7-day average → alert
- Cache hit rate drops below 50% of trailing average → alert
- P95 latency for any model exceeds 30 seconds → alert
- Error rate for any provider exceeds 5% over 15 minutes → alert
- Any single API key exceeds daily budget → alert + auto-throttle

The auto-throttle on budget exceeded is critical. Without it, a runaway process can burn through an entire month's budget in hours.
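
A sketch of the two checks that matter most here, evaluated on aggregated trace data every few minutes. The thresholds mirror the rules above; the spend inputs are assumed to come from the tracing pipeline:

// Illustrative alert checks; thresholds match the alerting rules above.
function costSpikeAlert(currentHourUsd: number, trailing7dHourlyAvgUsd: number, lastAlertAt?: Date): boolean {
  const inCooldown = lastAlertAt !== undefined &&
    Date.now() - lastAlertAt.getTime() < 15 * 60 * 1000;   // 15-minute cooldown (see hysteresis note below)
  return !inCooldown && currentHourUsd > 2 * trailing7dHourlyAvgUsd;
}

function shouldAutoThrottle(keySpendTodayUsd: number, keyDailyBudgetUsd: number): boolean {
  // Hard stop: once a key exceeds its daily budget, the gateway rejects further requests
  return keySpendTodayUsd >= keyDailyBudgetUsd;
}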

Implementation Lessons

Trace at the gateway, not the application. If you instrument individual services, you will miss calls, get inconsistent schemas, and have blind spots. The gateway sees everything.

Store raw token counts, calculate costs on read. Pricing changes. If you store pre-calculated costs and a provider retroactively adjusts pricing, your historical data is wrong. Store tokens and model identifiers; calculate costs at query time using a versioned pricing table.
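
A sketch of what calculating on read looks like with a versioned pricing table; the structure is illustrative, and the key point is that the pricing version in effect at request time is selected by timestamp:

interface PricingVersion {
  effectiveFrom: Date;
  prices: Record<string, { promptPerMillion: number; completionPerMillion: number }>;
}

// versions sorted newest-first; pick the one in effect when the request was made
function costAtQueryTime(
  versions: PricingVersion[],
  model: string,
  promptTokens: number,
  completionTokens: number,
  requestedAt: Date,
): number {
  const version = versions.find((v) => v.effectiveFrom <= requestedAt);
  const pricing = version?.prices[model];
  if (!pricing) return 0;
  return (promptTokens / 1_000_000) * pricing.promptPerMillion
       + (completionTokens / 1_000_000) * pricing.completionPerMillion;
}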

Separate streaming latency from total latency. A streaming request that delivers its first token in 200ms but takes 30 seconds to finish can still feel fast to the user; judged on total latency alone, it would be flagged as slow.
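
Capturing the two separately is straightforward when the gateway relays the stream. A sketch; the chunk shape is illustrative:

// Measures time to first token and total latency while relaying a streamed response.
async function measureStream(
  stream: AsyncIterable<{ text: string }>,
): Promise<{ timeToFirstTokenMs: number; latencyMs: number; output: string }> {
  const started = performance.now();
  let firstTokenAt: number | null = null;
  let output = "";

  for await (const chunk of stream) {
    if (firstTokenAt === null) firstTokenAt = performance.now();  // user-perceived responsiveness
    output += chunk.text;
  }

  const finished = performance.now();
  return {
    timeToFirstTokenMs: (firstTokenAt ?? finished) - started,
    latencyMs: finished - started,
    output,
  };
}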

Budget alerts need hysteresis. A spike that triggers an alert, recovers, and re-triggers creates alert fatigue. We use a 15-minute cooldown window.

The ROI of Tracing

In the scenario above, deploying tracing and acting on the findings can bring monthly LLM spend from $47,000 back to around $9,200. A typical breakdown:

  • Model right-sizing: ~$18,000 saved
  • Fixing retry amplification: ~$12,000 saved
  • Prompt caching: ~$5,000 saved
  • Embedding deduplication: ~$2,800 saved

A tracing pipeline typically costs about $200/month in compute and storage. That represents a 190x return.

Key Takeaway

If you are running LLM workloads in production without token-level tracing, you are guaranteed to be overspending. The question is not whether you have waste -- it is how much.

The fix is not complicated. Capture every request, count every token, calculate every cost, and surface it in real time. The patterns will reveal themselves.

See how we built this into LLM Observability.


Tags:
observability
llm
cost
