We Were Flying Blind on LLM Costs Until We Started Tracing Every Token
How we built token-level tracing to gain visibility into LLM costs, latency, and performance across providers. Architecture of the observability pipeline and the cost surprises we caught.
Transactional Team
Jan 30, 2026
10 min read
$47,000 in One Month. A Common Wake-Up Call.
Consider a scenario that plays out at many companies: a team expects roughly $8,000 per month in LLM costs based on request volume. Then the invoice arrives: $47,000.
The culprit is often a single feature. A content summarization pipeline shipped weeks prior, using GPT-4 where GPT-4o-mini would have been sufficient, sending full documents instead of truncated inputs, and retrying failed requests without exponential backoff -- meaning rate-limited requests compound into dozens of redundant calls.
None of this is visible in traditional monitoring. Standard APM tracks HTTP status codes and response times. It has no concept of tokens, model selection, or per-request cost. Running a production AI system without token-level tracing is the observability equivalent of checking the electricity bill once a month.
The ROI of Token-Level Tracing
- $47K -- Monthly Spend Before Tracing
- $9.2K -- Monthly Spend After Tracing
- 190x -- Return on Tracing Investment
- $200/mo -- Tracing Pipeline Cost
What Traditional APM Misses
Standard application monitoring tools were built for request-response architectures. They track latency, error rates, throughput. For LLM applications, these metrics are necessary but nowhere near sufficient.
Here is what you actually need to know:
| Metric | Why It Matters |
| --- | --- |
| Tokens per request (prompt + completion) | Directly determines cost |
| Cost per request in USD | Aggregates to actual spend |
| Latency per provider per model | Performance comparison |
| Time to first token (TTFT) | User-perceived responsiveness |
| Cache hit rate | Cost savings validation |
| Failover frequency | Provider reliability |
| Error rate by type | Rate limits vs. content filters vs. model errors |
| Token efficiency | Are you sending too many tokens for the output you need? |
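Most of these metrics derive from just two raw numbers per request: prompt tokens and completion tokens. A minimal sketch of the per-request cost metric, with illustrative prices (USD per 1M tokens -- check your provider's current pricing page):

```python
# Illustrative per-1M-token prices; real prices change and vary by provider.
PRICING = {
    "gpt-4o":      {"prompt": 2.50, "completion": 10.00},
    "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for a single request, given its token counts."""
    p = PRICING[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

For example, a gpt-4o request with 3,000 prompt tokens and 500 completion tokens costs $0.0125 at these rates -- small per request, but it compounds fast at volume.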
Architecture of the Tracing Pipeline
Our observability system captures every LLM request at the gateway level. Here is the high-level flow:
Application Request
|
v
AI Gateway
|
┌────┴────┐
│ Request │──── Capture: model, provider, prompt tokens,
│ Phase │ timestamp, request hash, user context
└────┬────┘
|
v
LLM Provider
|
┌────┴─────┐
│ Response │──── Capture: completion tokens, latency,
│ Phase │ TTFT, finish reason, status
└────┬─────┘
|
v
Cost Calculator ──── Model-specific pricing lookup
|
v
Structured Log ──── Written to analytics pipeline
|
v
┌──────────┬──────────┐
│ Real-time│ Batch │
│ Dashboard│ Analytics│
└──────────┴──────────┘
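The request/response capture phases above can be sketched as a gateway wrapper. This is a minimal, hypothetical version: `provider_fn` stands in for the actual provider call and is assumed to return the response text plus token counts; the `print` stands in for the analytics pipeline writer.

```python
import hashlib
import json
import time
import uuid

def traced_call(provider_fn, model: str, provider: str, prompt: str, user_id: str):
    """Hypothetical gateway wrapper: capture the request phase, call the
    provider, capture the response phase, and emit a structured trace."""
    # Request phase: model, provider, request hash, user context, timestamp
    trace = {
        "trace_id": str(uuid.uuid4()),
        "model": model,
        "provider": provider,
        "request_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "user_id": user_id,
        "ts": time.time(),
    }
    start = time.monotonic()
    text, prompt_tokens, completion_tokens = provider_fn(prompt)
    # Response phase: token counts, latency, status
    trace.update({
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "status": "ok",
    })
    print(json.dumps(trace))  # stand-in for the structured-log writer
    return text
```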
The Trace Record
Every LLM interaction produces a trace record: one row capturing the model, provider, token counts, latencies, finish reason, user context, and computed cost.
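A sketch of what such a record might look like (field names, model id, and prices are illustrative, not a fixed schema):

```json
{
  "trace_id": "a1b2c3d4",
  "timestamp": "2026-01-30T14:22:07Z",
  "model": "gpt-4o-mini",
  "provider": "openai",
  "prompt_tokens": 2140,
  "completion_tokens": 312,
  "latency_ms": 1840,
  "ttft_ms": 210,
  "finish_reason": "stop",
  "status": 200,
  "cache_hit": false,
  "request_hash": "9f8e7d6c5b4a3f2e",
  "user_context": { "api_key_id": "key_42", "project": "summarizer" },
  "cost_usd": 0.000508
}
```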
The cost is calculated at request time and attached to the trace. This gives us real-time cost visibility rather than waiting for the provider's billing dashboard to update.
Common Cost Surprises Token Tracing Reveals
With full tracing in production, patterns that were previously invisible become obvious.
The Retry Amplifier
A common pattern: retry logic that immediately resends the full request on any 429 (rate limit) response. Under load, a single user request can generate 5-8 LLM calls. The retries often fail too -- hitting the same rate limit. This single pattern can account for 30% or more of monthly spend.
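The fix is exponential backoff with jitter, so retries spread out instead of hammering the same rate limit. A minimal sketch, where `RateLimitError` is a stand-in for your provider SDK's 429 exception:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's 429 exception."""

def call_with_backoff(fn, max_retries: int = 5):
    """Retry with exponential backoff plus jitter instead of immediate resends."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the error
            # 1s, 2s, 4s, 8s ... plus jitter so concurrent retries desynchronize
            time.sleep(2 ** attempt + random.uniform(0, 1))
```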
The System Prompt Tax
Many services send 2,000+ token system prompts with every request. These prompts are often static -- the same for every call. Extracting common system prompts and using prompt caching (supported by Anthropic and OpenAI) can cut prompt token costs by 40% or more.
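With Anthropic's API, marking a static system prompt as cacheable is a small request change. This sketch only builds the request payload; the `cache_control` block shape follows Anthropic's prompt-caching documentation, and the model id is illustrative -- verify both against the current docs before relying on them:

```python
# Static system prompt, identical on every call -- a caching candidate.
SYSTEM_PROMPT = "You are a support-ticket summarizer. " * 100

def build_request(user_message: str) -> dict:
    """Assemble a messages-API payload with the system prompt marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # illustrative model id
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Cached across requests: full price on the first write,
                # a reduced rate on subsequent cache hits.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```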
The Model Mismatch
A classification pipeline using GPT-4o to categorize support tickets into 8 categories is a common example. The task is simple pattern matching. Switching to GPT-4o-mini typically results in less than 1% accuracy loss while reducing cost by 94%.
The Embedding Leak
Embedding pipelines that re-embed documents that have not changed are a frequent source of waste. Without request-level tracing, this is invisible -- the embeddings endpoint does not return tokens in the same way as chat completions. Adding content hashing can cut embedding costs by 60% or more.
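Content hashing for embeddings can be as simple as keying a cache on a SHA-256 of the document text. A minimal sketch, where `embed_fn` stands in for the actual embeddings-endpoint call:

```python
import hashlib

# Maps content hash -> embedding vector. In production this would be a
# persistent store (e.g. a database table), not an in-memory dict.
_embedding_cache: dict = {}

def embed_with_dedup(text: str, embed_fn):
    """Skip re-embedding unchanged content by keying on a content hash."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)  # only called for new content
    return _embedding_cache[key]
```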
Real-Time Dashboards
The tracing data feeds into dashboards that show:
Cost Dashboard: Spend per hour, per model, per API key, per project. We set budget alerts at 80% of expected daily spend.
Latency Dashboard: P50/P95/P99 latency per provider and model. Time to first token distribution. We use this to make failover threshold decisions.
Cache Dashboard: Hit rate over time, estimated savings, cache size. This validates that our caching investment is paying off.
Error Dashboard: Error rate by type (rate limit, context length, content filter, timeout). Provider health at a glance.
Alerts That Matter
We run alerting on computed metrics, not raw values:
- Cost per hour exceeds 2x the trailing 7-day average → alert
- Cache hit rate drops below 50% of trailing average → alert
- P95 latency for any model exceeds 30 seconds → alert
- Error rate for any provider exceeds 5% over 15 minutes → alert
- Any single API key exceeds daily budget → alert + auto-throttle
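The rules above can be evaluated as plain code over a snapshot of computed metrics. A sketch with illustrative field names:

```python
def fired_alerts(m: dict) -> list:
    """Evaluate the alert rules on a snapshot of computed metrics.
    Field names are illustrative, not a fixed schema."""
    alerts = []
    if m["cost_per_hour"] > 2 * m["cost_per_hour_7d_avg"]:
        alerts.append("cost_spike")
    if m["cache_hit_rate"] < 0.5 * m["cache_hit_rate_avg"]:
        alerts.append("cache_degraded")
    if m["p95_latency_s"] > 30:
        alerts.append("latency")
    if m["error_rate_15m"] > 0.05:
        alerts.append("provider_errors")
    if m["key_daily_spend"] > m["key_daily_budget"]:
        alerts.append("budget_exceeded")  # also triggers auto-throttle
    return alerts
```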
The auto-throttle on budget exceeded is critical. Without it, a runaway process can burn through an entire month's budget in hours.
Implementation Lessons
Trace at the gateway, not the application. If you instrument individual services, you will miss calls, get inconsistent schemas, and have blind spots. The gateway sees everything.
Store raw token counts, calculate costs on read. Pricing changes. If you store pre-calculated costs and a provider retroactively adjusts pricing, your historical data is wrong. Store tokens and model identifiers; calculate costs at query time using a versioned pricing table.
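A versioned pricing table can be a sorted list of (effective-from, price) rows per model, with a lookup at the trace's timestamp. Prices and dates below are illustrative:

```python
import bisect

# (effective_unix_ts, usd per 1M prompt tokens, usd per 1M completion tokens),
# sorted by effective timestamp. Numbers are illustrative.
PRICE_HISTORY = {
    "gpt-4o-mini": [
        (0,          0.30, 1.20),   # original price
        (1735689600, 0.15, 0.60),   # price cut effective 2025-01-01 UTC
    ],
}

def cost_at(model: str, ts: float, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute cost on read using the pricing row in effect at the trace timestamp."""
    rows = PRICE_HISTORY[model]
    i = bisect.bisect_right([r[0] for r in rows], ts) - 1
    _, p_in, p_out = rows[i]
    return (prompt_tokens * p_in + completion_tokens * p_out) / 1_000_000
```

If a provider adjusts pricing, you append a row; every historical trace recomputes correctly with no backfill.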
Separate streaming latency from total latency. A streaming request that delivers its first token in 200ms but takes 30 seconds to complete feels fast to the user. Judged on total latency alone, it would be flagged as slow.
Budget alerts need hysteresis. A spike that triggers an alert, recovers, and re-triggers creates alert fatigue. We use a 15-minute cooldown window.
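The cooldown can live in a small stateful wrapper around alert emission. A minimal sketch:

```python
import time

class CooldownAlert:
    """Suppress re-fires of the same alert within a cooldown window."""

    def __init__(self, cooldown_s: float = 900):  # 15 minutes
        self.cooldown_s = cooldown_s
        self._last_fired = {}

    def fire(self, name: str, now=None) -> bool:
        """Return True if the alert should actually be emitted."""
        now = time.time() if now is None else now
        last = self._last_fired.get(name)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down; swallow the re-trigger
        self._last_fired[name] = now
        return True
```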
The ROI of Tracing
In the scenario above, deploying tracing and acting on the findings can bring monthly LLM spend from $47,000 back to around $9,200. A typical breakdown:
- Model right-sizing: ~$18,000 saved
- Fixing retry amplification: ~$12,000 saved
- Prompt caching: ~$5,000 saved
- Embedding deduplication: ~$2,800 saved
A tracing pipeline typically costs about $200/month in compute and storage. That represents a 190x return.
Key Takeaway
If you are running LLM workloads in production without token-level tracing, you are almost certainly overspending. The question is not whether you have waste -- it is how much.
The fix is not complicated. Capture every request, count every token, calculate every cost, and surface it in real time. The patterns will reveal themselves.