We Evaluated 12 LLM Observability Tools. Most of Them Do Not Matter.
A practical evaluation of LLM observability tools across tracing, cost tracking, quality monitoring, and prompt management. What matters, what is marketing, and what to actually look for.
Transactional Team
Mar 5, 2026
12 min read
Why This Evaluation Matters
Any team running LLM workloads in production needs observability: latency, costs, errors, quality. The market has exploded with options, making it difficult to separate substance from marketing.
Most of the tools available are fine. Some are good. A few are exceptional. But the majority are solving problems that do not matter as much as their marketing suggests, while ignoring the problems that actually cause production incidents.
Here is what a thorough evaluation reveals.
What LLM Observability Actually Needs to Do
Before evaluating tools, we defined what observability means for LLM applications. It is not the same as traditional application observability.
The Four Pillars
1. Tracing: Following a request through your LLM pipeline. Input, retrieval, context assembly, model call, output processing. Each step needs timing, inputs, outputs, and metadata.
2. Cost Tracking: Knowing what you are spending, broken down by model, feature, user, and time period. Token counts are the primitive. Dollar amounts are the output.
3. Quality Monitoring: Measuring whether the LLM is producing good outputs. This is the hardest pillar because "good" is subjective and context-dependent.
4. Prompt Management: Versioning, testing, and deploying prompts. This is more important than most teams realize, and less important than prompt management vendors claim.
Figure: LLM observability evaluation criteria, weighted. Integration effort 20%, trace completeness 20%, cost accuracy 15%, latency impact 15%, quality metrics 15%, alerting 10%, pricing model 5%.
The Three Architectural Approaches
Every tool we evaluated falls into one of three categories. The architecture determines what data they can capture, how much latency they add, and how deeply they integrate.
SDK-Based
You install a library, wrap your LLM calls, and the SDK captures telemetry and sends it to the tool's backend.
Pros: Deep integration, rich data capture, can instrument custom logic, works with any LLM provider.
Cons: Code changes required, potential vendor lock-in, adds dependency to your application, needs updates when you add new LLM integrations.
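The SDK approach usually amounts to wrapping your provider calls. A hedged sketch of the pattern, with a stub in place of a real provider call and an in-memory list standing in for the SDK's export buffer:

```python
import functools
import time

TELEMETRY = []  # stand-in for the SDK's export/flush buffer

def traced_llm_call(fn):
    """Hypothetical SDK-style wrapper: records timing, outcome, and
    sanitized arguments for every wrapped call, then forwards the result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        except Exception as exc:
            status = type(exc).__name__
            raise
        finally:
            TELEMETRY.append({
                "call": fn.__name__,
                "status": status,
                "latency_ms": (time.monotonic() - start) * 1000,
                # Never ship secrets with telemetry.
                "kwargs": {k: v for k, v in kwargs.items() if k != "api_key"},
            })
    return wrapper

@traced_llm_call
def complete(prompt: str, model: str = "gpt-4o") -> str:
    return f"echo: {prompt}"  # stand-in for a real provider call
```

The "works with any LLM provider" advantage follows from the shape: the wrapper is provider-agnostic, but every new call site must be decorated, which is the integration cost.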
Proxy-Based
You route your LLM API calls through a proxy that captures telemetry in transit.
Pros: No code changes beyond changing the API endpoint, works with any SDK or direct API calls, can add caching and rate limiting, lower integration effort.
Cons: Adds a network hop (latency), single point of failure risk, limited visibility into application-side processing, cannot instrument non-LLM steps.
Examples: Helicone proxy, Portkey, our own AI Gateway
Log-Based
You send structured logs or events to an analytics platform that processes them.
Pros: Most flexible, works with existing logging infrastructure, no proxy latency, can capture anything you log.
Cons: Most implementation effort, data quality depends on your logging discipline, no automatic capture.
Examples: Datadog LLM Observability, New Relic AI Monitoring, custom solutions on top of OpenTelemetry
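In the log-based approach, "data quality depends on your logging discipline" means you decide what each event carries. A minimal sketch of emitting one structured event per LLM call (field names are our own convention, not a vendor schema):

```python
import json
import logging
import sys
import time

logger = logging.getLogger("llm")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_llm_event(model: str, prompt_tokens: int, completion_tokens: int,
                  latency_ms: float, **extra) -> dict:
    """Emit one structured event per LLM call; the analytics backend
    (Datadog, New Relic, a warehouse) aggregates these downstream."""
    event = {
        "event": "llm_call",
        "ts": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        **extra,  # feature, user_id, trace_id, error category, ...
    }
    logger.info(json.dumps(event))
    return event
```

The flexibility cuts both ways: any field you log becomes queryable, and any field you forget is simply gone.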
What We Evaluated
We tested 12 tools against production-like workloads. Here is our evaluation framework.
Evaluation Criteria
| Criterion | Weight | Why It Matters |
|---|---|---|
| Integration effort | 20% | Time to first useful data |
| Trace completeness | 20% | Can we see the full request lifecycle? |
| Cost accuracy | 15% | Are cost calculations correct and granular? |
| Latency impact | 15% | How much overhead does it add? |
| Quality metrics | 15% | Can we measure output quality? |
| Alerting | 10% | Can it tell us when things break? |
| Pricing model | 5% | Is it affordable at scale? |
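The weights above combine into a single score per tool. A sketch of the arithmetic, with illustrative ratings (not our actual per-tool results):

```python
# Weights from the evaluation criteria table; they sum to 1.0.
WEIGHTS = {
    "integration_effort": 0.20,
    "trace_completeness": 0.20,
    "cost_accuracy": 0.15,
    "latency_impact": 0.15,
    "quality_metrics": 0.15,
    "alerting": 0.10,
    "pricing_model": 0.05,
}

def weighted_score(ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings (0-10 scale) into one score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)

# A tool rated 7 on everything scores exactly 7.
example = {c: 7.0 for c in WEIGHTS}
print(round(weighted_score(example), 2))  # → 7.0
```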
The Results, Summarized
We are not going to do a feature-by-feature comparison. Those comparisons are outdated within weeks. Instead, here are the categories of tools and what we learned from each.
Category 1: Full-Stack LLM Platforms
These tools try to do everything: tracing, evaluation, prompt management, dataset curation, fine-tuning support. They are the "all-in-one" play.
What works: If you are starting from zero and want a single vendor, these get you running fast. The integrated experience is genuinely useful. Seeing traces alongside evaluations alongside prompt versions in one UI saves context-switching.
What does not work: Jack of all trades, master of few. The tracing is good but not as deep as dedicated APM tools. The evaluation features are useful but less rigorous than dedicated eval frameworks. The prompt management is adequate but lacks the collaboration features teams need at scale.
Our take: Good for teams under 10 engineers working on a single AI product. Once you scale, you will outgrow specific features and start wishing you could swap components.
Category 2: Proxy-First Observability
These tools sit between you and your LLM provider, capturing everything in transit. They often add caching, rate limiting, and fallback routing on top of observability.
What works: Zero-code integration is real and valuable. Point your API calls at the proxy and you immediately get cost tracking, latency monitoring, and request logging. The operational features (caching, fallbacks) provide immediate value beyond observability.
What does not work: You only see the LLM API call. Everything that happens before and after (retrieval, context assembly, output processing) is invisible. For simple applications (prompt in, text out), this is fine. For RAG pipelines, agent loops, or multi-step workflows, you are flying blind on the most complex parts.
Our take: Excellent as a layer in your stack, not as your entire observability solution. We built our AI Gateway with this approach because the operational benefits are real. But we also added SDK-level instrumentation for the full picture.
Category 3: APM Extensions
Traditional APM vendors (Datadog, New Relic, Dynatrace) have added LLM-specific features to their existing platforms.
What works: If you already use these tools for application monitoring, the integration is natural. LLM traces appear alongside your HTTP traces, database queries, and infrastructure metrics. Correlation is powerful. When an LLM call is slow, you can see whether it is the model, the network, or your application logic.
What does not work: LLM-specific features feel bolted on. The UIs are designed for traditional observability and do not handle the nuances of LLM data well. Prompt content is treated as a log message, not a first-class artifact. Cost tracking is basic. Quality metrics are minimal.
Our take: If you are already paying for Datadog or New Relic, use their LLM features as a starting point. But you will likely need a dedicated tool for prompt management and quality monitoring.
Category 4: Evaluation-First Platforms
These tools focus primarily on LLM output quality: benchmarking, regression testing, human evaluation, automated scoring.
What works: They solve the hardest problem in LLM observability: measuring quality. Good evaluation tools let you define quality criteria, run automated assessments, track quality over time, and catch regressions before they reach production.
What does not work: They are not observability tools in the traditional sense. They do not help you debug production incidents, track costs, or monitor latency. They are a testing and quality assurance layer, not a monitoring layer.
Our take: Essential for any team that cares about output quality, which should be every team. But use them alongside observability tools, not instead of them.
What Actually Matters vs. What Is Marketing
After three months of evaluation, here is our honest assessment of what features matter and which are marketing fluff.
Features That Matter
Request-level cost tracking: Knowing your monthly LLM spend is table stakes. Knowing your per-request, per-feature, per-customer cost is what lets you make pricing decisions, identify abuse, and optimize. This needs to be accurate, which means handling streaming responses, cached tokens, and batch API pricing correctly.
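The cached-token point is where most naive cost calculators go wrong. A sketch of per-request cost that splits cached prompt tokens out at their discounted rate; the model name and per-million-token prices below are placeholders, since real prices vary by provider and change often:

```python
# Illustrative per-million-token prices, NOT real provider pricing.
PRICES = {
    "example-model": {"input": 2.50, "cached_input": 1.25, "output": 10.00},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Per-request cost in USD. Cached prompt tokens are billed at a
    discounted rate, so they must be split out of the input count,
    not lumped in at the full rate."""
    p = PRICES[model]
    uncached = prompt_tokens - cached_tokens
    return (
        uncached * p["input"]
        + cached_tokens * p["cached_input"]
        + completion_tokens * p["output"]
    ) / 1_000_000

# 1,000 prompt tokens (400 cached) + 100 output tokens:
# (600*2.50 + 400*1.25 + 100*10.00) / 1e6 = $0.003
cost = request_cost("example-model", 1000, 100, cached_tokens=400)
```

Tagging each request's cost with feature and customer IDs at this point is what makes the per-feature and per-customer breakdowns possible later.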
Latency distribution by model and endpoint: Not average latency. P50, P95, P99 by model, by prompt template, by time of day. LLM latency is highly variable and averages hide important patterns. A prompt that is fast 95% of the time but takes 30 seconds at P99 will cause timeout issues you will never find with averages.
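The averages-hide-the-tail problem is easy to demonstrate with the standard library. A sketch using `statistics.quantiles`, which returns 99 cut points when `n=100`, so index `k-1` is the k-th percentile:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 over a window of per-request latencies."""
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# 97 fast requests plus 3 pathological 30-second ones,
# like the P99 timeout case described above.
samples = [200.0] * 97 + [30_000.0] * 3
stats = latency_percentiles(samples)
print(round(statistics.mean(samples)))  # → 1094: the average hides the tail
print(stats["p50"], stats["p99"])       # → 200.0 30000.0
```

Compute these per model and per prompt template, not globally, or one slow template's tail disappears into everyone else's fast requests.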
Error categorization: Not just "the API returned 500." Was it a rate limit? A content filter? A timeout? A malformed response? Each error type has a different remediation. Tools that lump all errors together are not useful for debugging.
Trace correlation with application context: Seeing an LLM trace in isolation tells you what happened. Seeing it in the context of the user request, the retrieved documents, and the application state tells you why it happened. This is the difference between "the model was slow" and "the model was slow because the RAG pipeline retrieved 50 documents instead of 5 due to a low relevance threshold."
Features That Are Marketing
"AI-powered anomaly detection": Every observability tool now claims AI-powered anomaly detection. In practice, this means basic statistical thresholds with a GPT-generated summary. Set up proper alerts with explicit thresholds. You will catch more issues.
"Automatic prompt optimization": No tool can automatically optimize your prompts in a meaningful way. They can suggest changes based on A/B test results, but the actual prompt engineering requires domain knowledge and human judgment. Do not let a tool auto-modify your prompts in production.
"One-click compliance": Compliance requires documentation, process, and organizational commitment. No tool delivers compliance through a dashboard toggle. They can help with logging and audit trails, but the compliance work is on you.
"Unified LLM management": Managing multiple LLM providers through one dashboard sounds great. In practice, each provider has unique features, pricing models, and failure modes. Abstracting them into a unified interface loses the details that matter when you are debugging provider-specific issues.
What to Look for When Choosing
Questions to Ask
How does it handle streaming? Most LLM responses are streamed. If the tool only captures the final response, you lose timing granularity and cannot debug streaming-specific issues.
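The timing granularity that streaming-aware tools capture is mainly time-to-first-token versus total duration. A provider-agnostic sketch, where `chunks` stands in for any iterable of streamed text deltas:

```python
import time

def consume_stream(chunks):
    """Drain a streamed response while capturing time-to-first-token
    (TTFT) and total duration. `chunks` is any iterable of text deltas."""
    start = time.monotonic()
    first_token_ms = None
    parts = []
    for chunk in chunks:
        if first_token_ms is None:
            # TTFT is what the user perceives as responsiveness;
            # total duration is what your timeout budget must cover.
            first_token_ms = (time.monotonic() - start) * 1000
        parts.append(chunk)
    total_ms = (time.monotonic() - start) * 1000
    return "".join(parts), {"ttft_ms": first_token_ms, "total_ms": total_ms}

text, timing = consume_stream(iter(["Hel", "lo"]))
```

A tool that only logs the assembled final response cannot tell you whether a slow request spent its time waiting for the first token or dribbling out the rest, and those have different fixes.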
What is the data retention? LLM interactions contain potentially sensitive data. How long does the tool retain request and response content? Can you configure retention policies? Can you redact sensitive fields?
How does pricing scale? Some tools charge per trace, some per token, some per seat. Model your expected volume and calculate the actual cost at scale. We have seen tools that are cheap for evaluation become prohibitively expensive in production.
Can I export my data? If you cannot get your data out of the tool, you are locked in. Look for OpenTelemetry support, data export APIs, and standard formats.
Does it support my full pipeline? If you are building RAG, agents, or multi-step workflows, can the tool trace the entire pipeline or just the LLM call?
How Transactional Approaches This
Transactional's LLM Observability takes a hybrid approach. The AI Gateway captures request-level data as a proxy (cost, latency, errors, model routing). The SDK adds application-level context (RAG retrieval, agent tool calls, business logic). Both feed into a unified trace view.
The focus is on the features that actually matter: accurate cost tracking, latency distributions, error categorization, and trace correlation, rather than checkbox features like AI-powered anomaly detection or automatic prompt optimization.
The Takeaway
LLM observability is a young market with a lot of noise. Most tools are adequate. Few are exceptional. The right choice depends on your architecture (SDK vs. proxy vs. log), your scale (hobby project vs. production workload), and what you already have in your stack.
Start with cost tracking and error monitoring. Those two capabilities will prevent the most production incidents. Add tracing when you need to debug specific requests. Add quality monitoring when you have enough traffic to make statistical assessments meaningful. Skip the features that sound impressive in demos but do not help you when your pager goes off at 3 AM.