Performance Analysis

Analyzing and optimizing latency in your AI applications.

Overview

Performance analysis helps you understand and optimize the latency of your AI applications. Track response times, identify bottlenecks, and improve user experience.

Key Performance Metrics

Duration

Total time from trace start to end.

Includes:

  • LLM generation time
  • Processing time
  • Network latency
  • Queue time

Time to First Token (TTFT)

For streaming responses: the time until the first token arrives.

Why it matters:

  • Perceived responsiveness
  • User experience
  • Streaming optimization
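TTFT can be measured client-side for any async-iterable token stream (the shape both the OpenAI and Anthropic streaming clients return). A minimal sketch; `measureTTFT` is an illustrative helper, not part of any SDK:

```javascript
// Measure time-to-first-token (TTFT) and total duration for any
// async-iterable stream of tokens/chunks.
async function measureTTFT(stream) {
  const start = Date.now();
  let ttft = null;
  const tokens = [];
  for await (const token of stream) {
    if (ttft === null) ttft = Date.now() - start; // first token arrived
    tokens.push(token);
  }
  return { ttft, total: Date.now() - start, tokens };
}
```

Log `ttft` alongside the trace to compare perceived responsiveness across models and prompts.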

Generation Latency

Time for individual LLM calls.

Breakdown:

  • Request preparation
  • Network round-trip
  • Provider processing
  • Response handling

Performance Dashboard

Latency Overview

View latency distribution:

  • p50 (median): half of requests complete within this time
  • p95: 95% of requests complete within this time
  • p99: 99% of requests complete within this time
  • Max: the single slowest request
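The same percentiles can be computed from raw trace durations. A small sketch using the nearest-rank method (the exact interpolation a dashboard uses may differ):

```javascript
// Nearest-rank percentile: the value at or below which p% of
// durations (in ms) fall.
function percentile(durations, p) {
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

For example, `percentile(durations, 95)` over a day of traces gives the p95 latency for that day.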

Latency by Model

Compare model performance:

| Model             | p50  | p95  | p99 |
|-------------------|------|------|-----|
| gpt-4o            | 1.5s | 3.5s | 8s  |
| gpt-4o-mini       | 0.8s | 1.8s | 4s  |
| claude-3-5-sonnet | 1.2s | 2.8s | 6s  |
| claude-3-haiku    | 0.5s | 1.2s | 3s  |

Latency Trend

Track performance over time:

  • Hourly/daily trends
  • Week-over-week comparison
  • Anomaly detection

Identifying Bottlenecks

Trace Timeline

View time breakdown:

Trace: rag-query (3.2s total)
├── Span: embed-query (200ms) ████
├── Span: vector-search (400ms) ████████
├── Span: format-context (50ms) █
└── Generation: generate (2.5s) ██████████████████████████████████████████

Slow Traces

Find slowest traces:

  1. Go to Traces
  2. Sort by Duration (descending)
  3. Review top traces
  4. Analyze timeline for bottlenecks

P95 Analysis

Focus on the worst experiences:

  1. Filter traces with duration > p95
  2. Look for common patterns
  3. Identify root causes

Common Performance Issues

1. Long Context = Slow Response

Problem: Large prompts increase latency.

Solution:

// Trim context to essential information
const context = documents
  .slice(0, 5)  // Limit document count
  .map(d => d.summary)  // Use summaries not full content
  .join('\n');

2. Cold Starts

Problem: First request is slow.

Solution:

  • Keep connections warm
  • Use connection pooling
  • Implement health check pings

3. Rate Limiting Delays

Problem: Rate limits cause queuing.

Solution:

  • Implement request queuing
  • Use multiple API keys
  • Configure fallback providers
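A client-side queue keeps bursts below the provider's limit so requests wait locally instead of failing remotely. A minimal concurrency-limiter sketch; `createLimiter` is an illustrative helper, not a library API:

```javascript
// Cap the number of in-flight requests; excess calls queue up and
// start as earlier ones finish.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => { active--; next(); });
  };
  // Returns a wrapper: limit(() => callProvider(...)) resolves with
  // the call's result once a slot is free.
  return fn => new Promise((resolve, reject) => {
    queue.push({ fn, resolve, reject });
    next();
  });
}
```

Production systems usually add retry with exponential backoff on 429 responses on top of a limiter like this.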

4. Network Latency

Problem: High network round-trip time.

Solution:

  • Deploy closer to provider regions
  • Use regional endpoints
  • Enable connection reuse

5. Sequential Operations

Problem: Operations run sequentially when they could parallelize.

Solution:

// Bad: Sequential
const embedding = await getEmbedding(query);
const userProfile = await getUserProfile(userId);
 
// Good: Parallel
const [embedding, userProfile] = await Promise.all([
  getEmbedding(query),
  getUserProfile(userId),
]);

Optimization Strategies

1. Model Selection for Speed

Choose faster models when appropriate:

| Use Case        | Fast Model     | When to Use      |
|-----------------|----------------|------------------|
| Quick responses | gpt-4o-mini    | Simple questions |
| Streaming       | Any            | User-facing      |
| Classification  | claude-3-haiku | Binary decisions |
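A routing layer applying these rules might look like the following sketch. The `task` field and the length threshold are illustrative assumptions, not a prescribed heuristic:

```javascript
// Route each request to the cheapest model that can handle it,
// falling back to the strongest model for complex prompts.
function pickModel(request) {
  if (request.task === 'classification') return 'claude-3-haiku';
  if (request.prompt.length < 500) return 'gpt-4o-mini'; // simple question
  return 'gpt-4o';
}
```

Tracking latency per route in your traces shows whether the routing rules actually pay off.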

2. Prompt Length Optimization

Reduce prompt size:

// Before: 2000 tokens, 2.5s
const prompt = fullContext + fullInstructions + examples;
 
// After: 500 tokens, 0.8s
const prompt = summarizedContext + conciseInstructions;

3. Caching for Speed

Enable caching for instant responses:

First request: 2.5s (cache miss)
Second request: 50ms (cache hit) - 50x faster!
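Under the hood, a response cache is just a keyed store with a TTL. A minimal in-memory sketch; production caches typically key on model + prompt + parameters rather than a raw string:

```javascript
// In-memory cache with per-entry expiry. Entries past their TTL
// read as misses.
function createCache(ttlMs) {
  const store = new Map();
  return {
    get(key) {
      const hit = store.get(key);
      if (!hit || Date.now() > hit.expires) return undefined;
      return hit.value;
    },
    set(key, value) {
      store.set(key, { value, expires: Date.now() + ttlMs });
    },
  };
}
```

Check the cache before calling the provider; on a miss, call the model and store the response.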

4. Streaming for Perception

Use streaming to improve perceived speed:

// Without streaming: 2.5s wait, then full response
// With streaming: 200ms to first token, progressive display
 
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true,  // Enable streaming
});

5. Pre-computation

Pre-compute when possible:

// Pre-compute embeddings for common queries at startup
const embeddingCache = new Map();
for (const q of COMMON_QUERIES) {
  embeddingCache.set(q, await getEmbedding(q));
}
// At request time: cache lookup instead of an API call
const embedding = embeddingCache.get(query) ?? await getEmbedding(query);
// The same pattern applies to document summaries and frequent responses

6. Request Batching

Batch multiple operations:

// Instead of 10 separate embedding requests:
const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: queries,  // Batch all queries
});

Setting Performance Alerts

Latency Alert

Alert when latency exceeds threshold:

  1. Go to Settings > Alerts
  2. Create alert:
    • Metric: p95 Latency
    • Threshold: > 5 seconds
    • Window: 5 minutes

TTFT Alert

Alert on slow first token:

metric: ttft_p95
threshold: "> 1 second"
window: 5 minutes

Error Rate Alert

Alert on errors (often related to timeouts):

metric: error_rate
threshold: "> 5%"
window: 5 minutes

Performance Testing

Load Testing

Test performance under load:

  1. Gradually increase request rate
  2. Monitor latency percentiles
  3. Find breaking point
  4. Plan capacity
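The ramp-up in steps 1-2 can be scripted. A toy sketch, where `sendRequest` stands in for your actual traced call and the returned durations feed percentile analysis:

```javascript
// Issue `totalRequests` calls at roughly `ratePerSec`, recording
// each request's duration in ms.
async function loadStep(sendRequest, ratePerSec, totalRequests) {
  const durations = [];
  const inflight = [];
  for (let i = 0; i < totalRequests; i++) {
    inflight.push((async () => {
      const start = Date.now();
      await sendRequest();
      durations.push(Date.now() - start);
    })());
    await new Promise(r => setTimeout(r, 1000 / ratePerSec));
  }
  await Promise.all(inflight);
  return durations;
}
```

Run `loadStep` at increasing rates and watch where p95/p99 start to climb; that knee is your capacity limit.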

Baseline Establishment

Establish performance baselines:

| Metric     | Baseline | Alert Threshold |
|------------|----------|-----------------|
| p50        | 1.5s     | 3s              |
| p95        | 3.5s     | 7s              |
| p99        | 8s       | 15s             |
| Error rate | 0.5%     | 2%              |

Performance Reports

Weekly Performance Report

Performance Report - Week 3, 2024

Latency Summary:
- p50: 1.4s (↓ 0.1s from last week)
- p95: 3.2s (↓ 0.3s from last week)
- p99: 7.5s (same as last week)

Improvements:
- Caching increased hit rate to 45%
- Prompt optimization reduced avg tokens by 20%

Issues:
- Tuesday 2pm: Spike to 15s p99 (provider issue)
- 5% of gpt-4o requests timing out

Recommendations:
- Add fallback for gpt-4o timeouts
- Increase cache TTL to improve hit rate

Next Steps