Performance Analysis

Analyzing and optimizing latency in your AI applications.

Overview

Performance analysis helps you understand and optimize the latency of your AI applications. Track response times, identify bottlenecks, and improve user experience.

Key Performance Metrics

Duration

Total time from trace start to end.

Includes:

  • LLM generation time
  • Processing time
  • Network latency
  • Queue time

Time to First Token (TTFT)

For streaming responses: the time until the first token arrives.

Why it matters:

  • Perceived responsiveness
  • User experience
  • Streaming optimization
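TTFT can be measured client-side for any async-iterable token stream (the shape both the OpenAI and Anthropic streaming clients return). A minimal sketch; `measureTTFT` is an illustrative helper, not part of any SDK:

```javascript
// Measure time-to-first-token (TTFT) and total duration for any
// async-iterable stream of tokens/chunks.
async function measureTTFT(stream) {
  const start = Date.now();
  let ttft = null;
  const tokens = [];
  for await (const token of stream) {
    if (ttft === null) ttft = Date.now() - start; // first token arrived
    tokens.push(token);
  }
  return { ttft, total: Date.now() - start, tokens };
}
```

Log `ttft` alongside the trace to compare perceived responsiveness across models and prompts.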

Generation Latency

Time for individual LLM calls.

Breakdown:

  • Request preparation
  • Network round-trip
  • Provider processing
  • Response handling

Performance Dashboard

Latency Overview

View latency distribution:

  • p50 (median): half of requests complete within this time
  • p95: 95% of requests complete within this time
  • p99: 99% of requests complete within this time
  • Max: the single slowest request
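The same percentiles can be computed from raw trace durations. A small sketch using the nearest-rank method (the exact interpolation a dashboard uses may differ):

```javascript
// Nearest-rank percentile: the value at or below which p% of
// durations (in ms) fall.
function percentile(durations, p) {
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

For example, `percentile(durations, 95)` over a day of traces gives the p95 latency for that day.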

Latency by Model

Compare model performance:

| Model             | p50  | p95  | p99 |
|-------------------|------|------|-----|
| gpt-4o            | 1.5s | 3.5s | 8s  |
| gpt-4o-mini       | 0.8s | 1.8s | 4s  |
| claude-3-5-sonnet | 1.2s | 2.8s | 6s  |
| claude-3-haiku    | 0.5s | 1.2s | 3s  |

Latency Trend

Track performance over time:

  • Hourly/daily trends
  • Week-over-week comparison
  • Anomaly detection

Identifying Bottlenecks

Trace Timeline

View time breakdown:

Trace: rag-query (3.2s total)
├── Span: embed-query (200ms) ████
├── Span: vector-search (400ms) ████████
├── Span: format-context (50ms) █
└── Generation: generate (2.5s) ██████████████████████████████████████████

Slow Traces

Find slowest traces:

  1. Go to Traces
  2. Sort by Duration (descending)
  3. Review top traces
  4. Analyze timeline for bottlenecks

P95 Analysis

Focus on the worst experiences:

  1. Filter traces with duration > p95
  2. Look for common patterns
  3. Identify root causes

Common Performance Issues

1. Long Context = Slow Response

Problem: Large prompts increase latency.

Solution:

// Trim context to essential information
const context = documents
  .slice(0, 5)  // Limit document count
  .map(d => d.summary)  // Use summaries not full content
  .join('\n');

2. Cold Starts

Problem: First request is slow.

Solution:

  • Keep connections warm
  • Use connection pooling
  • Implement health check pings

3. Rate Limiting Delays

Problem: Rate limits cause queuing.

Solution:

  • Implement request queuing
  • Use multiple API keys
  • Configure fallback providers
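A client-side queue keeps bursts below the provider's limit so requests wait locally instead of failing remotely. A minimal concurrency-limiter sketch; `createLimiter` is an illustrative helper, not a library API:

```javascript
// Cap the number of in-flight requests; excess calls queue up and
// start as earlier ones finish.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => { active--; next(); });
  };
  // Returns a wrapper: limit(() => callProvider(...)) resolves with
  // the call's result once a slot is free.
  return fn => new Promise((resolve, reject) => {
    queue.push({ fn, resolve, reject });
    next();
  });
}
```

Production systems usually add retry with exponential backoff on 429 responses on top of a limiter like this.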

4. Network Latency

Problem: High network round-trip time.

Solution:

  • Deploy closer to provider regions
  • Use regional endpoints
  • Enable connection reuse

5. Sequential Operations

Problem: Operations run sequentially when they could parallelize.

Solution:

// Bad: Sequential
const embedding = await getEmbedding(query);
const userProfile = await getUserProfile(userId);
 
// Good: Parallel
const [embedding, userProfile] = await Promise.all([
  getEmbedding(query),
  getUserProfile(userId),
]);

Optimization Strategies

1. Model Selection for Speed

Choose faster models when appropriate:

| Use Case        | Fast Model     | When to Use      |
|-----------------|----------------|------------------|
| Quick responses | gpt-4o-mini    | Simple questions |
| Streaming       | Any            | User-facing      |
| Classification  | claude-3-haiku | Binary decisions |
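A routing layer applying these rules might look like the following sketch. The `task` field and the length threshold are illustrative assumptions, not a prescribed heuristic:

```javascript
// Route each request to the cheapest model that can handle it,
// falling back to the strongest model for complex prompts.
function pickModel(request) {
  if (request.task === 'classification') return 'claude-3-haiku';
  if (request.prompt.length < 500) return 'gpt-4o-mini'; // simple question
  return 'gpt-4o';
}
```

Tracking latency per route in your traces shows whether the routing rules actually pay off.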

2. Prompt Length Optimization

Reduce prompt size:

// Before: 2000 tokens, 2.5s
const prompt = fullContext + fullInstructions + examples;
 
// After: 500 tokens, 0.8s
const prompt = summarizedContext + conciseInstructions;

3. Caching for Speed

Enable caching for instant responses:

First request: 2.5s (cache miss)
Second request: 50ms (cache hit) - 50x faster!
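Under the hood, a response cache is just a keyed store with a TTL. A minimal in-memory sketch; production caches typically key on model + prompt + parameters rather than a raw string:

```javascript
// In-memory cache with per-entry expiry. Entries past their TTL
// read as misses.
function createCache(ttlMs) {
  const store = new Map();
  return {
    get(key) {
      const hit = store.get(key);
      if (!hit || Date.now() > hit.expires) return undefined;
      return hit.value;
    },
    set(key, value) {
      store.set(key, { value, expires: Date.now() + ttlMs });
    },
  };
}
```

Check the cache before calling the provider; on a miss, call the model and store the response.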

4. Streaming for Perception

Use streaming to improve perceived speed:

// Without streaming: 2.5s wait, then full response
// With streaming: 200ms to first token, progressive display
 
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true,  // Enable streaming
});

5. Pre-computation

Pre-compute when possible:

// Pre-compute embeddings for common queries at startup
const embeddingCache = new Map();
for (const q of COMMON_QUERIES) {
  embeddingCache.set(q, await getEmbedding(q));
}
// At request time: cache lookup instead of an API call
const embedding = embeddingCache.get(query) ?? await getEmbedding(query);
// The same pattern applies to document summaries and frequent responses

6. Request Batching

Batch multiple operations:

// Instead of 10 separate embedding requests:
const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: queries,  // Batch all queries
});

Setting Performance Alerts

Latency Alert

Alert when latency exceeds threshold:

  1. Go to Settings > Alerts
  2. Create alert:
    • Metric: p95 Latency
    • Threshold: > 5 seconds
    • Window: 5 minutes

TTFT Alert

Alert on slow first token:

metric: ttft_p95
threshold: "> 1 second"
window: 5 minutes

Error Rate Alert

Alert on errors (often related to timeouts):

metric: error_rate
threshold: "> 5%"
window: 5 minutes

Performance Testing

Load Testing

Test performance under load:

  1. Gradually increase request rate
  2. Monitor latency percentiles
  3. Find breaking point
  4. Plan capacity
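The ramp-up in steps 1-2 can be scripted. A toy sketch, where `sendRequest` stands in for your actual traced call and the returned durations feed percentile analysis:

```javascript
// Issue `totalRequests` calls at roughly `ratePerSec`, recording
// each request's duration in ms.
async function loadStep(sendRequest, ratePerSec, totalRequests) {
  const durations = [];
  const inflight = [];
  for (let i = 0; i < totalRequests; i++) {
    inflight.push((async () => {
      const start = Date.now();
      await sendRequest();
      durations.push(Date.now() - start);
    })());
    await new Promise(r => setTimeout(r, 1000 / ratePerSec));
  }
  await Promise.all(inflight);
  return durations;
}
```

Run `loadStep` at increasing rates and watch where p95/p99 start to climb; that knee is your capacity limit.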

Baseline Establishment

Establish performance baselines:

| Metric     | Baseline | Alert Threshold |
|------------|----------|-----------------|
| p50        | 1.5s     | 3s              |
| p95        | 3.5s     | 7s              |
| p99        | 8s       | 15s             |
| Error rate | 0.5%     | 2%              |

Performance Reports

Weekly Performance Report

Performance Report - Week 3, 2024

Latency Summary:
- p50: 1.4s (↓ 0.1s from last week)
- p95: 3.2s (↓ 0.3s from last week)
- p99: 7.5s (same as last week)

Improvements:
- Caching increased hit rate to 45%
- Prompt optimization reduced avg tokens by 20%

Issues:
- Tuesday 2pm: Spike to 15s p99 (provider issue)
- 5% of gpt-4o requests timing out

Recommendations:
- Add fallback for gpt-4o timeouts
- Increase cache TTL to improve hit rate

Next Steps