Performance Analysis
Analyzing and optimizing latency in your AI applications.
Overview
Performance analysis helps you understand and optimize the latency of your AI applications. Track response times, identify bottlenecks, and improve user experience.
Key Performance Metrics
Duration
Total time from trace start to end.
Includes:
- LLM generation time
- Processing time
- Network latency
- Queue time
Time to First Token (TTFT)
For streaming: time until first token arrives.
Why it matters:
- Perceived responsiveness
- User experience
- Streaming optimization
Generation Latency
Time for individual LLM calls.
Breakdown:
- Request preparation
- Network round-trip
- Provider processing
- Response handling
Performance Dashboard
Latency Overview
View latency distribution:
- p50 (median): half of requests complete within this time
- p95: 95% of requests complete within this time
- p99: 99% of requests complete within this time
- Max: the slowest single request
Latency by Model
Compare model performance:
| Model | p50 | p95 | p99 |
|---|---|---|---|
| gpt-4o | 1.5s | 3.5s | 8s |
| gpt-4o-mini | 0.8s | 1.8s | 4s |
| claude-3-5-sonnet | 1.2s | 2.8s | 6s |
| claude-3-haiku | 0.5s | 1.2s | 3s |
Latency Trend
Track performance over time:
- Hourly/daily trends
- Week-over-week comparison
- Anomaly detection
Identifying Bottlenecks
Trace Timeline
View time breakdown:
Trace: rag-query (3.2s total)
├── Span: embed-query (200ms) ████
├── Span: vector-search (400ms) ████████
├── Span: format-context (50ms) █
└── Generation: generate (2.5s) ██████████████████████████████████████████
Slow Traces
Find slowest traces:
- Go to Traces
- Sort by Duration (descending)
- Review top traces
- Analyze timeline for bottlenecks
P95 Analysis
Focus on the worst experiences:
- Filter traces with duration > p95
- Look for common patterns
- Identify root causes
Common Performance Issues
1. Long Context = Slow Response
Problem: Large prompts increase latency.
Solution:
// Trim context to essential information
const context = documents
  .slice(0, 5)            // Limit document count
  .map((d) => d.summary)  // Use summaries, not full content
  .join('\n');
2. Cold Starts
Problem: First request is slow.
Solution:
- Keep connections warm
- Use connection pooling
- Implement health check pings
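In Node, the first two bullets often come down to reusing TCP/TLS connections with a keep-alive agent, so later requests skip the handshake. A minimal sketch (the `maxSockets` value is an assumption to tune for your traffic):

```typescript
import https from 'node:https';

// Reuse TCP/TLS connections instead of paying the handshake on every call.
const keepAliveAgent = new https.Agent({
  keepAlive: true, // Hold sockets open between requests
  maxSockets: 20,  // Cap concurrent connections per host
});

// Pass the agent to your HTTP client so warm connections are reused,
// e.g. via the httpAgent option where your SDK supports one.
```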
3. Rate Limiting Delays
Problem: Rate limits cause queuing.
Solution:
- Implement request queuing
- Use multiple API keys
- Configure fallback providers
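One way to smooth bursts before they hit the provider's limiter is a client-side token bucket. A deterministic sketch (class and method names are illustrative; in practice `refill` would run on a timer matched to your rate limit):

```typescript
// Client-side token bucket: allow at most `capacity` requests per
// refill window; callers that miss should queue and retry.
class TokenBucket {
  private tokens: number;
  constructor(private capacity: number) {
    this.tokens = capacity;
  }
  tryAcquire(): boolean {
    if (this.tokens > 0) {
      this.tokens -= 1;
      return true;
    }
    return false; // Caller should queue and retry after a refill
  }
  refill(n: number): void {
    this.tokens = Math.min(this.capacity, this.tokens + n);
  }
}

const bucket = new TokenBucket(2);
console.log(bucket.tryAcquire(), bucket.tryAcquire(), bucket.tryAcquire());
// true true false
```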
4. Network Latency
Problem: High network round-trip time.
Solution:
- Deploy closer to provider regions
- Use regional endpoints
- Enable connection reuse
5. Sequential Operations
Problem: Operations run sequentially when they could parallelize.
Solution:
// Bad: sequential awaits serialize independent work
const embedding = await getEmbedding(query);
const userProfile = await getUserProfile(userId);

// Good: run independent calls in parallel
const [embedding, userProfile] = await Promise.all([
  getEmbedding(query),
  getUserProfile(userId),
]);
Optimization Strategies
1. Model Selection for Speed
Choose faster models when appropriate:
| Use Case | Fast Model | When to Use |
|---|---|---|
| Quick responses | gpt-4o-mini | Simple questions |
| Streaming | Any | User-facing |
| Classification | claude-3-haiku | Binary decisions |
2. Prompt Length Optimization
Reduce prompt size:
// Before: 2000 tokens, 2.5s
const prompt = fullContext + fullInstructions + examples;

// After: 500 tokens, 0.8s
const prompt = summarizedContext + conciseInstructions;
3. Caching for Speed
Enable caching for instant responses:
First request: 2.5s (cache miss)
Second request: 50ms (cache hit) - 50x faster!
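A minimal in-memory TTL cache illustrates the idea (the clock is injected so the sketch stays deterministic; a real deployment would more likely use the platform's built-in caching or a shared store like Redis):

```typescript
// Tiny TTL cache: identical keys within `ttlMs` skip the LLM call entirely.
class ResponseCache<T> {
  private store = new Map<string, { value: T; expires: number }>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): T | undefined {
    const hit = this.store.get(key);
    if (!hit || hit.expires <= this.now()) {
      this.store.delete(key);
      return undefined; // Miss: caller falls through to the real request
    }
    return hit.value; // Hit: milliseconds instead of seconds
  }

  set(key: string, value: T): void {
    this.store.set(key, { value, expires: this.now() + this.ttlMs });
  }
}
```

Keying the cache on a hash of the full prompt (model + messages) is the usual choice, since any change to the prompt must produce a miss.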
4. Streaming for Perception
Use streaming to improve perceived speed:
// Without streaming: 2.5s wait, then the full response at once
// With streaming: ~200ms to first token, progressive display
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true, // Enable streaming
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
5. Pre-computation
Pre-compute when possible:
// Instead of computing on request:
// - Pre-compute embeddings for common queries
// - Cache document summaries
// - Pre-generate frequent responses
6. Request Batching
Batch multiple operations:
// Instead of 10 separate embedding requests, send one batched call:
const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: queries, // Batch all queries in a single request
});
Setting Performance Alerts
Latency Alert
Alert when latency exceeds threshold:
- Go to Settings > Alerts
- Create alert:
- Metric: p95 Latency
- Threshold: > 5 seconds
- Window: 5 minutes
TTFT Alert
Alert on slow first token:
metric: ttft_p95
threshold: "> 1 second"
window: 5 minutes
Error Rate Alert
Alert on errors (often related to timeouts):
metric: error_rate
threshold: "> 5%"
window: 5 minutes
Performance Testing
Load Testing
Test performance under load:
- Gradually increase request rate
- Monitor latency percentiles
- Find breaking point
- Plan capacity
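The ramp-up can be scripted with a small driver that runs one stage at a fixed concurrency and collects per-request latencies for percentile comparison (an illustrative sketch; `request` stands in for your real API call):

```typescript
// Run one load-test stage: `concurrency` workers, each issuing
// `requestsPerWorker` sequential requests, collecting latencies in ms.
async function runStage(
  request: () => Promise<void>,
  concurrency: number,
  requestsPerWorker: number,
): Promise<number[]> {
  const latencies: number[] = [];
  const worker = async () => {
    for (let i = 0; i < requestsPerWorker; i++) {
      const start = Date.now();
      await request();
      latencies.push(Date.now() - start);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return latencies; // Feed into your percentile calculation per stage
}
```

Run stages at increasing concurrency (for example 1, 5, 10, 25) and watch where p95 starts to climb; that knee is your effective capacity limit.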
Baseline Establishment
Establish performance baselines:
| Metric | Baseline | Alert Threshold |
|---|---|---|
| p50 | 1.5s | 3s |
| p95 | 3.5s | 7s |
| p99 | 8s | 15s |
| Error rate | 0.5% | 2% |
Performance Reports
Weekly Performance Report
Performance Report - Week 3, 2024
Latency Summary:
- p50: 1.4s (↓ 0.1s from last week)
- p95: 3.2s (↓ 0.3s from last week)
- p99: 7.5s (same as last week)
Improvements:
- Caching increased hit rate to 45%
- Prompt optimization reduced avg tokens by 20%
Issues:
- Tuesday 2pm: Spike to 15s p99 (provider issue)
- 5% of gpt-4o requests timing out
Recommendations:
- Add fallback for gpt-4o timeouts
- Increase cache TTL to improve hit rate
Next Steps
- Metrics - All metrics
- Cost Analysis - Cost optimization
- Dashboard - Visualization