Production Best Practices
Best practices for running Observability in production environments.
Overview
Running Observability in production requires careful attention to performance, sampling, privacy, and reliability. This guide covers initialization, sampling, performance tuning, data privacy, error handling, self-monitoring, and cost management.
Initialization
Proper Setup
Initialize once at application startup:
// lib/observability.ts
import { initObservability, getObservability } from '@transactional/observability';
if (!process.env.TRANSACTIONAL_OBSERVABILITY_DSN) {
  console.warn('Observability DSN not configured');
}
initObservability({
  dsn: process.env.TRANSACTIONAL_OBSERVABILITY_DSN,
  enabled: process.env.NODE_ENV === 'production',
  debug: process.env.NODE_ENV !== 'production',
});
export const obs = getObservability();
Environment-Specific Config
initObservability({
  dsn: process.env.TRANSACTIONAL_OBSERVABILITY_DSN,
  // Production settings
  enabled: process.env.NODE_ENV === 'production',
  batchSize: 100,
  flushInterval: 5000,
  // Development: immediate flush, debug logs
  ...(process.env.NODE_ENV !== 'production' && {
    batchSize: 1,
    flushInterval: 0,
    debug: true,
  }),
});
Sampling Strategies
Why Sample?
In high-volume production, sampling helps you:
- Reduce costs
- Lower overhead
- Focus on important data
Head-Based Sampling
Decide at trace start:
function shouldTrace(): boolean {
  // Sample 10% of requests
  return Math.random() < 0.1;
}
async function handleRequest(req: Request) {
  if (!shouldTrace()) {
    return processWithoutTracing(req);
  }
  const trace = obs.trace({
    name: 'api-request',
    metadata: { sampled: true },
  });
  // ...
}
Smart Sampling
Sample more for important scenarios:
function getSampleRate(request: Request): number {
  // Always trace requests flagged for debugging
  if (request.headers.get('x-debug')) return 1.0;
  // Sample more for premium users
  if (request.user?.tier === 'enterprise') return 0.5;
  // Sample more for new features
  if (request.path.startsWith('/v2/')) return 0.3;
  // Default rate
  return 0.1;
}
Tail-Based Sampling
Decide after trace completes:
// Keep traces that:
// - Have errors
// - Exceed latency threshold
// - Use significant tokens
// - Fall into a 10% random baseline sample
const trace = obs.trace({ name: 'request' });
// ... do work ...
const shouldKeep =
  hasError ||
  duration > 5000 ||
  totalTokens > 10000 ||
  Math.random() < 0.1;
if (shouldKeep) {
  await trace.end({ output });
} else {
  trace.discard(); // Don't send to server
}
Performance Optimization
Async Flushing
Don't block on observability:
async function handleRequest(req: Request): Promise<Response> {
  const trace = obs.trace({ name: 'request' });
  try {
    // processRequest is your own handler (named to avoid shadowing Node's global `process`)
    const result = await processRequest(req);
    // Fire and forget - don't await
    trace.end({ output: result }).catch(console.error);
    return result;
  } catch (error) {
    trace.error(error).catch(console.error);
    throw error;
  }
}
Batch Size Tuning
Optimize batch settings:
// High volume: larger batches, longer intervals
initObservability({
  batchSize: 500,
  flushInterval: 10000,
});
// Low latency: smaller batches, shorter intervals
initObservability({
  batchSize: 50,
  flushInterval: 1000,
});
Graceful Shutdown
Flush before shutdown:
process.on('SIGTERM', async () => {
  console.log('Shutting down...');
  // Flush all pending traces
  await obs.shutdown();
  process.exit(0);
});
Data Privacy
PII Handling
Don't log sensitive data:
// Bad - logs PII
obs.trace({
  name: 'login',
  input: {
    email: user.email,
    password: user.password, // Never log!
  },
});
// Good - sanitized
obs.trace({
  name: 'login',
  input: {
    hasEmail: true,
    hasPassword: true,
  },
});
Input/Output Sanitization
function sanitize(data: any): any {
  // Primitives and null pass through unchanged
  if (typeof data !== 'object' || data === null) return data;
  // Shallow copy: nested objects are not sanitized
  const sanitized = { ...data };
  // Remove sensitive fields
  const sensitiveFields = ['password', 'token', 'apiKey', 'ssn', 'creditCard'];
  for (const field of sensitiveFields) {
    if (field in sanitized) {
      sanitized[field] = '[REDACTED]';
    }
  }
  // Redact email addresses (redactEmail is assumed to be defined elsewhere in your code)
  if (sanitized.email) {
    sanitized.email = redactEmail(sanitized.email);
  }
  return sanitized;
}
const trace = obs.trace({
  name: 'request',
  input: sanitize(request.body),
});
LLM Content Filtering
Filter or truncate LLM inputs and outputs before logging them. This example caps each message at 1,000 characters:
const generation = obs.generation({
  name: 'completion',
  input: {
    messages: messages.map(m => ({
      role: m.role,
      content: m.content.length > 1000
        ? m.content.substring(0, 1000) + '...[truncated]'
        : m.content,
    })),
  },
});
Error Handling
Graceful Degradation
Observability failures shouldn't break your app:
async function tracedOperation() {
  let trace;
  try {
    trace = obs.trace({ name: 'operation' });
  } catch (e) {
    // Observability failed, continue without tracing
    console.error('Failed to create trace:', e);
    return doOperation();
  }
  try {
    const result = await doOperation();
    await trace.end({ output: result }).catch(console.error);
    return result;
  } catch (error) {
    await trace.error(error).catch(console.error);
    throw error;
  }
}
Timeout Handling
Don't let observability hang:
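async function safeFlush(timeout = 5000): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // Rejects if the flush takes longer than `timeout` milliseconds
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Flush timeout')), timeout);
  });
  try {
    await Promise.race([obs.shutdown(), timedOut]);
  } catch (error) {
    console.error(error);
  } finally {
    // Clear the timer so it doesn't keep the process alive after the flush settles
    clearTimeout(timer);
  }
}
Monitoring Observability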
Health Checks
Monitor your monitoring:
// Metrics to track
const metrics = {
  tracesCreated: 0,
  tracesSent: 0,
  tracesFailed: 0,
  flushErrors: 0,
};
// Export for monitoring
app.get('/metrics', (req, res) => {
  res.json(metrics);
});
Alerts
Set up alerts for:
- High error rate in observability
- Flush failures
- Unusual trace volume
- Missing traces from critical paths
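As a rough illustration, the first three alert conditions can be derived from the counters in the Health Checks example. The sketch below is not part of the Observability SDK: `findAlerts`, `AlertThresholds`, and the threshold values are hypothetical names and numbers chosen for illustration, and you would tune them to your own traffic. Detecting missing traces from critical paths would need per-path counters, which are not shown here.

```typescript
// Counters with the same shape as the Health Checks example.
interface ObservabilityMetrics {
  tracesCreated: number;
  tracesSent: number;
  tracesFailed: number;
  flushErrors: number;
}

// Hypothetical thresholds for one check window; tune to your traffic.
interface AlertThresholds {
  maxFailureRate: number; // allowed fraction of created traces that failed to send
  maxFlushErrors: number; // flush errors tolerated per window
  minTraces: number; // fewer traces than this is treated as unusual
  maxTraces: number; // more traces than this is treated as unusual
}

const defaultThresholds: AlertThresholds = {
  maxFailureRate: 0.05,
  maxFlushErrors: 0,
  minTraces: 100,
  maxTraces: 100_000,
};

// Returns one human-readable message per violated threshold.
function findAlerts(m: ObservabilityMetrics, t: AlertThresholds): string[] {
  const alerts: string[] = [];
  // Avoid dividing by zero when no traces were created in the window.
  const failureRate = m.tracesCreated > 0 ? m.tracesFailed / m.tracesCreated : 0;
  if (failureRate > t.maxFailureRate) {
    alerts.push(`High trace failure rate: ${(failureRate * 100).toFixed(1)}%`);
  }
  if (m.flushErrors > t.maxFlushErrors) {
    alerts.push(`Flush failures: ${m.flushErrors}`);
  }
  if (m.tracesCreated < t.minTraces || m.tracesCreated > t.maxTraces) {
    alerts.push(`Unusual trace volume: ${m.tracesCreated}`);
  }
  return alerts;
}
```

You could run a check like this on a timer, reset the counters after each window, and forward any returned messages to whatever alerting channel you already use.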
Cost Management
Estimate Costs
// Estimate traces per day
const tracesPerDay = requestsPerDay * sampleRate;
// Estimate tokens logged
const avgTokensPerTrace = 1000;
const tokensPerDay = tracesPerDay * avgTokensPerTrace;
Cost Optimization
- Sample wisely: Don't trace everything
- Truncate large inputs: Limit logged content size
- Aggregate similar traces: Reduce unique traces
- Set retention policies: Delete old data
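Two of these points lend themselves to small helpers. The sketch below shows one possible way to cap logged payload size and to collapse per-ID paths into a shared trace name. Both `truncateForLogging` and `normalizeTraceName` are hypothetical helpers written for illustration, not SDK functions, and the 2,000-character default is an arbitrary assumption.

```typescript
// Hypothetical helper: cap the serialized size of a payload before logging it.
// Small payloads are returned unchanged; large ones are replaced by a preview.
function truncateForLogging(value: unknown, maxChars = 2000): unknown {
  const json = JSON.stringify(value);
  // JSON.stringify yields undefined for values such as functions; pass those through.
  if (json === undefined || json.length <= maxChars) return value;
  return {
    truncated: true,
    originalLength: json.length,
    preview: json.slice(0, maxChars),
  };
}

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: replace numeric and UUID path segments with ':id'
// so that '/users/123' and '/users/456' share one trace name.
function normalizeTraceName(path: string): string {
  return path
    .split('/')
    .map(segment =>
      /^\d+$/.test(segment) || UUID_PATTERN.test(segment) ? ':id' : segment,
    )
    .join('/');
}
```

With helpers like these, a trace could be created with `name: normalizeTraceName(path)` and `input: truncateForLogging(body)`, which keeps the number of distinct trace names and the size of each logged payload bounded.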
Checklist
Pre-Production
- DSN configured via environment variable
- Sampling strategy implemented
- PII filtering in place
- Graceful shutdown handling
- Error handling for observability failures
- Cost estimates reviewed
Post-Launch
- Monitor trace volume
- Check error rates
- Review sample of traces
- Verify sensitive data not logged
- Set up alerts
Next Steps
- Dashboard - Monitor your production data
- Metrics - Key production metrics
- Evaluation - Evaluate production quality