Production Best Practices

Best practices for running Observability in production environments.

Overview

Running Observability in production requires careful attention to performance, sampling, privacy, and reliability. This guide covers best practices for production deployments.

Initialization

Proper Setup

Initialize once at application startup:

// lib/observability.ts
import { initObservability, getObservability } from '@transactional/observability';
 
if (!process.env.TRANSACTIONAL_OBSERVABILITY_DSN) {
  console.warn('Observability DSN not configured');
}
 
initObservability({
  dsn: process.env.TRANSACTIONAL_OBSERVABILITY_DSN,
  enabled: process.env.NODE_ENV === 'production',
  debug: process.env.NODE_ENV !== 'production',
});
 
export const obs = getObservability();

Environment-Specific Config

initObservability({
  dsn: process.env.TRANSACTIONAL_OBSERVABILITY_DSN,
 
  // Production settings
  enabled: process.env.NODE_ENV === 'production',
  batchSize: 100,
  flushInterval: 5000,
 
  // Development: immediate flush, debug logs
  ...(process.env.NODE_ENV !== 'production' && {
    batchSize: 1,
    flushInterval: 0,
    debug: true,
  }),
});

Sampling Strategies

Why Sample?

In high-volume production:

  • Reduce costs
  • Lower overhead
  • Focus on important data

Head-Based Sampling

Decide at trace start:

function shouldTrace(): boolean {
  // Sample 10% of requests
  return Math.random() < 0.1;
}
 
async function handleRequest(req: Request) {
  if (!shouldTrace()) {
    return processWithoutTracing(req);
  }
 
  const trace = obs.trace({
    name: 'api-request',
    metadata: { sampled: true },
  });
  // ...
}
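A plain Math.random() draw can split one user's journey across sampled and unsampled requests. A deterministic variant (a sketch, not part of the SDK) hashes a stable ID so the same request or session always gets the same decision:

```typescript
// FNV-1a 32-bit hash mapped into [0, 1), so the sampling
// decision is a pure function of the ID.
function hashToUnit(id: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) / 0x100000000;
}

function shouldTraceId(id: string, rate = 0.1): boolean {
  return hashToUnit(id) < rate;
}
```

Every service that sees the same ID makes the same call, so sampled traces stay complete across hops.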

Smart Sampling

Sample at higher rates for important scenarios (the request object here is framework-style, exposing user and path fields):

function getSampleRate(request: Request): number {
  // Always trace explicitly flagged debug requests
  if (request.headers.get('x-debug')) return 1.0;
 
  // Sample more for premium users
  if (request.user?.tier === 'enterprise') return 0.5;
 
  // Sample more for new features
  if (request.path.startsWith('/v2/')) return 0.3;
 
  // Default rate
  return 0.1;
}
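To act on the per-request rate, draw once at trace start. Making the random source injectable (the rand parameter below is purely for testability, not an SDK feature) keeps the decision unit-testable:

```typescript
// Compare a single random draw against the request's sample rate.
function shouldSample(rate: number, rand: () => number = Math.random): boolean {
  return rand() < rate;
}

// In the handler:
// if (shouldSample(getSampleRate(request))) {
//   const trace = obs.trace({ name: 'api-request' });
// }
```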

Tail-Based Sampling

Decide after trace completes:

// Keep traces that:
// - Have errors
// - Exceed latency threshold
// - Use significant tokens
 
const trace = obs.trace({ name: 'request' });
 
// ... do work ...
 
const shouldKeep =
  hasError ||
  duration > 5000 ||
  totalTokens > 10000 ||
  Math.random() < 0.1;
 
if (shouldKeep) {
  await trace.end({ output });
} else {
  trace.discard();  // Don't send to server
}
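Those keep/discard rules can be factored into one predicate (a sketch; the stats shape and thresholds mirror the comments above, and rand is injectable only so the random baseline is testable):

```typescript
interface TraceStats {
  hasError: boolean;
  durationMs: number;
  totalTokens: number;
}

// Keep every error, slow, or token-heavy trace, plus a 10% random baseline.
function shouldKeepTrace(
  stats: TraceStats,
  rand: () => number = Math.random
): boolean {
  if (stats.hasError) return true;
  if (stats.durationMs > 5000) return true;
  if (stats.totalTokens > 10000) return true;
  return rand() < 0.1;
}
```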

Performance Optimization

Async Flushing

Don't block on observability:

async function handleRequest(req: Request): Promise<Response> {
  const trace = obs.trace({ name: 'request' });
 
  try {
    const result = await process(req);
 
    // Fire and forget - don't await
    trace.end({ output: result }).catch(console.error);
 
    return result;
  } catch (error) {
    trace.error(error).catch(console.error);
    throw error;
  }
}

Batch Size Tuning

Optimize batch settings:

// High volume: larger batches, longer intervals
initObservability({
  batchSize: 500,
  flushInterval: 10000,
});
 
// Low latency: smaller batches, shorter intervals
initObservability({
  batchSize: 50,
  flushInterval: 1000,
});

Graceful Shutdown

Flush before shutdown:

process.on('SIGTERM', async () => {
  console.log('Shutting down...');
 
  // Flush all pending traces
  await obs.shutdown();
 
  process.exit(0);
});
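SIGTERM is not the only signal you may receive, and handlers can fire more than once. A small once() wrapper (a sketch; the obs.shutdown() usage matches the example above) makes the flush idempotent:

```typescript
// Run an async function at most once; later calls share the first promise.
function once(fn: () => Promise<void>): () => Promise<void> {
  let pending: Promise<void> | undefined;
  return () => (pending ??= fn());
}

// const flushOnce = once(() => obs.shutdown());
// process.on('SIGTERM', () => flushOnce().then(() => process.exit(0)));
// process.on('SIGINT', () => flushOnce().then(() => process.exit(0)));
```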

Data Privacy

PII Handling

Don't log sensitive data:

// Bad - logs PII
trace({
  name: 'login',
  input: {
    email: user.email,
    password: user.password,  // Never log!
  },
});
 
// Good - sanitized
trace({
  name: 'login',
  input: {
    hasEmail: true,
    hasPassword: true,
  },
});

Input/Output Sanitization

function sanitize(data: any): any {
  if (data === null || typeof data !== 'object') return data;
 
  const sanitized = { ...data };
 
  // Remove sensitive fields (top level only)
  const sensitiveFields = ['password', 'token', 'apiKey', 'ssn', 'creditCard'];
  for (const field of sensitiveFields) {
    if (field in sanitized) {
      sanitized[field] = '[REDACTED]';
    }
  }
 
  // Redact email addresses
  if (sanitized.email) {
    sanitized.email = redactEmail(sanitized.email);
  }
 
  return sanitized;
}
 
const trace = obs.trace({
  name: 'request',
  input: sanitize(request.body),
});
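The sanitizer above only inspects top-level keys and leaves redactEmail undefined. A deeper sketch follows; the field list and masking format are assumptions for illustration, not SDK behavior:

```typescript
const SENSITIVE = new Set(['password', 'token', 'apiKey', 'ssn', 'creditCard']);

// Keep the local part's first character, mask the rest: a***@example.com
function redactEmail(email: string): string {
  const [local, domain] = email.split('@');
  if (!domain) return '[REDACTED]';
  return `${local[0]}***@${domain}`;
}

// Recursively redact sensitive keys in nested objects and arrays.
function deepSanitize(data: unknown): unknown {
  if (Array.isArray(data)) return data.map(deepSanitize);
  if (data === null || typeof data !== 'object') return data;

  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data as Record<string, unknown>)) {
    if (SENSITIVE.has(key)) {
      out[key] = '[REDACTED]';
    } else if (key === 'email' && typeof value === 'string') {
      out[key] = redactEmail(value);
    } else {
      out[key] = deepSanitize(value);
    }
  }
  return out;
}
```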

LLM Content Filtering

Limit what you log from LLM inputs and outputs - for example, truncate long message content:

const generation = obs.generation({
  name: 'completion',
  input: {
    messages: messages.map(m => ({
      role: m.role,
      content: m.content.length > 1000
        ? m.content.substring(0, 1000) + '...[truncated]'
        : m.content,
    })),
  },
});
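Truncation caps volume but does not remove leaked PII. A regex pass over message content before logging helps; the two patterns below are illustrative, not exhaustive:

```typescript
// Replace obvious PII patterns before content reaches a trace.
// Real deployments need broader coverage (phone numbers, addresses, ...).
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const CARD_RE = /\b(?:\d[ -]?){13,16}\b/g;

function scrubContent(text: string): string {
  return text.replace(EMAIL_RE, '[EMAIL]').replace(CARD_RE, '[CARD]');
}
```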

Error Handling

Graceful Degradation

Observability failures shouldn't break your app:

async function tracedOperation() {
  let trace;
 
  try {
    trace = obs.trace({ name: 'operation' });
  } catch (e) {
    // Observability failed, continue without tracing
    console.error('Failed to create trace:', e);
    return doOperation();
  }
 
  try {
    const result = await doOperation();
    await trace.end({ output: result }).catch(console.error);
    return result;
  } catch (error) {
    await trace.error(error).catch(console.error);
    throw error;
  }
}
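This pattern can be wrapped once so every call site degrades the same way (a sketch; the TraceLike shape mirrors the trace objects used throughout this guide):

```typescript
interface TraceLike {
  end(args: { output: unknown }): Promise<void>;
  error(err: unknown): Promise<void>;
}

// Run `fn` under a trace if one can be created; never let
// observability failures change the result or the thrown error.
async function withTrace<T>(
  startTrace: () => TraceLike,
  fn: () => Promise<T>
): Promise<T> {
  let trace: TraceLike | undefined;
  try {
    trace = startTrace();
  } catch (e) {
    console.error('Failed to create trace:', e);
  }

  try {
    const result = await fn();
    trace?.end({ output: result }).catch(console.error);
    return result;
  } catch (error) {
    trace?.error(error).catch(console.error);
    throw error;
  }
}
```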

Timeout Handling

Don't let observability hang:

async function safeFlush(timeout = 5000): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  await Promise.race([
    obs.shutdown(),
    new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Flush timeout')), timeout);
    }),
  ])
    .catch(console.error)
    .finally(() => clearTimeout(timer));  // don't leave a live timer running
}

Monitoring Observability

Health Checks

Monitor your monitoring:

// Metrics to track
const metrics = {
  tracesCreated: 0,
  tracesSent: 0,
  tracesFailed: 0,
  flushErrors: 0,
};
 
// Export for monitoring
app.get('/metrics', (req, res) => {
  res.json(metrics);
});

Alerts

Set up alerts for:

  • High error rate in observability
  • Flush failures
  • Unusual trace volume
  • Missing traces from critical paths

Cost Management

Estimate Costs

// Estimate traces per day
const tracesPerDay = requestsPerDay * sampleRate;
 
// Estimate tokens logged
const avgTokensPerTrace = 1000;
const tokensPerDay = tracesPerDay * avgTokensPerTrace;
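Concrete numbers make the estimate tangible (the volumes below are purely illustrative):

```typescript
// Illustrative volumes - substitute your own production numbers.
const requestsPerDay = 1_000_000;
const sampleRate = 0.1;

const tracesPerDay = requestsPerDay * sampleRate;       // 100,000 traces/day
const avgTokensPerTrace = 1_000;
const tokensPerDay = tracesPerDay * avgTokensPerTrace;  // 100 million tokens/day
```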

Cost Optimization

  1. Sample wisely: Don't trace everything
  2. Truncate large inputs: Limit logged content size
  3. Aggregate similar traces: Reduce unique traces
  4. Set retention policies: Delete old data

Checklist

Pre-Production

  • DSN configured via environment variable
  • Sampling strategy implemented
  • PII filtering in place
  • Graceful shutdown handling
  • Error handling for observability failures
  • Cost estimates reviewed

Post-Launch

  • Monitor trace volume
  • Check error rates
  • Review sample of traces
  • Verify sensitive data not logged
  • Set up alerts

Next Steps