
Response Caching

Reduce costs and latency with intelligent response caching.

Overview

AI Gateway can cache identical LLM requests to reduce costs and improve response times. When enabled, requests with the same parameters return cached responses instantly.

How It Works

Request → Hash(model + messages + params) → Cache Lookup
    ↓
Cache Hit? → Return cached response (instant)
    ↓
Cache Miss? → Forward to provider → Cache response → Return

  1. AI Gateway generates a hash from your request parameters
  2. If a cached response exists and hasn't expired, it's returned immediately
  3. If not, the request goes to the provider and the response is cached
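The three steps above can be sketched with an in-memory Map and a SHA-256 hash. This is a hedged illustration only: the helper names (`cacheKey`, `cachedCompletion`) and the hashing scheme are assumptions, not the gateway's internal implementation.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical in-memory cache keyed by a hash of the request parameters.
const cache = new Map<string, { body: string; expiresAt: number }>();

function cacheKey(params: object): string {
  // Note: JSON.stringify is key-order sensitive; a real implementation
  // would canonicalize the parameters first.
  return createHash('sha256').update(JSON.stringify(params)).digest('hex');
}

async function cachedCompletion(
  params: object,
  forward: (p: object) => Promise<string>,
  ttlSeconds = 3600,
): Promise<{ body: string; cache: 'HIT' | 'MISS' }> {
  const key = cacheKey(params);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    // Step 2: cached and not expired — return immediately
    return { body: hit.body, cache: 'HIT' };
  }
  // Step 3: forward to the provider, then cache the response
  const body = await forward(params);
  cache.set(key, { body, expiresAt: Date.now() + ttlSeconds * 1000 });
  return { body, cache: 'MISS' };
}
```

A second call with identical parameters returns the stored body without invoking the provider again.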

Enabling Caching

Dashboard Configuration

  1. Navigate to AI Gateway Settings
  2. Under "Cache Settings", toggle Enable Caching on
  3. Set your default TTL (time-to-live)
  4. Click Save

Per-Request Control

Disable caching for specific requests using a header:

const response = await fetch('https://api.transactional.dev/ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GATEWAY_API_KEY}`,
    'Content-Type': 'application/json',
    'X-Cache-Control': 'no-cache',  // Skip cache for this request
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});

Cache Key Generation

The cache key is generated from:

  • Model name
  • Messages array (content and roles)
  • Temperature
  • Max tokens
  • Top P
  • Other sampling parameters

What Affects the Cache Key

Included            Not Included
model               user (user ID)
messages            Request timestamps
temperature         API key
max_tokens          Headers
top_p
stop sequences
tools / functions
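Based on the field list above, key derivation could be sketched as hashing only the included fields, so that excluded values like `user` never change the key. The exact field set and hashing scheme here are assumptions for illustration.

```typescript
import { createHash } from 'node:crypto';

// Only sampling-relevant fields enter the key; user IDs, timestamps,
// API keys, and headers do not.
function cacheKeyFields(req: Record<string, unknown>) {
  const { model, messages, temperature, max_tokens, top_p, stop, tools } = req;
  return { model, messages, temperature, max_tokens, top_p, stop, tools };
}

function cacheKey(req: Record<string, unknown>): string {
  // Fields are serialized in a fixed order, so equivalent requests
  // produce identical keys.
  return createHash('sha256')
    .update(JSON.stringify(cacheKeyFields(req)))
    .digest('hex');
}
```

Two requests that differ only in `user` produce the same key; changing `temperature` produces a different one.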

Example

These two requests resolve to the same cache key, so the second is served from cache:

// Request 1
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.7,
});
 
// Request 2 - Same cache key!
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.7,
});

But this one won't (different temperature):

// Request 3 - Different cache key
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.5,  // Different!
});

TTL Configuration

Set how long cached responses are valid:

Setting    Use Case
1 hour     Real-time data, frequently changing content
24 hours   Semi-static content, general queries
7 days     Static content, reference lookups
30 days    Rarely changing data

Per-Model TTL

Configure different TTLs per model in Settings:

Model            TTL
gpt-4o           24 hours
gpt-3.5-turbo    7 days
claude-3-haiku   7 days

Cache Monitoring

Response Headers

Check cache status in response headers:

X-Cache: HIT       # Served from cache
X-Cache: MISS      # Fresh response from provider
X-Cache-TTL: 3600  # Seconds until expiry

Dashboard Metrics

View cache performance in the dashboard:

  • Hit Rate: Percentage of requests served from cache
  • Cost Savings: Estimated savings from cached responses
  • Cache Size: Total cached responses

When to Disable Caching

Disable caching for:

  1. Non-deterministic requests - When you need different responses
  2. Real-time data - Stock prices, weather, current events
  3. Personalized content - User-specific recommendations
  4. Testing - During development and debugging

Disabling Methods

Per-request header:

headers: { 'X-Cache-Control': 'no-cache' }

Temperature-based:

Setting temperature > 0 makes the provider's output non-deterministic, but identical requests still return the same cached response. Use the no-cache header when each call must produce a fresh response.

Cost Savings Calculator

Estimate your savings with caching:

Metric            Formula
Requests Cached   Total Requests × Cache Hit Rate
Token Savings     Cached Requests × Avg Tokens
Cost Savings      Token Savings × Token Price

Example:

  • 100,000 requests/month
  • 60% cache hit rate
  • 500 avg tokens per request
  • $0.01 per 1K tokens

Savings = 60,000 × 500 × ($0.01/1000) = $300/month
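The three formulas above compose into a small helper (illustrative only; not a gateway API):

```typescript
// Estimated monthly savings from cached requests, per the formulas above.
function cacheSavings(
  totalRequests: number,
  hitRate: number,            // fraction between 0 and 1
  avgTokensPerRequest: number,
  pricePer1kTokens: number,
): number {
  const cachedRequests = totalRequests * hitRate;       // Requests Cached
  const tokensSaved = cachedRequests * avgTokensPerRequest; // Token Savings
  return (tokensSaved / 1000) * pricePer1kTokens;       // Cost Savings
}

// The worked example: 100,000 requests, 60% hit rate, 500 tokens, $0.01/1K
cacheSavings(100_000, 0.6, 500, 0.01); // ≈ 300 ($/month)
```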

Best Practices

1. Normalize Inputs

Ensure consistent formatting for better cache hits:

// Normalize user input
const normalizedMessage = userInput.trim().toLowerCase();
 
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: normalizedMessage }],
});

2. Use Deterministic Parameters

Set temperature: 0 for consistent, cacheable results:

await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,  // Deterministic output
  messages: [...],
});

3. Separate Cacheable Requests

Split requests into cacheable and non-cacheable parts:

// Cacheable: Static system prompt processing
const systemContext = await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,
  messages: [{ role: 'user', content: 'Summarize our product features' }],
});
 
// Non-cacheable: Dynamic user interaction
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: systemContext.choices[0].message.content },
    { role: 'user', content: dynamicUserQuestion },
  ],
}, {
  headers: { 'X-Cache-Control': 'no-cache' },
});

Cache Invalidation

Clear cached responses when needed:

  1. Go to Settings > Cache Settings
  2. Click Clear Cache
  3. Optionally filter by model or date range

Note: Cache invalidation is immediate. New requests will be forwarded to providers.

Next Steps