Response Caching
Reduce costs and latency with intelligent response caching.
Overview
AI Gateway can cache identical LLM requests to reduce costs and improve response times. When enabled, requests with the same parameters return cached responses instantly.
How It Works
```
Request → Hash(model + messages + params) → Cache Lookup
    ↓
Cache Hit?  → Return cached response (instant)
    ↓
Cache Miss? → Forward to provider → Cache response → Return
```
- AI Gateway generates a hash from your request parameters
- If a cached response exists and hasn't expired, it's returned immediately
- If not, the request goes to the provider and the response is cached
Enabling Caching
Dashboard Configuration
1. Navigate to AI Gateway Settings
2. Under "Cache Settings", toggle Enable Caching on
3. Set your default TTL (time-to-live)
4. Click Save
Per-Request Control
Disable caching for specific requests using a header:
```javascript
const response = await fetch('https://api.transactional.dev/ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GATEWAY_API_KEY}`,
    'Content-Type': 'application/json',
    'X-Cache-Control': 'no-cache', // Skip cache for this request
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
```

Cache Key Generation
The cache key is generated from:
- Model name
- Messages array (content and roles)
- Temperature
- Max tokens
- Top P
- Other sampling parameters
What Affects the Cache Key
| Included | Not Included |
|---|---|
| model | user (user ID) |
| messages | Request timestamps |
| temperature | API key |
| max_tokens | Headers |
| top_p | |
| stop sequences | |
| tools / functions | |
Example
These two requests will hit the same cache:
```javascript
// Request 1
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.7,
});

// Request 2 - same cache key!
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.7,
});
```

But this one won't (different temperature):
```javascript
// Request 3 - different cache key
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.5, // Different!
});
```

TTL Configuration
Set how long cached responses are valid:
| Setting | Use Case |
|---|---|
| 1 hour | Real-time data, frequently changing content |
| 24 hours | Semi-static content, general queries |
| 7 days | Static content, reference lookups |
| 30 days | Rarely changing data |
Per-Model TTL
Configure different TTLs per model in Settings:
| Model | TTL |
|---|---|
| gpt-4o | 24 hours |
| gpt-3.5-turbo | 7 days |
| claude-3-haiku | 7 days |
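Conceptually, per-model TTL is a lookup with a fallback to the gateway-wide default. A minimal sketch, using the example values from the table above:

```javascript
// Per-model TTLs in seconds, mirroring the table above (values are examples)
const MODEL_TTL = {
  'gpt-4o': 24 * 3600,
  'gpt-3.5-turbo': 7 * 24 * 3600,
  'claude-3-haiku': 7 * 24 * 3600,
};
const DEFAULT_TTL = 3600; // gateway-wide default for models without an override

function ttlFor(model) {
  return MODEL_TTL[model] ?? DEFAULT_TTL;
}
```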
Cache Monitoring
Response Headers
Check cache status in response headers:
```
X-Cache: HIT        # Served from cache
X-Cache: MISS       # Fresh response from provider
X-Cache-TTL: 3600   # Seconds until expiry
```
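For example, you could read these headers off a `fetch` response with a small helper (header names as documented above; the helper itself is our own sketch):

```javascript
// Extract cache status from a gateway response's headers
function cacheInfo(headers) {
  const ttl = headers.get('X-Cache-TTL');
  return {
    status: headers.get('X-Cache'), // 'HIT' or 'MISS'
    ttlSeconds: ttl === null ? null : Number(ttl),
  };
}

// Usage after a gateway call:
// const res = await fetch(url, options);
// const { status, ttlSeconds } = cacheInfo(res.headers);
```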
Dashboard Metrics
View cache performance in the dashboard:
- Hit Rate: Percentage of requests served from cache
- Cost Savings: Estimated savings from cached responses
- Cache Size: Total cached responses
When to Disable Caching
Disable caching for:
- Non-deterministic requests - When you need a different response on each call
- Real-time data - Stock prices, weather, current events
- Personalized content - User-specific recommendations
- Testing - During development and debugging
Disabling Methods
Per-request header:
```javascript
headers: { 'X-Cache-Control': 'no-cache' }
```

Temperature-based:
Setting temperature > 0 makes the provider's output random, but identical requests still share one cache entry, so repeats return the same cached response. Use the no-cache header when you need a fresh response every time.
Cost Savings Calculator
Estimate your savings with caching:
| Metric | Formula |
|---|---|
| Requests Cached | Total Requests × Cache Hit Rate |
| Token Savings | Cached Requests × Avg Tokens |
| Cost Savings | Token Savings × Token Price |
Example:
- 100,000 requests/month
- 60% cache hit rate
- 500 avg tokens per request
- $0.01 per 1K tokens
Savings = 60,000 × 500 × ($0.01/1000) = $300/month
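The three formulas compose into a single estimate. A small helper (our own sketch) reproduces the worked example:

```javascript
// Savings estimate using the formulas from the table above
function monthlySavings(totalRequests, hitRate, avgTokens, pricePer1kTokens) {
  const cachedRequests = totalRequests * hitRate;   // Requests Cached
  const tokenSavings = cachedRequests * avgTokens;  // Token Savings
  return tokenSavings * (pricePer1kTokens / 1000);  // Cost Savings
}

monthlySavings(100_000, 0.6, 500, 0.01); // ≈ 300 ($/month)
```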
Best Practices
1. Normalize Inputs
Ensure consistent formatting for better cache hits:
```javascript
// Normalize user input for consistent cache keys
const normalizedMessage = userInput.trim().toLowerCase();

await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: normalizedMessage }],
});
```

2. Use Deterministic Parameters
Set temperature: 0 for consistent, cacheable results:
```javascript
await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0, // Deterministic output
  messages: [...],
});
```

3. Separate Cacheable Requests
Split requests into cacheable and non-cacheable parts:
// Cacheable: Static system prompt processing
```javascript
// Cacheable: static system prompt processing
const systemContext = await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,
  messages: [{ role: 'user', content: 'Summarize our product features' }],
});

// Non-cacheable: dynamic user interaction
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: systemContext.choices[0].message.content },
    { role: 'user', content: dynamicUserQuestion },
  ],
}, {
  headers: { 'X-Cache-Control': 'no-cache' },
});
```

Cache Invalidation
Clear cached responses when needed:
- Go to Settings > Cache Settings
- Click Clear Cache
- Optionally filter by model or date range
Note: Cache invalidation is immediate. New requests will be forwarded to providers.