Response Caching
Reduce costs and latency with intelligent response caching.
Overview
AI Gateway can cache identical LLM requests to reduce costs and improve response times. When enabled, requests with the same parameters return cached responses instantly.
How It Works
```
Request → Hash(model + messages + params) → Cache Lookup
    ↓
Cache Hit?  → Return cached response (instant)
    ↓
Cache Miss? → Forward to provider → Cache response → Return
```
- AI Gateway generates a hash from your request parameters
- If a cached response exists and hasn't expired, it's returned immediately
- If not, the request goes to the provider and the response is cached
Enabling Caching
Dashboard Configuration
1. Navigate to AI Gateway Settings
2. Under "Cache Settings", toggle Enable Caching on
3. Set your default TTL (time-to-live)
4. Click Save
Per-Request Control
Disable caching for specific requests using a header:
```javascript
const response = await fetch('https://api.transactional.dev/ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GATEWAY_API_KEY}`,
    'Content-Type': 'application/json',
    'X-Cache-Control': 'no-cache', // Skip cache for this request
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
```

Cache Key Generation
The cache key is generated from:
- Model name
- Messages array (content and roles)
- Temperature
- Max tokens
- Top P
- Other sampling parameters
What Affects the Cache Key
| Included | Not Included |
|---|---|
| model | user (user ID) |
| messages | Request timestamps |
| temperature | API key |
| max_tokens | Headers |
| top_p | |
| stop sequences | |
| tools / functions | |
Example
These two requests will hit the same cache:
```javascript
// Request 1
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.7,
});

// Request 2 - same cache key!
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.7,
});
```

But this one won't (different temperature):
```javascript
// Request 3 - different cache key
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
  temperature: 0.5, // Different!
});
```

TTL Configuration
Set how long cached responses are valid:
| Setting | Use Case |
|---|---|
| 1 hour | Real-time data, frequently changing content |
| 24 hours | Semi-static content, general queries |
| 7 days | Static content, reference lookups |
| 30 days | Rarely changing data |
Per-Model TTL
Configure different TTLs per model in Settings:
| Model | TTL |
|---|---|
| gpt-4o | 24 hours |
| gpt-3.5-turbo | 7 days |
| claude-3-haiku | 7 days |
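Conceptually, per-model TTL is a lookup with a fallback to the gateway-wide default. A minimal sketch, using the example values from the table above:

```javascript
// Per-model TTLs in seconds, mirroring the table above (values are examples)
const MODEL_TTL = {
  'gpt-4o': 24 * 3600,
  'gpt-3.5-turbo': 7 * 24 * 3600,
  'claude-3-haiku': 7 * 24 * 3600,
};
const DEFAULT_TTL = 3600; // gateway-wide default for models without an override

function ttlFor(model) {
  return MODEL_TTL[model] ?? DEFAULT_TTL;
}
```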
Cache Monitoring
Response Headers
Check cache status in response headers:
```
X-Cache: HIT        # Served from cache
X-Cache: MISS       # Fresh response from provider
X-Cache-TTL: 3600   # Seconds until expiry
```
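For example, you could read these headers off a `fetch` response with a small helper (header names as documented above; the helper itself is our own sketch):

```javascript
// Extract cache status from a gateway response's headers
function cacheInfo(headers) {
  const ttl = headers.get('X-Cache-TTL');
  return {
    status: headers.get('X-Cache'), // 'HIT' or 'MISS'
    ttlSeconds: ttl === null ? null : Number(ttl),
  };
}

// Usage after a gateway call:
// const res = await fetch(url, options);
// const { status, ttlSeconds } = cacheInfo(res.headers);
```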
Dashboard Metrics
View cache performance in the dashboard:
- Hit Rate: Percentage of requests served from cache
- Cost Savings: Estimated savings from cached responses
- Cache Size: Total cached responses
When to Disable Caching
Disable caching for:
- Non-deterministic requests - When you need a different response on each call
- Real-time data - Stock prices, weather, current events
- Personalized content - User-specific recommendations
- Testing - During development and debugging
Disabling Methods
Per-request header:
```javascript
headers: { 'X-Cache-Control': 'no-cache' }
```

Temperature-based:
Setting temperature > 0 makes the provider's output random, but identical requests still share one cache entry, so repeats return the same cached response. Use the no-cache header when you need a fresh response every time.
Cost Savings Calculator
Estimate your savings with caching:
| Metric | Formula |
|---|---|
| Requests Cached | Total Requests × Cache Hit Rate |
| Token Savings | Cached Requests × Avg Tokens |
| Cost Savings | Token Savings × Token Price |
Example:
- 100,000 requests/month
- 60% cache hit rate
- 500 avg tokens per request
- $0.01 per 1K tokens
Savings = 60,000 × 500 × ($0.01/1000) = $300/month
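The three formulas compose into a single estimate. A small helper (our own sketch) reproduces the worked example:

```javascript
// Savings estimate using the formulas from the table above
function monthlySavings(totalRequests, hitRate, avgTokens, pricePer1kTokens) {
  const cachedRequests = totalRequests * hitRate;   // Requests Cached
  const tokenSavings = cachedRequests * avgTokens;  // Token Savings
  return tokenSavings * (pricePer1kTokens / 1000);  // Cost Savings
}

monthlySavings(100_000, 0.6, 500, 0.01); // ≈ 300 ($/month)
```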
Best Practices
1. Normalize Inputs
Ensure consistent formatting for better cache hits:
```javascript
// Normalize user input for consistent cache keys
const normalizedMessage = userInput.trim().toLowerCase();

await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: normalizedMessage }],
});
```

2. Use Deterministic Parameters
Set temperature: 0 for consistent, cacheable results:
```javascript
await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0, // Deterministic output
  messages: [...],
});
```

3. Separate Cacheable Requests
Split requests into cacheable and non-cacheable parts:
// Cacheable: Static system prompt processing
```javascript
// Cacheable: static system prompt processing
const systemContext = await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,
  messages: [{ role: 'user', content: 'Summarize our product features' }],
});

// Non-cacheable: dynamic user interaction
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: systemContext.choices[0].message.content },
    { role: 'user', content: dynamicUserQuestion },
  ],
}, {
  headers: { 'X-Cache-Control': 'no-cache' },
});
```

Cache Invalidation
Clear cached responses when needed:
- Go to Settings > Cache Settings
- Click Clear Cache
- Optionally filter by model or date range
Note: Cache invalidation is immediate. New requests will be forwarded to providers.