# Rate Limiting

Understanding and managing rate limits for AI Gateway.

## Overview

AI Gateway implements rate limiting at multiple levels to ensure fair usage and protect against abuse. Understanding these limits helps you design resilient applications.

## Rate Limit Tiers
| Plan | Requests/min | Requests/day | Tokens/day |
|---|---|---|---|
| Free | 20 | 1,000 | 100,000 |
| Pro | 200 | 50,000 | 10,000,000 |
| Team | 1,000 | 250,000 | 50,000,000 |
| Enterprise | Custom | Custom | Custom |
## Rate Limit Headers

Every response includes rate limit information:

```
X-RateLimit-Limit: 200        # Max requests per minute
X-RateLimit-Remaining: 150    # Requests remaining
X-RateLimit-Reset: 1706140800 # Unix timestamp when the limit resets
```
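These headers can be read directly to decide whether to slow down. A small helper like the following (a sketch; header names as documented above, lowercased as fetch and Node normalize them) converts the reset timestamp into a wait duration:

```typescript
// Compute seconds until the rate-limit window resets, from the
// X-RateLimit-Reset header (Unix seconds).
function secondsUntilReset(
  headers: Record<string, string>,
  nowMs: number = Date.now(),
): number {
  const reset = parseInt(headers['x-ratelimit-reset'], 10);
  if (Number.isNaN(reset)) return 0;
  return Math.max(0, reset - Math.floor(nowMs / 1000));
}
```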
## Types of Limits

### 1. Gateway Limits

Applied by AI Gateway to your account:
| Limit | Scope | Action |
|---|---|---|
| Requests per minute | Per API key | 429 error |
| Requests per day | Per organization | 429 error |
| Tokens per day | Per organization | 429 error |
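Daily token quotas are easiest to respect with a simple client-side tracker. A minimal sketch (the limit value comes from the tier table above; actual consumption comes from each response's `usage` field):

```typescript
// Track token consumption against a daily quota (e.g. 10,000,000 on Pro).
class TokenBudget {
  private used = 0;

  constructor(private readonly dailyLimit: number) {}

  // Call with response.usage.total_tokens after each completion.
  record(tokens: number): void {
    this.used += tokens;
  }

  remaining(): number {
    return Math.max(0, this.dailyLimit - this.used);
  }

  // True once usage crosses the given fraction of the quota (default 80%).
  nearLimit(fraction = 0.8): boolean {
    return this.used >= this.dailyLimit * fraction;
  }
}
```

Reset the tracker when the daily window rolls over.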
### 2. Provider Limits

Applied by the underlying provider (OpenAI, Anthropic, etc.):
| Provider | Limit | AI Gateway Behavior |
|---|---|---|
| OpenAI | 3,500 RPM (GPT-4) | Triggers fallback |
| Anthropic | 1,000 RPM | Triggers fallback |
|  | 60 RPM (free tier) | Triggers fallback |
When a provider returns a rate-limit error, AI Gateway can automatically fall back to another provider.
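AI Gateway handles this for you, but the pattern itself is simple to sketch: try providers in order, moving to the next only on a 429. The `providers` array of call functions here is a stand-in for real per-provider clients:

```typescript
// Try each provider in order; fall back to the next only on a 429.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const call of providers) {
    try {
      return await call();
    } catch (error: any) {
      if (error?.status !== 429) throw error; // non-rate-limit errors propagate
      lastError = error;
    }
  }
  throw lastError; // every provider was rate limited
}
```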
### 3. Per-Key Limits

Set custom limits per API key:

1. Go to **Settings > Gateway API Keys**
2. Click on a key to edit
3. Set **Rate Limit** (requests/minute)
4. Click **Save**
## Handling Rate Limits

### Check Before You're Limited

Monitor headers to predict limits:

```typescript
// Use .withResponse() to access the raw HTTP response alongside the result
const { data, response } = await openai.chat.completions
  .create({...})
  .withResponse();

// Check remaining quota
const remaining = response.headers.get('X-RateLimit-Remaining');
if (remaining !== null && parseInt(remaining, 10) < 10) {
  console.warn('Approaching rate limit');
}
```

### Implement Retry Logic
Handle 429 errors with exponential backoff:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error.status === 429) {
        // Prefer the server's Retry-After hint; otherwise back off exponentially
        const retryAfter = Number(error.headers?.['retry-after'] ?? Math.pow(2, i));
        console.log(`Rate limited, retrying in ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}
```

### Use Queuing
For high-volume applications, implement a request queue:

```typescript
import PQueue from 'p-queue';

const queue = new PQueue({
  concurrency: 10,  // Max concurrent requests
  interval: 60000,  // 1 minute
  intervalCap: 200, // Max requests per interval
});

async function rateLimitedRequest(params: any) {
  return queue.add(() => openai.chat.completions.create(params));
}
```

## Rate Limit Responses
### 429 Too Many Requests

```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "type": "rate_limit_error"
  }
}
```

Headers:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 200
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706140830
```
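Per RFC 9110, `Retry-After` may carry either delay-seconds (as shown above) or an HTTP-date, so a robust client handles both forms. A small parser sketch:

```typescript
// Parse a Retry-After header value into milliseconds to wait.
// The value is either delay-seconds ("30") or an HTTP-date.
function retryAfterMs(value: string, nowMs: number = Date.now()): number {
  const seconds = Number(value);
  if (!Number.isNaN(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(value);
  return Number.isNaN(date) ? 0 : Math.max(0, date - nowMs);
}
```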
## Increasing Limits

### Upgrade Your Plan

Higher plans include higher limits:

1. Go to **Settings > Billing**
2. Click **Upgrade Plan**
3. Select a higher tier
### Request Enterprise Limits

For custom limits, contact sales:
- Custom requests per minute
- Custom daily quotas
- Dedicated capacity
- SLA guarantees
## Best Practices

### 1. Implement Client-Side Throttling

Don't wait for 429 errors:

```typescript
import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  maxConcurrent: 10,
  minTime: 50, // 50ms between requests = 20/sec max
});

const rateLimitedCall = limiter.wrap(
  openai.chat.completions.create.bind(openai.chat.completions),
);
```

### 2. Use Caching
Enable caching to reduce request volume:

```typescript
// Identical requests hit the cache and don't count against limits
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
});
```

### 3. Batch Similar Requests
Combine multiple queries when possible:

```typescript
// Instead of 10 separate requests:
for (const question of questions) {
  await openai.chat.completions.create({...});
}

// Batch into one request:
const combinedPrompt = questions.map((q, i) => `${i + 1}. ${q}`).join('\n');
await openai.chat.completions.create({
  messages: [{
    role: 'user',
    content: `Answer each question:\n${combinedPrompt}`,
  }],
});
```

### 4. Monitor Usage
Track usage to predict limits:

1. Go to **AI Gateway > Analytics**
2. View request volume over time
3. Set up alerts for usage thresholds
## Monitoring & Alerts

### Usage Dashboard

View real-time usage:
- Requests per minute graph
- Daily request count
- Token usage tracking
- Provider-specific breakdowns
### Alert Configuration

Set up alerts before hitting limits:

1. Go to **Settings > Alerts**
2. Create an alert:
   - Metric: Request Rate
   - Threshold: 80% of limit
   - Channel: Email/Slack/Webhook
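For webhook alerts, the receiver only needs to decide whether a payload crosses the threshold. A minimal sketch; the payload shape (`metric`, `value`, `limit`) is hypothetical here, so check your alert configuration for the actual fields:

```typescript
// Hypothetical alert payload -- verify the real schema in your alert settings.
interface UsageAlert {
  metric: string; // e.g. 'request_rate'
  value: number;  // current usage
  limit: number;  // configured limit
}

// Fire once usage reaches 80% of the limit, matching the threshold above.
function shouldNotify(alert: UsageAlert, threshold = 0.8): boolean {
  return alert.value >= alert.limit * threshold;
}
```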
## Next Steps
- Caching - Reduce request volume
- Fallback - Handle provider limits
- Cost Tracking - Monitor spending