Rate Limiting

Understanding and managing rate limits for AI Gateway.

Overview

AI Gateway implements rate limiting at multiple levels to ensure fair usage and protect against abuse. Understanding these limits helps you design resilient applications.

Rate Limit Tiers

Plan        Requests/min   Requests/day   Tokens/day
Free        20             1,000          100,000
Pro         200            50,000         10,000,000
Team        1,000          250,000        50,000,000
Enterprise  Custom         Custom         Custom

Rate Limit Headers

Every response includes rate limit information:

X-RateLimit-Limit: 200        # Max requests per minute
X-RateLimit-Remaining: 150    # Requests remaining
X-RateLimit-Reset: 1706140800 # Unix timestamp when limit resets
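
As a sketch, these headers can be read into a small struct so the rest of your code doesn't deal with raw strings (the header names above are the only assumption; this works with any fetch-style Headers object):

```typescript
interface RateLimitInfo {
  limit: number;      // max requests per minute
  remaining: number;  // requests left in the current window
  resetAt: number;    // Unix timestamp (seconds) when the window resets
}

// Parse the gateway's rate-limit headers from a fetch-style Headers object.
function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  return {
    limit: Number(headers.get('X-RateLimit-Limit') ?? 0),
    remaining: Number(headers.get('X-RateLimit-Remaining') ?? 0),
    resetAt: Number(headers.get('X-RateLimit-Reset') ?? 0),
  };
}

// Seconds to wait until the window resets (never negative).
function secondsUntilReset(info: RateLimitInfo, nowMs: number = Date.now()): number {
  return Math.max(0, info.resetAt - Math.floor(nowMs / 1000));
}
```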

Types of Limits

1. Gateway Limits

Applied by AI Gateway to your account:

Limit                 Scope             Action
Requests per minute   Per API key       429 error
Requests per day      Per organization  429 error
Tokens per day        Per organization  429 error
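
The per-key, per-minute limit behaves like a fixed-window counter: requests increment a counter for the current window, and anything over the limit gets a 429. A minimal sketch of that behavior (an illustration, not the gateway's actual implementation):

```typescript
// Fixed-window counter: allow at most `limit` requests per key per window.
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs = 60_000) {}

  // Returns true if the request is allowed, false if it would receive a 429.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const entry = this.counts.get(key);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      // New window: reset the counter for this key.
      this.counts.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count++;
    return true;
  }
}
```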

2. Provider Limits

Applied by the underlying provider (OpenAI, Anthropic, etc.):

Provider    Limit                AI Gateway Behavior
OpenAI      3,500 RPM (GPT-4)    Triggers fallback
Anthropic   1,000 RPM            Triggers fallback
Google      60 RPM (free tier)   Triggers fallback

When a provider returns a rate-limit error, AI Gateway can automatically fall back to another provider.
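
The same pattern can be sketched on the client side: try each provider in order and fall through to the next one only on a 429 (the provider list and the `status` field on the error are assumptions for illustration):

```typescript
interface ProviderError extends Error {
  status?: number;
}

// Try each provider call in order; on a 429, fall through to the next one.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const call of providers) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if ((error as ProviderError).status === 429) continue; // rate limited: try next provider
      throw error; // other errors are not retried here
    }
  }
  throw lastError;
}
```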

3. Per-Key Limits

Set custom limits per API key:

  1. Go to Settings > Gateway API Keys
  2. Click on a key to edit
  3. Set Rate Limit (requests/minute)
  4. Click Save

Handling Rate Limits

Check Before You're Limited

Monitor headers to predict limits:

// The SDK returns the parsed body; use withResponse() to also get the raw
// HTTP response, which carries the rate-limit headers
const { data, response } = await openai.chat.completions
  .create({...})
  .withResponse();
 
// Check remaining quota
const remaining = response.headers.get('X-RateLimit-Remaining');
if (remaining !== null && parseInt(remaining, 10) < 10) {
  console.warn('Approaching rate limit');
}

Implement Retry Logic

Handle 429 errors with exponential backoff:

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429) {
        // Honor Retry-After when present, otherwise back off exponentially
        const retryAfter = Number(error.headers?.['retry-after']) || Math.pow(2, i);
        console.log(`Rate limited, retrying in ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}

Use Queuing

For high-volume applications, implement a request queue:

import PQueue from 'p-queue';
 
const queue = new PQueue({
  concurrency: 10,      // Max concurrent requests
  interval: 60000,      // 1 minute
  intervalCap: 200,     // Max per interval
});
 
async function rateLimitedRequest(params) {
  return queue.add(() => openai.chat.completions.create(params));
}

Rate Limit Responses

429 Too Many Requests

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "type": "rate_limit_error"
  }
}

Headers:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 200
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706140830
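
When both headers are present, Retry-After is the simpler signal. A sketch of picking a wait time from a 429 response (the header names come from the example above; the default backoff is an assumption):

```typescript
// Decide how long to wait after a 429: prefer Retry-After, fall back to
// X-RateLimit-Reset, then to a default backoff in seconds.
function waitSecondsFor429(headers: Headers, nowMs: number = Date.now(), fallback = 1): number {
  const retryAfter = headers.get('Retry-After');
  if (retryAfter !== null && !Number.isNaN(Number(retryAfter))) {
    return Number(retryAfter);
  }
  const reset = headers.get('X-RateLimit-Reset');
  if (reset !== null) {
    // Reset header is a Unix timestamp in seconds.
    return Math.max(fallback, Number(reset) - Math.floor(nowMs / 1000));
  }
  return fallback;
}
```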

Increasing Limits

Upgrade Your Plan

Higher plans include higher limits:

  1. Go to Settings > Billing
  2. Click Upgrade Plan
  3. Select a higher tier

Request Enterprise Limits

For custom limits, contact sales:

  • Custom requests per minute
  • Custom daily quotas
  • Dedicated capacity
  • SLA guarantees

Best Practices

1. Implement Client-Side Throttling

Don't wait for 429 errors:

import Bottleneck from 'bottleneck';
 
const limiter = new Bottleneck({
  maxConcurrent: 10,
  minTime: 50,  // 50ms between requests = 20/sec max
});
 
const rateLimitedCall = limiter.wrap(openai.chat.completions.create.bind(openai.chat.completions));

2. Use Caching

Enable caching to reduce request volume:

// Identical requests hit cache, don't count against limits
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
});

3. Batch Similar Requests

Combine multiple queries when possible:

// Instead of 10 separate requests
for (const question of questions) {
  await openai.chat.completions.create({...});
}
 
// Batch into one request
const combinedPrompt = questions.map((q, i) => `${i+1}. ${q}`).join('\n');
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Answer each question:\n${combinedPrompt}`
  }],
});
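
If the model answers as a numbered list (an assumption about the output format, which the prompt above encourages), the combined reply can be split back into one answer per question:

```typescript
// Split a numbered-list reply ("1. ...\n2. ...") into one answer per question.
function splitNumberedAnswers(reply: string): string[] {
  const answers: string[] = [];
  for (const line of reply.split('\n')) {
    const match = line.match(/^\s*(\d+)[.)]\s*(.*)$/);
    if (match) {
      answers.push(match[2]);
    } else if (answers.length > 0 && line.trim() !== '') {
      // Continuation line: append to the previous answer.
      answers[answers.length - 1] += ' ' + line.trim();
    }
  }
  return answers;
}
```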

4. Monitor Usage

Track usage to predict limits:

  1. Go to AI Gateway > Analytics
  2. View request volume over time
  3. Set up alerts for usage thresholds

Monitoring & Alerts

Usage Dashboard

View real-time usage:

  • Requests per minute graph
  • Daily request count
  • Token usage tracking
  • Provider-specific breakdowns

Alert Configuration

Set up alerts before hitting limits:

  1. Go to Settings > Alerts
  2. Create alert:
    • Metric: Request Rate
    • Threshold: 80% of limit
    • Channel: Email/Slack/Webhook

Next Steps