Rate Limiting

Understanding and managing rate limits for AI Gateway.

Overview

AI Gateway implements rate limiting at multiple levels to ensure fair usage and protect against abuse. Understanding these limits helps you design resilient applications.

Rate Limit Tiers

Plan        Requests/min   Requests/day   Tokens/day
Free        20             1,000          100,000
Pro         200            50,000         10,000,000
Team        1,000          250,000        50,000,000
Enterprise  Custom         Custom         Custom

Rate Limit Headers

Every response includes rate limit information:

X-RateLimit-Limit: 200        # Max requests per minute
X-RateLimit-Remaining: 150    # Requests remaining
X-RateLimit-Reset: 1706140800 # Unix timestamp when limit resets
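
As a sketch, these headers can be read into a small struct so the rest of your code doesn't deal with raw strings (the header names above are the only assumption; this works with any fetch-style Headers object):

```typescript
interface RateLimitInfo {
  limit: number;      // max requests per minute
  remaining: number;  // requests left in the current window
  resetAt: number;    // Unix timestamp (seconds) when the window resets
}

// Parse the gateway's rate-limit headers from a fetch-style Headers object.
function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  return {
    limit: Number(headers.get('X-RateLimit-Limit') ?? 0),
    remaining: Number(headers.get('X-RateLimit-Remaining') ?? 0),
    resetAt: Number(headers.get('X-RateLimit-Reset') ?? 0),
  };
}

// Seconds to wait until the window resets (never negative).
function secondsUntilReset(info: RateLimitInfo, nowMs: number = Date.now()): number {
  return Math.max(0, info.resetAt - Math.floor(nowMs / 1000));
}
```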

Types of Limits

1. Gateway Limits

Applied by AI Gateway to your account:

Limit                 Scope             Action
Requests per minute   Per API key       429 error
Requests per day      Per organization  429 error
Tokens per day        Per organization  429 error
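
The per-key, per-minute limit behaves like a fixed-window counter: requests increment a counter for the current window, and anything over the limit gets a 429. A minimal sketch of that behavior (an illustration, not the gateway's actual implementation):

```typescript
// Fixed-window counter: allow at most `limit` requests per key per window.
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs = 60_000) {}

  // Returns true if the request is allowed, false if it would receive a 429.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const entry = this.counts.get(key);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      // New window: reset the counter for this key.
      this.counts.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count++;
    return true;
  }
}
```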

2. Provider Limits

Applied by the underlying provider (OpenAI, Anthropic, etc.):

Provider    Limit                AI Gateway Behavior
OpenAI      3,500 RPM (GPT-4)    Triggers fallback
Anthropic   1,000 RPM            Triggers fallback
Google      60 RPM (free tier)   Triggers fallback

When a provider returns a rate-limit error, AI Gateway can automatically fall back to another provider.
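
The same pattern can be sketched on the client side: try each provider in order and fall through to the next one only on a 429 (the provider list and the `status` field on the error are assumptions for illustration):

```typescript
interface ProviderError extends Error {
  status?: number;
}

// Try each provider call in order; on a 429, fall through to the next one.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const call of providers) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if ((error as ProviderError).status === 429) continue; // rate limited: try next provider
      throw error; // other errors are not retried here
    }
  }
  throw lastError;
}
```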

3. Per-Key Limits

Set custom limits per API key:

  1. Go to Settings > Gateway API Keys
  2. Click on a key to edit
  3. Set Rate Limit (requests/minute)
  4. Click Save

Handling Rate Limits

Check Before You're Limited

Monitor headers to predict limits:

// The SDK returns the parsed body; use withResponse() to also get the raw
// HTTP response, which carries the rate-limit headers
const { data, response } = await openai.chat.completions
  .create({...})
  .withResponse();
 
// Check remaining quota
const remaining = response.headers.get('X-RateLimit-Remaining');
if (remaining !== null && parseInt(remaining, 10) < 10) {
  console.warn('Approaching rate limit');
}

Implement Retry Logic

Handle 429 errors with exponential backoff:

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429) {
        // Honor Retry-After when present, otherwise back off exponentially
        const retryAfter = Number(error.headers?.['retry-after']) || Math.pow(2, i);
        console.log(`Rate limited, retrying in ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}

Use Queuing

For high-volume applications, implement a request queue:

import PQueue from 'p-queue';
 
const queue = new PQueue({
  concurrency: 10,      // Max concurrent requests
  interval: 60000,      // 1 minute
  intervalCap: 200,     // Max per interval
});
 
async function rateLimitedRequest(params) {
  return queue.add(() => openai.chat.completions.create(params));
}

Rate Limit Responses

429 Too Many Requests

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "type": "rate_limit_error"
  }
}

Headers:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 200
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706140830
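
When both headers are present, Retry-After is the simpler signal. A sketch of picking a wait time from a 429 response (the header names come from the example above; the default backoff is an assumption):

```typescript
// Decide how long to wait after a 429: prefer Retry-After, fall back to
// X-RateLimit-Reset, then to a default backoff in seconds.
function waitSecondsFor429(headers: Headers, nowMs: number = Date.now(), fallback = 1): number {
  const retryAfter = headers.get('Retry-After');
  if (retryAfter !== null && !Number.isNaN(Number(retryAfter))) {
    return Number(retryAfter);
  }
  const reset = headers.get('X-RateLimit-Reset');
  if (reset !== null) {
    // Reset header is a Unix timestamp in seconds.
    return Math.max(fallback, Number(reset) - Math.floor(nowMs / 1000));
  }
  return fallback;
}
```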

Increasing Limits

Upgrade Your Plan

Higher plans include higher limits:

  1. Go to Settings > Billing
  2. Click Upgrade Plan
  3. Select a higher tier

Request Enterprise Limits

For custom limits, contact sales:

  • Custom requests per minute
  • Custom daily quotas
  • Dedicated capacity
  • SLA guarantees

Best Practices

1. Implement Client-Side Throttling

Don't wait for 429 errors:

import Bottleneck from 'bottleneck';
 
const limiter = new Bottleneck({
  maxConcurrent: 10,
  minTime: 50,  // 50ms between requests = 20/sec max
});
 
const rateLimitedCall = limiter.wrap(openai.chat.completions.create.bind(openai.chat.completions));

2. Use Caching

Enable caching to reduce request volume:

// Identical requests hit cache, don't count against limits
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
});

3. Batch Similar Requests

Combine multiple queries when possible:

// Instead of 10 separate requests
for (const question of questions) {
  await openai.chat.completions.create({...});
}
 
// Batch into one request
const combinedPrompt = questions.map((q, i) => `${i+1}. ${q}`).join('\n');
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Answer each question:\n${combinedPrompt}`
  }],
});
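
If the model answers as a numbered list (an assumption about the output format, which the prompt above encourages), the combined reply can be split back into one answer per question:

```typescript
// Split a numbered-list reply ("1. ...\n2. ...") into one answer per question.
function splitNumberedAnswers(reply: string): string[] {
  const answers: string[] = [];
  for (const line of reply.split('\n')) {
    const match = line.match(/^\s*(\d+)[.)]\s*(.*)$/);
    if (match) {
      answers.push(match[2]);
    } else if (answers.length > 0 && line.trim() !== '') {
      // Continuation line: append to the previous answer.
      answers[answers.length - 1] += ' ' + line.trim();
    }
  }
  return answers;
}
```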

4. Monitor Usage

Track usage to predict limits:

  1. Go to AI Gateway > Analytics
  2. View request volume over time
  3. Set up alerts for usage thresholds

Monitoring & Alerts

Usage Dashboard

View real-time usage:

  • Requests per minute graph
  • Daily request count
  • Token usage tracking
  • Provider-specific breakdowns

Alert Configuration

Set up alerts before hitting limits:

  1. Go to Settings > Alerts
  2. Create alert:
    • Metric: Request Rate
    • Threshold: 80% of limit
    • Channel: Email/Slack/Webhook

Next Steps