We Built an AI Gateway That Routes Across 13 LLM Providers. Here is How.
Architecture deep-dive into building a unified LLM proxy that routes requests across OpenAI, Anthropic, Google, Mistral, and more with load balancing, failover, and schema normalization.
Transactional Team
Jan 22, 2026
12 min read
The Problem With Talking to 13 Different APIs
Many production AI stacks start with exactly one LLM provider. Every request goes through a single API, every bill comes from one vendor, and every outage takes the entire AI pipeline down.
Then Claude gets good. Then Gemini gets fast. Then Mistral gets cheap. Suddenly a team needs four providers in production, and a clean single-SDK integration turns into a sprawling mess of provider-specific clients, incompatible schemas, and retry logic duplicated across every service.
In a typical multi-provider codebase, thousands of lines of code end up dedicated purely to LLM provider abstraction, scattered across dozens of files. That is when a gateway becomes essential.
AI Gateway Production Stats
Failover recovery: 94%
Cache hit rate (classification): 70%
Cache hit rate (chat): 10%
Latency overhead: 8 ms
What an AI Gateway Actually Does
At its core, an AI gateway is a reverse proxy that sits between your application and LLM providers. Your app sends requests in a single format. The gateway translates, routes, and manages everything else.
Your Application
|
v
AI Gateway (single endpoint)
|
├── Schema Normalization
├── Routing & Load Balancing
├── Caching Layer
├── Rate Limiting
├── Observability
|
v
┌─────────┬───────────┬────────┬─────────┐
│ OpenAI │ Anthropic │ Google │ Mistral │ ...
└─────────┴───────────┴────────┴─────────┘
But the devil is in the details. Every provider has a different request format, different error codes, different streaming protocols, and different ideas about what a "message" looks like.
Schema Normalization: The Hardest Part
The OpenAI chat completion format serves as the canonical schema. Not because it is the best, but because it is the most widely adopted. If you have used the OpenAI SDK, you already know our API.
The normalization layer handles three transformations:
Request Transformation
Every incoming request follows the OpenAI format. The gateway transforms it into the provider-specific format before forwarding.
// What your app sends (OpenAI format)
{
  model: "anthropic/claude-sonnet-4-20250514",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain TCP handshake." }
  ],
  max_tokens: 1024,
  temperature: 0.7
}

// What the gateway sends to Anthropic
{
  model: "claude-sonnet-4-20250514",
  system: "You are a helpful assistant.",
  messages: [
    { role: "user", content: "Explain TCP handshake." }
  ],
  max_tokens: 1024,
  temperature: 0.7
}
Notice how Anthropic separates the system prompt from the messages array. Google's Gemini does something different again -- it uses contents instead of messages and parts instead of content. Each provider has these kinds of structural differences.
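For comparison, here is roughly what the same request looks like after translation into Gemini's generateContent shape (field names follow Google's public API; the exact payload is illustrative):

// What a gateway might send to Google (Gemini generateContent format)
{
  systemInstruction: { parts: [{ text: "You are a helpful assistant." }] },
  contents: [
    { role: "user", parts: [{ text: "Explain TCP handshake." }] }
  ],
  generationConfig: { maxOutputTokens: 1024, temperature: 0.7 }
}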
Response Normalization
Responses coming back from providers get normalized into the OpenAI format before reaching your app. This means your parsing code never changes, regardless of which provider actually served the request.
Stream Normalization
This was the hardest piece. OpenAI uses Server-Sent Events with data: [DONE] termination. Anthropic uses SSE with typed events (message_start, content_block_delta, message_stop). Google uses a completely different streaming protocol.
All streams are normalized into OpenAI-compatible SSE chunks:
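A short completion might reach the client looking roughly like this (abridged; IDs and text are made up):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"A TCP handshake has three steps"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]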
The stream normalizer maintains a state machine per provider that tracks the current message state and emits normalized chunks. It handles edge cases like Anthropic's separate input_json deltas for tool calls and Google's aggregated response chunks.
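A stripped-down version of one such adapter, for Anthropic's event stream, might look like this (the event names are Anthropic's; the class and its internals are a sketch, not the gateway's actual code):

// Minimal Anthropic -> OpenAI stream adapter. Real adapters also handle
// tool_use blocks, usage accounting, and error events.
interface NormalizedChunk {
  id: string;
  object: "chat.completion.chunk";
  choices: {
    index: number;
    delta: { role?: string; content?: string };
    finish_reason: string | null;
  }[];
}

class AnthropicStreamAdapter {
  private messageId = "";

  // Feed one parsed Anthropic SSE event; get zero or more OpenAI-style chunks back.
  push(event: { type: string; [key: string]: any }): NormalizedChunk[] {
    switch (event.type) {
      case "message_start":
        this.messageId = event.message.id;
        return [this.chunk({ role: "assistant", content: "" }, null)];
      case "content_block_delta":
        return event.delta?.type === "text_delta"
          ? [this.chunk({ content: event.delta.text }, null)]
          : []; // input_json_delta (tool calls) takes a separate path
      case "message_stop":
        return [this.chunk({}, "stop")];
      default:
        return []; // ping, message_delta, content_block_start/stop, ...
    }
  }

  private chunk(
    delta: NormalizedChunk["choices"][number]["delta"],
    finish: string | null
  ): NormalizedChunk {
    return {
      id: this.messageId,
      object: "chat.completion.chunk",
      choices: [{ index: 0, delta, finish_reason: finish }],
    };
  }
}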
The Routing Layer
Routing determines which provider handles each request. There are three common strategies.
Model-Based Routing
The simplest approach. The model string contains the provider prefix:
// Route by model prefix
const providerMap: Record<string, Provider> = {
  "openai/": Provider.OPENAI,
  "anthropic/": Provider.ANTHROPIC,
  "google/": Provider.GOOGLE,
  "mistral/": Provider.MISTRAL,
  "deepseek/": Provider.DEEPSEEK,
  "meta/": Provider.META,
  "cohere/": Provider.COHERE,
  // ... 13 providers total
};

function resolveProvider(model: string): { provider: Provider; model: string } {
  for (const [prefix, provider] of Object.entries(providerMap)) {
    if (model.startsWith(prefix)) {
      return { provider, model: model.slice(prefix.length) };
    }
  }
  throw new GatewayError("UNKNOWN_MODEL", `No provider for model: ${model}`);
}
Failover Routing
When a provider returns a 5xx error, rate limit, or times out, the gateway automatically retries with the next provider in the fallback chain.
The key insight is that failover must be fast. Set aggressive timeouts per provider (10 seconds for the primary, 8 for the secondary) and make the failover decision on first-byte latency, not total response time. If a provider has not started streaming within the timeout, the gateway cuts over.
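Sketched out, the failover loop looks something like this, assuming a callProvider() helper that resolves as soon as the upstream starts streaming (the helper, the chain shape, and the fallback target are hypothetical):

interface Attempt {
  provider: string;
  model: string;
  firstByteTimeoutMs: number;
}

async function completeWithFailover(
  request: object,
  chain: Attempt[],
  callProvider: (attempt: Attempt, req: object, signal: AbortSignal) => Promise<ReadableStream>
): Promise<ReadableStream> {
  let lastError: unknown = new Error("empty fallback chain");
  for (const attempt of chain) {
    const controller = new AbortController();
    // Cut over if the provider has not produced its first byte within the timeout.
    const timer = setTimeout(() => controller.abort(), attempt.firstByteTimeoutMs);
    try {
      const stream = await callProvider(attempt, request, controller.signal);
      clearTimeout(timer);
      return stream; // first byte arrived; commit to this provider
    } catch (err) {
      clearTimeout(timer);
      lastError = err; // 5xx, rate limit, or timeout -- try the next provider
    }
  }
  throw lastError;
}

// Example chain using the timeouts mentioned above (fallback model is illustrative)
const sonnetChain: Attempt[] = [
  { provider: "anthropic", model: "claude-sonnet-4-20250514", firstByteTimeoutMs: 10_000 },
  { provider: "openai", model: "gpt-4o", firstByteTimeoutMs: 8_000 },
];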
Load Balancing
For providers that support multiple API keys or endpoints, the gateway distributes load using weighted round-robin. This is particularly useful for Azure OpenAI deployments where you might have multiple regional endpoints with different rate limits.
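A minimal sketch of the selection logic (endpoints, keys, and weights are placeholders):

interface Upstream { endpoint: string; apiKey: string; weight: number }

function makeWeightedRoundRobin(upstreams: Upstream[]): () => Upstream {
  // Naive expansion: an upstream with weight 3 appears 3 times in the cycle.
  const cycle = upstreams.flatMap((u) => Array.from({ length: u.weight }, () => u));
  let i = 0;
  return () => cycle[i++ % cycle.length];
}

const pickAzureUpstream = makeWeightedRoundRobin([
  { endpoint: "https://eastus.example.openai.azure.com", apiKey: "KEY_A", weight: 3 },
  { endpoint: "https://westeu.example.openai.azure.com", apiKey: "KEY_B", weight: 1 },
]);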
Error Normalization
Every provider has different error codes. OpenAI returns 429 for rate limits. Anthropic returns 429 with an overloaded error type. Google returns 429 but sometimes 503. We normalize all of these into a consistent error schema:
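The normalized shape looks something like this (the specific codes and fields below are illustrative, not the gateway's published schema):

interface GatewayErrorBody {
  error: {
    code:
      | "RATE_LIMITED"
      | "PROVIDER_OVERLOADED"
      | "CONTEXT_LENGTH_EXCEEDED"
      | "INVALID_REQUEST"
      | "PROVIDER_UNAVAILABLE"
      | "UNKNOWN_MODEL";
    message: string;
    provider?: string;   // which upstream produced the error
    retryable: boolean;  // whether the gateway (or you) should retry
    status: number;      // normalized HTTP status returned to the client
  };
}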
Each provider adapter maps its native errors into these codes. This means your error handling code works identically regardless of which provider errored.
Caching: The 90% Cost Reducer
Caching operates at two levels:
Exact match caching uses a hash of the normalized request (model, messages, parameters) to serve identical requests from cache. This catches automated pipelines that make the same call repeatedly.
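A sketch of how the exact-match key might be derived (the field selection and hashing details are assumptions; the real key is also scoped per API key, as described below):

import { createHash } from "node:crypto";

// Hash of the normalized request: anything that changes the output must be included.
function exactMatchKey(request: {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
}): string {
  const canonical = JSON.stringify({
    model: request.model,
    messages: request.messages,
    temperature: request.temperature ?? 1,
    max_tokens: request.max_tokens ?? null,
  });
  return createHash("sha256").update(canonical).digest("hex");
}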
Semantic caching uses embeddings to find similar-enough previous requests. If someone asked "What is the capital of France?" and then "What's France's capital city?", semantic caching catches that. A similarity threshold of 0.95 works well by default -- high enough to avoid incorrect cache hits, low enough to catch meaningful duplicates.
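A minimal sketch of the lookup side, assuming an embed() helper backed by some embedding model and a naive in-memory index (a real deployment would use a vector index):

const SIMILARITY_THRESHOLD = 0.95;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticLookup(
  prompt: string,
  embed: (text: string) => Promise<number[]>,
  entries: { embedding: number[]; response: string }[]
): Promise<string | null> {
  const query = await embed(prompt);
  for (const entry of entries) {
    // Serve the cached response only if similarity clears the threshold.
    if (cosineSimilarity(query, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }
  return null;
}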
The cache is scoped per API key. One customer's cache never leaks into another's.
Observability: Every Token Counted
Every request through the gateway generates a structured log entry:
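Roughly along these lines (every field name and value here is illustrative):

{
  request_id: "req_9f3a2b",
  api_key_id: "key_prod_42",
  model: "anthropic/claude-sonnet-4-20250514",
  provider: "anthropic",
  cache: "miss",
  input_tokens: 412,
  output_tokens: 220,
  time_to_first_byte_ms: 310,
  latency_ms: 1840,
  status: 200,
  fallback_used: false,
  cost_usd: 0.004536
}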
Cost is calculated using per-model pricing tables updated weekly. This gives teams real-time visibility into what their AI features actually cost, broken down by model, endpoint, and user.
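The per-request math is simple once token counts are in the log (the prices below are placeholders, not current rates):

const PRICING_PER_MILLION_TOKENS: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o": { input: 2.5, output: 10 },
  "anthropic/claude-sonnet-4-20250514": { input: 3, output: 15 },
};

function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING_PER_MILLION_TOKENS[model];
  if (!price) return 0; // unknown model: surface as zero and flag for the weekly table update
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}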
What We Would Do Differently
Based on experience building this type of system, a few things worth noting:
Start with fewer providers. Launch with a handful and add incrementally. Teams that try to support every provider on day one never ship.
Invest in stream testing early. Stream normalization bugs are the hardest to reproduce. A stream replay system that records raw provider streams and replays them through the normalizer is invaluable. Build it early.
Do not abstract tool calling too early. OpenAI and Anthropic have fundamentally different approaches to tool calling. Trying to normalize them immediately leads to endless edge cases. A better approach is to normalize the format but preserve provider-specific behaviors behind feature flags.
The Numbers
Production numbers for a mature AI gateway:
Latency overhead: 3-8ms per request (mostly serialization)
Failover success rate: 94% of failed requests recover on a secondary provider
Cache hit rate: Varies wildly by use case. Classification pipelines hit 60-80%. Chat applications hit 5-15%.
Provider count: 13 providers, 47 models
At scale, the routing and normalization layer is never the bottleneck -- that honor goes to the providers themselves.
Key Takeaway
An AI gateway is not a nice-to-have abstraction. It is infrastructure. The moment you depend on more than one LLM provider -- and you will -- you need a translation layer that your application code never has to think about.
The alternative is provider-specific code scattered across your codebase, growing with every new provider and model you adopt. The gateway collapses all of that into a single integration.
Check out our AI Gateway to see this architecture in action.