We Built an AI Gateway That Routes Across 13 LLM Providers. Here is How.
Architecture deep-dive into building a unified LLM proxy that routes requests across OpenAI, Anthropic, Google, Mistral, and more with load balancing, failover, and schema normalization.
Transactional Team
Jan 22, 2026
12 min read
The Problem With Talking to 13 Different APIs
Many production AI stacks start with exactly one LLM provider. Every request goes through a single API, every bill comes from one vendor, and every outage takes the entire AI pipeline down.
Then Claude gets good. Then Gemini gets fast. Then Mistral gets cheap. Suddenly a team needs four providers in production, and a clean single-SDK integration turns into a sprawling mess of provider-specific clients, incompatible schemas, and retry logic duplicated across every service.
In a typical multi-provider codebase, thousands of lines of code end up dedicated purely to LLM provider abstraction, scattered across dozens of files. That is when a gateway becomes essential.
AI Gateway Production Stats
Failover recovery: 94%
Cache hit rate (classification): 70%
Cache hit rate (chat): 10%
Latency overhead: 8 ms
What an AI Gateway Actually Does
At its core, an AI gateway is a reverse proxy that sits between your application and LLM providers. Your app sends requests in a single format. The gateway translates, routes, and manages everything else.
Your Application
|
v
AI Gateway (single endpoint)
|
├── Schema Normalization
├── Routing & Load Balancing
├── Caching Layer
├── Rate Limiting
├── Observability
|
v
┌─────────┬───────────┬────────┬─────────┐
│ OpenAI │ Anthropic │ Google │ Mistral │ ...
└─────────┴───────────┴────────┴─────────┘
But the devil is in the details. Every provider has a different request format, different error codes, different streaming protocols, and different ideas about what a "message" looks like.
Schema Normalization: The Hardest Part
The OpenAI chat completion format serves as the canonical schema. Not because it is the best, but because it is the most widely adopted. If you have used the OpenAI SDK, you already know our API.
The normalization layer handles three transformations:
Request Transformation
Every incoming request follows the OpenAI format. The gateway transforms it into the provider-specific format before forwarding.
// What your app sends (OpenAI format)
{
  model: "anthropic/claude-sonnet-4-20250514",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain TCP handshake." }
  ],
  max_tokens: 1024,
  temperature: 0.7
}

// What the gateway sends to Anthropic
{
  model: "claude-sonnet-4-20250514",
  system: "You are a helpful assistant.",
  messages: [
    { role: "user", content: "Explain TCP handshake." }
  ],
  max_tokens: 1024,
  temperature: 0.7
}
Notice how Anthropic separates the system prompt from the messages array. Google's Gemini does something different again -- it uses contents instead of messages and parts instead of content. Each provider has these kinds of structural differences.
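For comparison, here is roughly what the same request looks like after translation into Gemini's generateContent shape (field names follow Google's public API; the exact payload is illustrative):

// What a gateway might send to Google (Gemini generateContent format)
{
  systemInstruction: { parts: [{ text: "You are a helpful assistant." }] },
  contents: [
    { role: "user", parts: [{ text: "Explain TCP handshake." }] }
  ],
  generationConfig: { maxOutputTokens: 1024, temperature: 0.7 }
}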
Response Normalization
Responses coming back from providers get normalized into the OpenAI format before reaching your app. This means your parsing code never changes, regardless of which provider actually served the request.
Stream Normalization
This was the hardest piece. OpenAI uses Server-Sent Events with data: [DONE] termination. Anthropic uses SSE with typed events (message_start, content_block_delta, message_stop). Google uses a completely different streaming protocol.
All streams are normalized into OpenAI-compatible SSE chunks:
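A short completion might reach the client looking roughly like this (abridged; IDs and text are made up):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"A TCP handshake has three steps"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]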
The stream normalizer maintains a state machine per provider that tracks the current message state and emits normalized chunks. It handles edge cases like Anthropic's separate input_json deltas for tool calls and Google's aggregated response chunks.
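A stripped-down version of one such adapter, for Anthropic's event stream, might look like this (the event names are Anthropic's; the class and its internals are a sketch, not the gateway's actual code):

// Minimal Anthropic -> OpenAI stream adapter. Real adapters also handle
// tool_use blocks, usage accounting, and error events.
interface NormalizedChunk {
  id: string;
  object: "chat.completion.chunk";
  choices: {
    index: number;
    delta: { role?: string; content?: string };
    finish_reason: string | null;
  }[];
}

class AnthropicStreamAdapter {
  private messageId = "";

  // Feed one parsed Anthropic SSE event; get zero or more OpenAI-style chunks back.
  push(event: { type: string; [key: string]: any }): NormalizedChunk[] {
    switch (event.type) {
      case "message_start":
        this.messageId = event.message.id;
        return [this.chunk({ role: "assistant", content: "" }, null)];
      case "content_block_delta":
        return event.delta?.type === "text_delta"
          ? [this.chunk({ content: event.delta.text }, null)]
          : []; // input_json_delta (tool calls) takes a separate path
      case "message_stop":
        return [this.chunk({}, "stop")];
      default:
        return []; // ping, message_delta, content_block_start/stop, ...
    }
  }

  private chunk(
    delta: NormalizedChunk["choices"][number]["delta"],
    finish: string | null
  ): NormalizedChunk {
    return {
      id: this.messageId,
      object: "chat.completion.chunk",
      choices: [{ index: 0, delta, finish_reason: finish }],
    };
  }
}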
The Routing Layer
Routing determines which provider handles each request. There are three common strategies.
Model-Based Routing
The simplest approach. The model string contains the provider prefix:
// Route by model prefix
const providerMap: Record<string, Provider> = {
  "openai/": Provider.OPENAI,
  "anthropic/": Provider.ANTHROPIC,
  "google/": Provider.GOOGLE,
  "mistral/": Provider.MISTRAL,
  "deepseek/": Provider.DEEPSEEK,
  "meta/": Provider.META,
  "cohere/": Provider.COHERE,
  // ... 13 providers total
};

function resolveProvider(model: string): { provider: Provider; model: string } {
  for (const [prefix, provider] of Object.entries(providerMap)) {
    if (model.startsWith(prefix)) {
      return { provider, model: model.slice(prefix.length) };
    }
  }
  throw new GatewayError("UNKNOWN_MODEL", `No provider for model: ${model}`);
}
Failover Routing
When a provider returns a 5xx error, rate limit, or times out, the gateway automatically retries with the next provider in the fallback chain.
The key insight is that failover must be fast. Set aggressive timeouts per provider (10 seconds for the primary, 8 for the secondary) and make the failover decision on first-byte latency, not total response time. If a provider has not started streaming within the timeout, the gateway cuts over.
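Sketched out, the failover loop looks something like this, assuming a callProvider() helper that resolves as soon as the upstream starts streaming (the helper, the chain shape, and the fallback target are hypothetical):

interface Attempt {
  provider: string;
  model: string;
  firstByteTimeoutMs: number;
}

async function completeWithFailover(
  request: object,
  chain: Attempt[],
  callProvider: (attempt: Attempt, req: object, signal: AbortSignal) => Promise<ReadableStream>
): Promise<ReadableStream> {
  let lastError: unknown = new Error("empty fallback chain");
  for (const attempt of chain) {
    const controller = new AbortController();
    // Cut over if the provider has not produced its first byte within the timeout.
    const timer = setTimeout(() => controller.abort(), attempt.firstByteTimeoutMs);
    try {
      const stream = await callProvider(attempt, request, controller.signal);
      clearTimeout(timer);
      return stream; // first byte arrived; commit to this provider
    } catch (err) {
      clearTimeout(timer);
      lastError = err; // 5xx, rate limit, or timeout -- try the next provider
    }
  }
  throw lastError;
}

// Example chain using the timeouts mentioned above (fallback model is illustrative)
const sonnetChain: Attempt[] = [
  { provider: "anthropic", model: "claude-sonnet-4-20250514", firstByteTimeoutMs: 10_000 },
  { provider: "openai", model: "gpt-4o", firstByteTimeoutMs: 8_000 },
];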
Load Balancing
For providers that support multiple API keys or endpoints, the gateway distributes load using weighted round-robin. This is particularly useful for Azure OpenAI deployments where you might have multiple regional endpoints with different rate limits.
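A minimal sketch of the selection logic (endpoints, keys, and weights are placeholders):

interface Upstream { endpoint: string; apiKey: string; weight: number }

function makeWeightedRoundRobin(upstreams: Upstream[]): () => Upstream {
  // Naive expansion: an upstream with weight 3 appears 3 times in the cycle.
  const cycle = upstreams.flatMap((u) => Array.from({ length: u.weight }, () => u));
  let i = 0;
  return () => cycle[i++ % cycle.length];
}

const pickAzureUpstream = makeWeightedRoundRobin([
  { endpoint: "https://eastus.example.openai.azure.com", apiKey: "KEY_A", weight: 3 },
  { endpoint: "https://westeu.example.openai.azure.com", apiKey: "KEY_B", weight: 1 },
]);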
Error Normalization
Every provider has different error codes. OpenAI returns 429 for rate limits. Anthropic returns 429 with an overloaded error type. Google returns 429 but sometimes 503. We normalize all of these into a consistent error schema:
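The normalized shape looks something like this (the specific codes and fields below are illustrative, not the gateway's published schema):

interface GatewayErrorBody {
  error: {
    code:
      | "RATE_LIMITED"
      | "PROVIDER_OVERLOADED"
      | "CONTEXT_LENGTH_EXCEEDED"
      | "INVALID_REQUEST"
      | "PROVIDER_UNAVAILABLE"
      | "UNKNOWN_MODEL";
    message: string;
    provider?: string;   // which upstream produced the error
    retryable: boolean;  // whether the gateway (or you) should retry
    status: number;      // normalized HTTP status returned to the client
  };
}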
Each provider adapter maps its native errors into these codes. This means your error handling code works identically regardless of which provider errored.
Caching: The 90% Cost Reducer
Caching operates at two levels:
Exact match caching uses a hash of the normalized request (model, messages, parameters) to serve identical requests from cache. This catches automated pipelines that make the same call repeatedly.
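A sketch of how the exact-match key might be derived (the field selection and hashing details are assumptions; the real key is also scoped per API key, as described below):

import { createHash } from "node:crypto";

// Hash of the normalized request: anything that changes the output must be included.
function exactMatchKey(request: {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
}): string {
  const canonical = JSON.stringify({
    model: request.model,
    messages: request.messages,
    temperature: request.temperature ?? 1,
    max_tokens: request.max_tokens ?? null,
  });
  return createHash("sha256").update(canonical).digest("hex");
}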
Semantic caching uses embeddings to find similar-enough previous requests. If someone asked "What is the capital of France?" and then "What's France's capital city?", semantic caching catches that. A similarity threshold of 0.95 works well by default -- high enough to avoid incorrect cache hits, low enough to catch meaningful duplicates.
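A minimal sketch of the lookup side, assuming an embed() helper backed by some embedding model and a naive in-memory index (a real deployment would use a vector index):

const SIMILARITY_THRESHOLD = 0.95;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticLookup(
  prompt: string,
  embed: (text: string) => Promise<number[]>,
  entries: { embedding: number[]; response: string }[]
): Promise<string | null> {
  const query = await embed(prompt);
  for (const entry of entries) {
    // Serve the cached response only if similarity clears the threshold.
    if (cosineSimilarity(query, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }
  return null;
}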
The cache is scoped per API key. One customer's cache never leaks into another's.
Observability: Every Token Counted
Every request through the gateway generates a structured log entry:
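Roughly along these lines (every field name and value here is illustrative):

{
  request_id: "req_9f3a2b",
  api_key_id: "key_prod_42",
  model: "anthropic/claude-sonnet-4-20250514",
  provider: "anthropic",
  cache: "miss",
  input_tokens: 412,
  output_tokens: 220,
  time_to_first_byte_ms: 310,
  latency_ms: 1840,
  status: 200,
  fallback_used: false,
  cost_usd: 0.004536
}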
Cost is calculated using per-model pricing tables updated weekly. This gives teams real-time visibility into what their AI features actually cost, broken down by model, endpoint, and user.
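The per-request math is simple once token counts are in the log (the prices below are placeholders, not current rates):

const PRICING_PER_MILLION_TOKENS: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o": { input: 2.5, output: 10 },
  "anthropic/claude-sonnet-4-20250514": { input: 3, output: 15 },
};

function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING_PER_MILLION_TOKENS[model];
  if (!price) return 0; // unknown model: surface as zero and flag for the weekly table update
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}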
What We Would Do Differently
Based on experience building this type of system, a few things worth noting:
Start with fewer providers. Launch with a handful and add incrementally. Teams that try to support every provider on day one never ship.
Invest in stream testing early. Stream normalization bugs are the hardest to reproduce. A stream replay system that records raw provider streams and replays them through the normalizer is invaluable. Build it early.
Do not abstract tool calling too early. OpenAI and Anthropic have fundamentally different approaches to tool calling. Trying to normalize them immediately leads to endless edge cases. A better approach is to normalize the format but preserve provider-specific behaviors behind feature flags.
The Numbers
Production numbers for a mature AI gateway:
Latency overhead: 3-8ms per request (mostly serialization)
Failover success rate: 94% of failed requests recover on a secondary provider
Cache hit rate: Varies wildly by use case. Classification pipelines hit 60-80%. Chat applications hit 5-15%.
Provider count: 13 providers, 47 models
At scale, the routing and normalization layer is never the bottleneck -- that honor goes to the providers themselves.
Key Takeaway
An AI gateway is not a nice-to-have abstraction. It is infrastructure. The moment you depend on more than one LLM provider -- and you will -- you need a translation layer that your application code never has to think about.
The alternative is provider-specific code scattered across your codebase, growing with every new provider and model you adopt. The gateway collapses all of that into a single integration.
Check out our AI Gateway to see this architecture in action.