# Streaming

Real-time streaming responses with Server-Sent Events.

## Overview

AI Gateway supports streaming responses via Server-Sent Events (SSE). Streaming delivers tokens as they're generated, reducing perceived latency and enabling real-time user experiences.

## Enabling Streaming

### With OpenAI SDK
```typescript
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.transactional.dev/ai/v1',
  apiKey: process.env.GATEWAY_API_KEY,
});

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Enable streaming
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}
```

### With Fetch API
```typescript
const response = await fetch('https://api.transactional.dev/ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GATEWAY_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // A network chunk can end mid-line, so buffer any partial SSE line
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? '';

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') continue;
    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content || '';
    process.stdout.write(content);
  }
}
```

## Stream Format
Streaming responses follow the SSE format, with each event on a `data:` line and a blank line between events:

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```
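The lines above can be decoded with a few string operations. A minimal sketch (`parseSSE` is an illustrative name, not part of the gateway API):

```typescript
// Extracts the delta content from each `data:` line of an SSE payload
// and stops at the [DONE] sentinel.
function parseSSE(raw: string): string {
  let text = '';
  for (const line of raw.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6).trim();
    if (data === '[DONE]') break;
    const chunk = JSON.parse(data);
    text += chunk.choices[0]?.delta?.content ?? '';
  }
  return text;
}
```

Running it over the example stream above yields the assembled text `Hello!`.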
## Chunk Structure

Each chunk contains:
```typescript
interface StreamChunk {
  id: string;
  object: 'chat.completion.chunk';
  created: number;
  model: string;
  choices: [{
    index: number;
    delta: {
      role?: 'assistant';
      content?: string;
    };
    finish_reason: 'stop' | 'length' | 'tool_calls' | null;
  }];
}
```

## Framework Examples
### Next.js API Route
```typescript
// app/api/chat/route.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.transactional.dev/ai/v1',
  apiKey: process.env.GATEWAY_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || '';
        controller.enqueue(encoder.encode(content));
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain' },
  });
}
```

### React Frontend
```tsx
// components/ChatStream.tsx
'use client';

import { useState } from 'react';

function ChatStream() {
  const [response, setResponse] = useState('');

  const streamChat = async () => {
    const res = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ messages: [{ role: 'user', content: 'Hello!' }] }),
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      const text = decoder.decode(value);
      setResponse(prev => prev + text);
    }
  };

  return (
    <div>
      <button onClick={streamChat}>Start Chat</button>
      <div>{response}</div>
    </div>
  );
}
```

### Vercel AI SDK
```typescript
import { OpenAIStream, StreamingTextResponse } from 'ai';
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.transactional.dev/ai/v1',
  apiKey: process.env.GATEWAY_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
```

## Streaming with Tool Calls
Stream responses can include tool calls. The `function.arguments` string arrives in fragments, so accumulate it across chunks and parse it only after the stream finishes:

```typescript
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
  tools: [
    {
      type: 'function',
      function: {
        name: 'get_weather',
        parameters: {
          type: 'object',
          properties: {
            location: { type: 'string' },
          },
        },
      },
    },
  ],
  stream: true,
});

let toolCall = null;

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;

  if (delta?.tool_calls) {
    // Accumulate tool call data across chunks
    toolCall = toolCall || { name: '', arguments: '' };
    toolCall.name += delta.tool_calls[0]?.function?.name || '';
    toolCall.arguments += delta.tool_calls[0]?.function?.arguments || '';
  }

  if (delta?.content) {
    process.stdout.write(delta.content);
  }
}

if (toolCall) {
  // Arguments are only valid JSON once the stream has completed
  console.log('Tool call:', toolCall.name, JSON.parse(toolCall.arguments));
}
```

## Caching with Streaming
Streamed responses can be cached:
- The complete response is assembled during streaming
- After the stream completes, it's cached
- Subsequent identical requests receive cached (non-streaming) responses
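The assembly step amounts to folding each chunk's `delta` into a single complete message. A hypothetical sketch of that logic (`assembleForCache` is an illustrative name, not the gateway's actual implementation):

```typescript
// Folds a sequence of stream chunks back into one complete,
// cacheable message: concatenate content deltas, keep the last
// non-null finish_reason.
interface Delta { role?: string; content?: string }
interface Chunk {
  choices: { index: number; delta: Delta; finish_reason: string | null }[];
}

function assembleForCache(chunks: Chunk[]) {
  let role = 'assistant';
  let content = '';
  let finishReason: string | null = null;

  for (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (choice.delta.role) role = choice.delta.role;
    content += choice.delta.content ?? '';
    if (choice.finish_reason) finishReason = choice.finish_reason;
  }

  return { role, content, finish_reason: finishReason };
}
```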
To force streaming even on cache hits:
```typescript
// This will always stream, even from cache
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true,
}, {
  headers: { 'X-Cache-Stream': 'true' },
});
```

## Error Handling
Handle stream errors gracefully; a stream can fail mid-response even after a successful connection:

```typescript
try {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [...],
    stream: true,
  });

  for await (const chunk of stream) {
    // Process chunks
  }
} catch (error) {
  if (error.code === 'ECONNRESET') {
    // Connection was reset mid-stream: retry, ideally with backoff
  } else if (error.status === 429) {
    // Rate limited: wait before retrying
  } else {
    // Other error: surface to the caller
  }
}
```

## Performance Considerations
### Time to First Token (TTFT)

Streaming reduces perceived latency:
| Metric | Non-Streaming | Streaming |
|---|---|---|
| TTFT | ~2-5 seconds | ~200-500ms |
| Complete | ~2-5 seconds | ~2-5 seconds |
| User Experience | Wait for full response | See response build |
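TTFT can be measured client-side as the gap between issuing the request and receiving the first content delta. A small sketch (`measureTTFT` is an illustrative helper, not a gateway API) that works with any async iterable of chunks, such as the OpenAI SDK stream:

```typescript
// Wraps an async iterable of stream chunks and records the delay
// (in milliseconds) until the first chunk carrying content arrives.
interface ContentChunk { choices: { delta: { content?: string } }[] }

async function measureTTFT<T extends ContentChunk>(
  stream: AsyncIterable<T>,
  onChunk: (chunk: T) => void,
): Promise<number> {
  const start = Date.now();
  let ttft = -1;

  for await (const chunk of stream) {
    if (ttft < 0 && chunk.choices[0]?.delta?.content) {
      ttft = Date.now() - start; // first content token observed
    }
    onChunk(chunk); // continue normal processing
  }

  return ttft; // -1 if no content ever arrived
}
```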
### Token Counting

Token counts become available only after the stream completes. With providers that support it, setting `stream_options: { include_usage: true }` on the request makes the final chunk carry a `usage` object:

```typescript
let usage = null;

for await (const chunk of stream) {
  // Process content...
  if (chunk.usage) usage = chunk.usage; // present on the final chunk only
}

console.log('Total tokens:', usage?.total_tokens);
```

## Next Steps
- Caching - Cache streamed responses
- Fallback - Handle stream errors
- API Reference - Full endpoint docs