# Streaming

Real-time streaming responses with Server-Sent Events.

## Overview

AI Gateway supports streaming responses via Server-Sent Events (SSE). Streaming delivers tokens as they're generated, reducing perceived latency and enabling real-time user experiences.

## Enabling Streaming

### With OpenAI SDK
```typescript
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.transactional.dev/ai/v1',
  apiKey: process.env.GATEWAY_API_KEY,
});

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Enable streaming
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}
```

### With Fetch API
```typescript
const response = await fetch('https://api.transactional.dev/ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GATEWAY_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // A network chunk can end mid-line, so buffer any partial SSE line
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? '';

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') continue;
    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content || '';
    process.stdout.write(content);
  }
}
```

## Stream Format
Streaming responses follow the SSE format, with each event on a `data:` line and a blank line between events:

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1706140800,"model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```
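The lines above can be decoded with a few string operations. A minimal sketch (`parseSSE` is an illustrative name, not part of the gateway API):

```typescript
// Extracts the delta content from each `data:` line of an SSE payload
// and stops at the [DONE] sentinel.
function parseSSE(raw: string): string {
  let text = '';
  for (const line of raw.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6).trim();
    if (data === '[DONE]') break;
    const chunk = JSON.parse(data);
    text += chunk.choices[0]?.delta?.content ?? '';
  }
  return text;
}
```

Running it over the example stream above yields the assembled text `Hello!`.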
## Chunk Structure

Each chunk contains:
```typescript
interface StreamChunk {
  id: string;
  object: 'chat.completion.chunk';
  created: number;
  model: string;
  choices: [{
    index: number;
    delta: {
      role?: 'assistant';
      content?: string;
    };
    finish_reason: 'stop' | 'length' | 'tool_calls' | null;
  }];
}
```

## Framework Examples
### Next.js API Route
```typescript
// app/api/chat/route.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.transactional.dev/ai/v1',
  apiKey: process.env.GATEWAY_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || '';
        controller.enqueue(encoder.encode(content));
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain' },
  });
}
```

### React Frontend
```tsx
// components/ChatStream.tsx
'use client';

import { useState } from 'react';

function ChatStream() {
  const [response, setResponse] = useState('');

  const streamChat = async () => {
    const res = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ messages: [{ role: 'user', content: 'Hello!' }] }),
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      const text = decoder.decode(value);
      setResponse(prev => prev + text);
    }
  };

  return (
    <div>
      <button onClick={streamChat}>Start Chat</button>
      <div>{response}</div>
    </div>
  );
}
```

### Vercel AI SDK
```typescript
import { OpenAIStream, StreamingTextResponse } from 'ai';
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.transactional.dev/ai/v1',
  apiKey: process.env.GATEWAY_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
```

## Streaming with Tool Calls
Stream responses can include tool calls. The `function.arguments` string arrives in fragments, so accumulate it across chunks and parse it only after the stream finishes:

```typescript
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
  tools: [
    {
      type: 'function',
      function: {
        name: 'get_weather',
        parameters: {
          type: 'object',
          properties: {
            location: { type: 'string' },
          },
        },
      },
    },
  ],
  stream: true,
});

let toolCall = null;

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;

  if (delta?.tool_calls) {
    // Accumulate tool call data across chunks
    toolCall = toolCall || { name: '', arguments: '' };
    toolCall.name += delta.tool_calls[0]?.function?.name || '';
    toolCall.arguments += delta.tool_calls[0]?.function?.arguments || '';
  }

  if (delta?.content) {
    process.stdout.write(delta.content);
  }
}

if (toolCall) {
  // Arguments are only valid JSON once the stream has completed
  console.log('Tool call:', toolCall.name, JSON.parse(toolCall.arguments));
}
```

## Caching with Streaming
Streamed responses can be cached:
- The complete response is assembled during streaming
- After the stream completes, it's cached
- Subsequent identical requests receive cached (non-streaming) responses
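The assembly step amounts to folding each chunk's `delta` into a single complete message. A hypothetical sketch of that logic (`assembleForCache` is an illustrative name, not the gateway's actual implementation):

```typescript
// Folds a sequence of stream chunks back into one complete,
// cacheable message: concatenate content deltas, keep the last
// non-null finish_reason.
interface Delta { role?: string; content?: string }
interface Chunk {
  choices: { index: number; delta: Delta; finish_reason: string | null }[];
}

function assembleForCache(chunks: Chunk[]) {
  let role = 'assistant';
  let content = '';
  let finishReason: string | null = null;

  for (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (choice.delta.role) role = choice.delta.role;
    content += choice.delta.content ?? '';
    if (choice.finish_reason) finishReason = choice.finish_reason;
  }

  return { role, content, finish_reason: finishReason };
}
```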
To force streaming even on cache hits:
```typescript
// This will always stream, even from cache
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true,
}, {
  headers: { 'X-Cache-Stream': 'true' },
});
```

## Error Handling
Handle stream errors gracefully; a stream can fail mid-response even after a successful connection:

```typescript
try {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [...],
    stream: true,
  });

  for await (const chunk of stream) {
    // Process chunks
  }
} catch (error) {
  if (error.code === 'ECONNRESET') {
    // Connection was reset mid-stream: retry, ideally with backoff
  } else if (error.status === 429) {
    // Rate limited: wait before retrying
  } else {
    // Other error: surface to the caller
  }
}
```

## Performance Considerations
### Time to First Token (TTFT)

Streaming reduces perceived latency:
| Metric | Non-Streaming | Streaming |
|---|---|---|
| TTFT | ~2-5 seconds | ~200-500ms |
| Complete | ~2-5 seconds | ~2-5 seconds |
| User Experience | Wait for full response | See response build |
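TTFT can be measured client-side as the gap between issuing the request and receiving the first content delta. A small sketch (`measureTTFT` is an illustrative helper, not a gateway API) that works with any async iterable of chunks, such as the OpenAI SDK stream:

```typescript
// Wraps an async iterable of stream chunks and records the delay
// (in milliseconds) until the first chunk carrying content arrives.
interface ContentChunk { choices: { delta: { content?: string } }[] }

async function measureTTFT<T extends ContentChunk>(
  stream: AsyncIterable<T>,
  onChunk: (chunk: T) => void,
): Promise<number> {
  const start = Date.now();
  let ttft = -1;

  for await (const chunk of stream) {
    if (ttft < 0 && chunk.choices[0]?.delta?.content) {
      ttft = Date.now() - start; // first content token observed
    }
    onChunk(chunk); // continue normal processing
  }

  return ttft; // -1 if no content ever arrived
}
```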
### Token Counting

Token counts become available only after the stream completes. With providers that support it, setting `stream_options: { include_usage: true }` on the request makes the final chunk carry a `usage` object:

```typescript
let usage = null;

for await (const chunk of stream) {
  // Process content...
  if (chunk.usage) usage = chunk.usage; // present on the final chunk only
}

console.log('Total tokens:', usage?.total_tokens);
```

## Next Steps
- Caching - Cache streamed responses
- Fallback - Handle stream errors
- API Reference - Full endpoint docs