Realtime API


Build low-latency, multimodal LLM applications with the Realtime API.

The OpenAI Realtime API enables low-latency communication with models that natively support speech-to-speech interactions as well as multimodal inputs (audio, images, and text) and outputs (audio and text). These APIs can also be used for realtime audio transcription.

Voice agents

One of the most common use cases for the Realtime API is building voice agents for speech-to-speech model interactions in the browser. Our recommended starting point for these types of applications is the Agents SDK for TypeScript, which uses a WebRTC connection to the Realtime model in the browser, and WebSocket when used on the server.

import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
    name: "Assistant",
    instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent);

// Automatically connects your microphone and audio output
await session.connect({
    apiKey: "<client-api-key>",
});

Follow the voice agent quickstart to build Realtime agents in the browser.

To use the Realtime API directly outside the context of voice agents, check out the other connection options below.

Connection methods

While building voice agents with the Agents SDK is the fastest path to one specific type of application, the Realtime API provides an entire suite of flexible tools for a variety of use cases.

There are three primary supported interfaces for the Realtime API:

  • WebRTC: ideal for browser and client-side interactions with a Realtime model.
  • WebSocket: ideal for middle-tier server-side applications with consistent low-latency network connections.
  • SIP: ideal for VoIP telephony connections.

Depending on how you'd like to connect to a Realtime model, check out one of the connection guides above to get started. You'll learn how to initialize a Realtime session, and how to interact with a Realtime model using client and server events.

API usage

Once connected to a Realtime model using one of the methods above, learn how to interact with the model in these usage guides.

Beta to GA migration

There are a few key differences between the interfaces in the Realtime beta API and the recently released GA API. The topics below cover what you need to know to migrate from the beta interface to GA.

For REST API requests, WebSocket connections, and other interfaces with the Realtime API, beta users had to include the following header with each request:
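OpenAI-Beta: realtime=v1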

Remove this header for requests to the GA interface. If you want to retain the behavior of the beta API, continue to include the header.

In the beta interface, there were multiple endpoints for generating ephemeral keys for either Realtime sessions or transcription sessions. In the GA interface, there is a single REST API endpoint for generating keys: POST /v1/realtime/client_secrets.

To create a session and receive a client secret you can use to initialize a WebRTC or WebSocket connection on a client, you can request one like this using the appropriate session configuration:

const sessionConfig = JSON.stringify({
    session: {
        type: "realtime",
        model: "gpt-realtime",
        audio: {
            output: { voice: "marin" },
        },
    },
});

const response = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
    },
    body: sessionConfig,
});

const data = await response.json();
console.log(data.value); // e.g. ek_68af296e8e408191a1120ab6383263c2

These tokens can safely be used in client environments like browsers and mobile applications.
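
As a sketch, a client can pass the returned value as the apiKey when connecting with the Agents SDK shown earlier. The /token route below is a hypothetical endpoint on your own backend that forwards the response from POST /v1/realtime/client_secrets:

import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

// Hypothetical backend route that proxies POST /v1/realtime/client_secrets
// and returns its JSON response (including the `value` field) to the browser.
const tokenResponse = await fetch("/token");
const { value: clientSecret } = await tokenResponse.json();

const agent = new RealtimeAgent({
    name: "Assistant",
    instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent);

// Connect using the ephemeral client secret instead of a standard API key
await session.connect({ apiKey: clientSecret });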

When initializing a WebRTC session in the browser, the URL for obtaining remote session information via SDP is now /v1/realtime/calls:

const baseUrl = "https://api.openai.com/v1/realtime/calls";
const model = "gpt-realtime";
const sdpResponse = await fetch(baseUrl, {
    method: "POST",
    body: offer.sdp,
    headers: {
        Authorization: `Bearer YOUR_EPHEMERAL_KEY_HERE`,
        "Content-Type": "application/sdp",
    },
});

const sdp = await sdpResponse.text();
const answer = { type: "answer", sdp };
await pc.setRemoteDescription(answer);
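
The snippet above assumes a local peer connection and SDP offer already exist. A minimal sketch of that browser-side setup might look like the following (the audio element handling and the "oai-events" data channel name follow common patterns and should be treated as assumptions, not requirements of the API):

// Minimal browser-side setup assumed by the SDP exchange above
const pc = new RTCPeerConnection();

// Play remote audio from the model
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = (event) => {
    audioEl.srcObject = event.streams[0];
};

// Send microphone audio to the model
const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(mic.getTracks()[0], mic);

// Data channel for exchanging Realtime client/server events as JSON
const dc = pc.createDataChannel("oai-events");
dc.addEventListener("message", (event) => {
    console.log(JSON.parse(event.data));
});

// Create and set the local offer used in the SDP exchange above
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);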

When creating or updating a Realtime session in the GA interface, you must now specify a session type, since the same client event is now used to create both speech-to-speech and transcription sessions. The options for the session type are:

  • realtime for speech-to-speech
  • transcription for realtime audio transcription
import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime";
const ws = new WebSocket(url, {
    headers: {
        Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    },
});

ws.on("open", function open() {
    console.log("Connected to server.");

    // Send client events over the WebSocket once connected
    ws.send(
        JSON.stringify({
            type: "session.update",
            session: {
                type: "realtime",
                instructions: "Be extra nice today!",
            },
        })
    );
});

Configuration for input modalities and other properties has moved as well, notably output audio configuration like the model voice. Check the API reference for the latest event shapes.

ws.on("open", function open() {
    ws.send(
        JSON.stringify({
            type: "session.update",
            session: {
                type: "realtime",
                model: "gpt-realtime",
                audio: {
                    output: { voice: "marin" },
                },
            },
        })
    );
});

Finally, some event names have changed to reflect their new position in the event data model:

  • response.text.delta → response.output_text.delta
  • response.audio.delta → response.output_audio.delta
  • response.audio_transcript.delta → response.output_audio_transcript.delta
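
As a sketch, a server-side handler on the WebSocket connection from earlier would now switch on the new event types (the handler bodies here are illustrative only):

ws.on("message", (raw) => {
    const event = JSON.parse(raw.toString());

    switch (event.type) {
        case "response.output_text.delta": // was response.text.delta
            process.stdout.write(event.delta);
            break;
        case "response.output_audio.delta": // was response.audio.delta
            // event.delta contains a chunk of base64-encoded output audio
            break;
        case "response.output_audio_transcript.delta": // was response.audio_transcript.delta
            process.stdout.write(event.delta);
            break;
    }
});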

For response.output_item, the API has always had both .added and .done events, but for conversation-level items the API previously had only .created, which by convention is emitted at the start, when the item is added.

We have added .added and .done events to give developers better ergonomics when receiving items that need some loading time (such as MCP tool listing, or input audio transcriptions if these were to be modeled as items in the future).

Current event shape when a conversation item is added:

{
    "event_id": "event_1920",
    "type": "conversation.item.created",
    "previous_item_id": "msg_002",
    "item": Item
}

New events to replace the above:

{
    "event_id": "event_1920",
    "type": "conversation.item.added",
    "previous_item_id": "msg_002",
    "item": Item
}
{
    "event_id": "event_1920",
    "type": "conversation.item.done",
    "previous_item_id": "msg_002",
    "item": Item
}

All Items

The Realtime API sets an object=realtime.item parameter on all items in the GA interface.

Function Call Output

status: The Realtime API now accepts a no-op status field on the function call output item parameter. This aligns with the Responses API implementation.
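
As a sketch, a client event that adds a function call output item could include the field like this (the call_id and output values are hypothetical; status is accepted but has no effect):

ws.send(
    JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "function_call_output",
            call_id: "call_123", // hypothetical id from a prior function_call item
            output: JSON.stringify({ temperature_c: 21 }),
            status: "completed", // no-op, accepted for parity with the Responses API
        },
    })
);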

Message

Assistant Message Content

The type properties of output assistant messages now align with the Responses API:

  • type=text → type=output_text (no change to the text field name)
  • type=audio → type=output_audio (no change to the audio field name)
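
Putting these together, an assistant message item in the GA shape might look like the following sketch. The field values are illustrative, both content part types are shown together only for illustration, and the object=realtime.item parameter described above is included:

{
    "object": "realtime.item",
    "type": "message",
    "role": "assistant",
    "content": [
        {
            "type": "output_text",
            "text": "Sure, I can help with that."
        },
        {
            "type": "output_audio",
            "audio": "<base64-encoded audio>",
            "transcript": "Sure, I can help with that."
        }
    ]
}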