Using the OpenAI Realtime API in Python

I didn’t find any code in Python to use the Realtime API, so I implemented an example myself. I’m in Europe, so I don’t have access to the ChatGPT advanced voice mode yet, but wanted to try it out.

It turns out there are quite a few pitfalls but the code works very well now.

Note: both Claude.ai (Sonnet 3.5) and ChatGPT (4o-preview) were unable to implement it. I gave them the documentation as the API is recent, but even after multiple requests for corrections, the code was still not working, and I had to write it myself. The async code is complex for the AIs, and they will often block the event loop.

Pitfalls

If you decide to implement this yourself, here are a few things to think about:

Record and play with the correct format: 24 kHz, 16-bit samples, passed as byte strings.
Jerky audio: Record continuously and send data in parallel. If you alternate, the recording will be jerky. It is the same for receiving and playing.
Echo: If you play the result on your speakers, the microphone will pick up ChatGPT's voice and ChatGPT will interrupt itself. I used a headset but plan to implement echo suppression.
The Realtime API is very expensive. I spent around $60 implementing and testing this.
There is no good existing library for async audio in Python.
OpenAI documentation could be better.

Code And Explanation

Async Audio

Since I couldn’t find a good async audio library in Python, the first part involves implementing async recording and playing. You will also need to be able to cancel playback: when the AI is interrupted, we need to stop the chunk of audio that is currently playing.

We can run playback in an asyncio task, call the task's cancel() method, and catch the CancelledError exception inside the task.
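To illustrate the cancellation pattern on its own, here is a minimal sketch that uses no audio device at all. The names play_chunk and demo are hypothetical, and asyncio.sleep stands in for the actual playback.

```python
import asyncio


async def play_chunk(duration: float, log: list) -> None:
    """Stand-in for playing one audio chunk; sleeps instead of using a device."""
    log.append("started")
    try:
        await asyncio.sleep(duration)
        log.append("finished")
    except asyncio.CancelledError:
        # A real player would stop the output device here.
        log.append("stopped")
        raise  # re-raise so the task is reported as cancelled


async def demo() -> list:
    log = []
    task = asyncio.create_task(play_chunk(10.0, log))
    await asyncio.sleep(0.01)  # give the task time to start "playing"
    task.cancel()              # e.g. the user interrupted the AI
    try:
        await task
    except asyncio.CancelledError:
        pass                   # expected: we cancelled it ourselves
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Re-raising CancelledError after the cleanup keeps the task's cancelled state visible to whoever awaits it.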

For both input and output, the Realtime API requires audio at a 24 kHz sample rate with 16-bit samples, passed as byte strings.
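As a rough illustration of that format, the sketch below converts floating-point samples into 16-bit PCM byte strings using only the standard library. It assumes mono audio and little-endian byte order, neither of which is stated above, and the helper names are hypothetical.

```python
import math
import struct

SAMPLE_RATE = 24_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit samples


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM (assumed little-endian, mono)."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]  # clip, then scale
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_duration(pcm: bytes) -> float:
    """Duration in seconds of a mono PCM byte string at SAMPLE_RATE."""
    return len(pcm) / (SAMPLE_RATE * BYTES_PER_SAMPLE)


def sine(freq: float, seconds: float) -> list:
    """Generate a test tone as a list of floats."""
    count = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(count)]


if __name__ == "__main__":
    tone = floats_to_pcm16(sine(440.0, 0.5))
    print(len(tone), "bytes,", pcm16_duration(tone), "seconds")
```

Half a second of audio in this format is 12,000 samples, so 24,000 bytes.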

I decided to use the “sounddevice” library, which can stop audio while it is playing.
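For readers who want a starting point, here is a minimal sketch of how async playing and recording could be built on sounddevice. It is not the original implementation. It assumes mono audio and 100 ms recording blocks, the names play_audio, record_audio and pcm16_seconds are hypothetical, and sleeping for the chunk duration is a simplification that may clip the last few milliseconds.

```python
import asyncio

SAMPLE_RATE = 24_000   # Hz
CHANNELS = 1           # assumption: mono
BLOCK_FRAMES = 2_400   # assumption: 100 ms of audio per recorded block


def pcm16_seconds(pcm: bytes) -> float:
    """Duration of 16-bit PCM data at SAMPLE_RATE."""
    return len(pcm) / (2 * CHANNELS * SAMPLE_RATE)


async def play_audio(pcm: bytes) -> None:
    """Play PCM bytes; cancelling the surrounding task stops playback."""
    import numpy as np        # imported lazily so the helper above works without audio hardware
    import sounddevice as sd

    samples = np.frombuffer(pcm, dtype=np.int16)
    sd.play(samples, samplerate=SAMPLE_RATE)      # returns immediately
    try:
        await asyncio.sleep(pcm16_seconds(pcm))   # keep the event loop free meanwhile
    finally:
        sd.stop()   # also runs on CancelledError, cutting the audio short


async def record_audio(queue: asyncio.Queue) -> None:
    """Push recorded blocks onto the queue until the task is cancelled."""
    import sounddevice as sd

    loop = asyncio.get_running_loop()

    def callback(indata, frames, time, status):
        # Runs in the audio thread, so hand the bytes to the event loop thread-safely.
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=BLOCK_FRAMES,
                           channels=CHANNELS, dtype="int16", callback=callback):
        await asyncio.Event().wait()   # never set: wait here until cancelled
```

Recording through a stream callback keeps the microphone running continuously, so nothing is lost while other tasks send or play audio.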

Below is the code that I used:

Realtime Chat

Now that we have the async audio, we can implement the websocket part. We will use the websockets Python library.

We need to handle the following two events:

‘response.audio.delta’: to append to the audio that will be played.
‘input_audio_buffer.speech_stopped’: to cancel the audio being played.
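A minimal sketch of how these two events might be dispatched is shown below. It assumes each websocket message is a JSON object with a "type" field and that the audio arrives base64-encoded in a "delta" field. The function name handle_event and its parameters are hypothetical.

```python
import asyncio
import base64
import json


def handle_event(message: str, playback: asyncio.Queue, interrupt: asyncio.Event) -> None:
    """Route one server message to the playback queue or the interrupt flag."""
    event = json.loads(message)
    kind = event.get("type")
    if kind == "response.audio.delta":
        # Assumption: the audio chunk is base64 text in the "delta" field.
        playback.put_nowait(base64.b64decode(event["delta"]))
    elif kind == "input_audio_buffer.speech_stopped":
        interrupt.set()   # tells the player to cancel the current audio
    # All other event types are ignored in this sketch.
```

Keeping the handler synchronous and non-blocking means the receiving task can call it for every message without stalling the event loop.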

We use four parallel tasks: one for recording, one for playing, one for receiving, and one for sending. This avoids any jerkiness.

Then we have two asyncio.Queues, one for outgoing audio and one for incoming audio. We also need an asyncio.Event to cancel the audio that is currently playing.
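The layout of four tasks, two queues, and one event could look roughly like the sketch below. It is not the original code. It assumes the connection object offers "await ws.send(text)" and "async for message in ws", that outgoing audio is sent as base64 in an "input_audio_buffer.append" event, and that incoming audio sits in a "delta" field. FakeSocket, the stub recorder, and the stub player are hypothetical stand-ins so the structure can run without a microphone or network.

```python
import asyncio
import base64
import json


async def send_loop(ws, outgoing: asyncio.Queue) -> None:
    while True:
        chunk = await outgoing.get()
        await ws.send(json.dumps({"type": "input_audio_buffer.append",  # assumed event
                                  "audio": base64.b64encode(chunk).decode("ascii")}))


async def receive_loop(ws, incoming: asyncio.Queue, interrupt: asyncio.Event) -> None:
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "response.audio.delta":
            await incoming.put(base64.b64decode(event["delta"]))
        elif event["type"] == "input_audio_buffer.speech_stopped":
            interrupt.set()


async def play_loop(incoming, interrupt, play_chunk) -> None:
    while True:
        chunk = await incoming.get()
        playing = asyncio.create_task(play_chunk(chunk))
        stop = asyncio.create_task(interrupt.wait())
        try:
            await asyncio.wait({playing, stop}, return_when=asyncio.FIRST_COMPLETED)
        finally:
            playing.cancel()   # no-op if the chunk finished normally
            stop.cancel()
        if interrupt.is_set():
            interrupt.clear()
            while not incoming.empty():   # drop audio that is now stale
                incoming.get_nowait()


async def run_chat(ws, record_loop, play_chunk, seconds: float) -> None:
    outgoing, incoming = asyncio.Queue(), asyncio.Queue()
    interrupt = asyncio.Event()
    tasks = [asyncio.create_task(record_loop(outgoing)),
             asyncio.create_task(send_loop(ws, outgoing)),
             asyncio.create_task(receive_loop(ws, incoming, interrupt)),
             asyncio.create_task(play_loop(incoming, interrupt, play_chunk))]
    await asyncio.sleep(seconds)   # a real program would run until the user quits
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


class FakeSocket:
    """Hypothetical test double standing in for a websocket connection."""
    def __init__(self, messages):
        self.sent, self._messages = [], messages

    async def send(self, message):
        self.sent.append(message)

    async def __aiter__(self):
        for message in self._messages:
            yield message


async def demo():
    played = []

    async def fake_play(chunk):        # stub player: just remembers the chunk
        played.append(chunk)

    async def fake_record(outgoing):   # stub microphone: three tiny blocks
        for _ in range(3):
            await outgoing.put(b"\x01\x00")

    delta = json.dumps({"type": "response.audio.delta",
                        "delta": base64.b64encode(b"abcd").decode("ascii")})
    ws = FakeSocket([delta])
    await run_chat(ws, fake_record, fake_play, 0.05)
    return ws.sent, played
```

Because each loop only awaits its own queue or socket, a slow player never delays recording or sending, which is the point of splitting the work into four tasks.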

See the resulting code below:

Full code

See below for the complete example.

Note: If you try it, don’t forget to use a headset, or ChatGPT will interrupt itself: I haven’t implemented echo suppression.