Voice UIs - Latency Metrics


I recently built a voice form for halfsies, a Splitwise clone (read more about the lessons and questions that came up at the link below), and decided to write down the latency metrics I was seeing with different LLMs.

Designing Voice Based Forms

Note: This post is just to give you an indicative sense of the user perceived latency for this specific setup (detailed below). These numbers will change if you choose a different framework, STT, TTS, or LLM, and depending on which region your user is in, which region your agent is deployed in, etc.

(Skip to the ‘Latency Metrics’ section below if you want to jump straight into the numbers.)

The setup:

  1. I built the form using the LiveKit framework and used the ‘pipeline’ approach, where the components involved are STT (speech-to-text), LLM, and TTS (text-to-speech); the other option was to use a speech-to-speech model like gpt-realtime (a rough config sketch follows this list)

  2. The STT was the Deepgram Nova-3

  3. I tried it with 2 LLMs: gpt-5-mini and Llama 3.1 8B on Cerebras

  4. The TTS was the Cartesia sonic-3

  5. Silero as the VAD (voice activity detection) model (LiveKit supports this natively)

  6. The LiveKit agent is hosted in us-east-1

  7. The user (me!) is in India
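To make this setup concrete, here is roughly what wiring the pipeline up looks like with the LiveKit Agents Python framework. This is a minimal sketch, not the exact code behind halfsies: plugin constructor arguments and the way model names are passed may differ across framework versions, so treat it as illustrative.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    # STT -> LLM -> TTS pipeline, with Silero VAD deciding when the user has stopped speaking.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-5-mini"),  # swap in a Cerebras-hosted Llama via an OpenAI-compatible endpoint
        tts=cartesia.TTS(model="sonic-3"),
    )
    await session.start(room=ctx.room, agent=Agent(instructions="Help the user fill in the expense form."))


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```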

How this pipeline works:

  1. User starts speaking and LiveKit starts streaming audio into the STT websocket (a sketch of how to log each step’s timing follows this list)

  2. User finishes speaking

  3. EOU detector detects end of turn

  4. LiveKit sends prompt to LLM

  5. LLM returns first token (TTFT)

  6. TTS starts speaking while LLM continues streaming

  7. User hears response
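Each of the numbers discussed in the rest of this post falls out of that flow. As a rough sketch of how you can capture them, LiveKit Agents emits a metrics event per pipeline component; the event name, classes, and fields below are what I recall from the 1.x docs, so verify them against your installed version.

```python
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics


def attach_metrics_logging(session: AgentSession) -> None:
    """Log per-component latency numbers as each turn progresses."""

    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        # EOUMetrics carries the end-of-utterance and transcription delays,
        # LLMMetrics carries ttft, and TTSMetrics carries ttfb.
        metrics.log_metrics(ev.metrics)
```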

In this specific pipeline setup, the following components contribute to user perceived latency:

EOU detection delay (the vad_delay term in the formula below):

  1. The time between the moment the user actually stops speaking (true end of audio) and the moment the EOU model (Silero in this case) decides “the user has stopped, go ahead with the next steps” (a small VAD-tuning sketch follows this block)

  2. Another way to put this is: time from the last audio frame of the user → time when the agent decides the user finished speaking.
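In this setup that delay is largely the trailing-silence window the VAD waits for before declaring the turn over. A hedged tuning sketch (the parameter name below is what I recall from livekit-plugins-silero and may differ in your version):

```python
from livekit.plugins import silero

# Assumed parameter name; verify against your installed livekit-plugins-silero version.
# A shorter silence window lowers the detection delay, but makes the agent more likely
# to jump in during a natural mid-sentence pause.
vad = silero.VAD.load(min_silence_duration=0.55)  # seconds of trailing silence before "end of speech" fires
```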

Transcription delay (the transcription_delay term in the formula below):

  1. As soon as the user speaks, audio is pushed frame-by-frame into the STT websocket (Deepgram in this case) by LiveKit, and it produces partial transcripts immediately.

  2. The STT engine continuously emits: interim/partial transcripts, timing metadata, confidence scores. These come in real time, while the user is still speaking. The STT does NOT wait for EOU to transcribe.

  3. When EOU fires, the agent knows “the user is done speaking”, and LiveKit then waits for the STT to deliver the final full transcript (with punctuation + corrections). Once the final transcript is delivered, the LLM request is triggered.

  4. So STT runs continuously before EOU, and EOU only controls when the final transcript is consumed by the next step in the pipeline: the LLM. Basically, transcription and EOU detection run in parallel.

  5. Why does transcription delay exist if it’s transcribing in real time? STT engines need a small extra delay after speech ends to finalize punctuation, merge partial outputs, and output a final coherent sentence (a small bookkeeping sketch follows this block).
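Concretely, the per-turn bookkeeping looks something like this. The timestamp names are hypothetical, just to illustrate how the two delays relate; in practice the framework reports them for you.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    """Hypothetical per-turn timestamps, all in seconds on the same clock."""

    speech_ended: float            # last audio frame containing user speech
    eou_detected: float            # the VAD/EOU model fires "the user is done"
    final_transcript_ready: float  # STT delivers the final, punctuated transcript


def eou_delay(t: TurnTimestamps) -> float:
    vad_delay = t.eou_detected - t.speech_ended
    transcription_delay = t.final_transcript_ready - t.speech_ended
    # The LLM request can only go out once BOTH have happened, hence the max().
    return max(vad_delay, transcription_delay)
```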

LLM TTFT (time to first token):

  1. The amount of time (in seconds) it took the LLM to generate the first token of the completion; basically the gap between the LLM request being sent and the first token being emitted by the LLM

  2. Why are we measuring this? As described in how this pipeline works above, the TTFT is on the critical path, not the LLM’s total generation time, because the TTS starts speaking while the LLM is still streaming (a standalone measurement sketch follows below).
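The pipeline reports TTFT for you, but it is easy to sanity-check a provider on its own. A minimal standalone sketch using the OpenAI Python client (for the Cerebras-hosted Llama you would point the client at the provider’s OpenAI-compatible endpoint); note this measures model time plus network round-trip from the client’s side:

```python
import time

from openai import OpenAI

client = OpenAI()  # for Cerebras/Llama, pass base_url= and api_key= for its OpenAI-compatible endpoint


def measure_ttft(prompt: str, model: str = "gpt-5-mini") -> float:
    """Seconds from sending the streaming request to receiving the first chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _ in stream:  # the first iteration yields the first streamed chunk
        return time.perf_counter() - start
    return float("nan")
```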

TTS TTFB (time to first byte):

  1. We measure TTFB for exactly the same reason we measure TTFT: in a real-time, voice-streaming agent, only the time until the first audio byte matters for user-perceived responsiveness, not the total TTS generation time.

  2. TTFB in TTS = the time from sending text → until the TTS service produces the first chunk of audio (a generic measurement sketch follows below).
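The framework reports TTFB as well, but the measurement itself is just a timer from “text sent” to “first audio chunk received”. A generic sketch; the Cartesia-specific client call is deliberately left as a placeholder rather than guessed:

```python
import time
from typing import AsyncIterator, Callable


async def measure_ttfb(start_stream: Callable[[], AsyncIterator[bytes]]) -> float:
    """Seconds from issuing the TTS request to the first audio chunk arriving.

    `start_stream` should send the text to your TTS service (Cartesia Sonic here)
    and return its streaming audio response; that client-specific call is not shown.
    """
    start = time.perf_counter()
    async for _chunk in start_stream():
        return time.perf_counter() - start
    return float("nan")
```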

Putting this all together:

eou_delay = max(vad_delay, transcription_delay) (since they happen in parallel)

User Perceived Latency = eou_delay + llm.ttft + tts.ttfb
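For example, plugging in turn 1 of the first conversation below:

\(0.5667 + 0.5410 + 0.2355 = 1.3432\) seconds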

Latency Metrics

Now the data:

LLM = gpt-5-mini:

Sample audio:

\(\begin{array}{c|c|c|c|c} \textbf{Turn} & \textbf{EOU Delay} & \textbf{LLM TTFT} & \textbf{TTS TTFB} & \textbf{User Perceived Latency} \\ \hline 1 & 0.5667 & 0.5410 & 0.2355 & 1.3432 \\ 2 & 0.5793 & 0.6484 & 0.2156 & 1.4433 \\ & & & & \\ 1 & 0.5673 & 1.0730 & 0.2311 & 1.8714 \\ 2 & 0.5821 & 0.7939 & 0.2444 & 1.6204 \\ 3 & 0.5852 & 0.6411 & 0.2282 & 1.4545 \\ & & & & \\ 1 & 0.5689 & 0.6336 & 0.2280 & 1.4305 \\ 2 & 0.5759 & 0.8004 & 0.2195 & 1.5958 \\ & & & & \\ 1 & 0.5671 & 0.5722 & 0.2298 & 1.3691 \\ 2 & 0.5763 & 0.8044 & 0.2332 & 1.6139 \\ & & & & \\ 1 & 0.5690 & 0.4727 & 0.2170 & 1.2587 \\ 2 & 0.5778 & 0.5990 & 0.2351 & 1.4119 \\ 3 & 0.6123 & 0.6435 & 0.2233 & 1.4791 \\ \end{array}\)

Average = 1.49s (all data in seconds)

LLM = Llama 3.1 8b on Cerebras (all other components in pipeline remain the same):

Sample audio:

\(\begin{array}{c|c|c|c|c} \textbf{Turn} & \textbf{EOU Delay} & \textbf{LLM TTFT} & \textbf{TTS TTFB} & \textbf{User Perceived Latency} \\ \hline 1 & 0.5750 & 0.2658 & 0.2092 & 1.0500 \\ 2 & 0.5836 & 0.4803 & 0.2165 & 1.2804 \\ 3 & 0.5944 & 0.2754 & 0.2221 & 1.0919 \\ & & & & \\ 1 & 0.5737 & 0.3478 & 0.2186 & 1.1401 \\ 2 & 0.5872 & 0.2801 & 0.2272 & 1.0945 \\ 3 & 0.5906 & 0.2747 & 0.2366 & 1.1019 \\ 4 & 0.5323 & 0.2712 & 0.2158 & 1.0193 \\ 5 & 0.5845 & 0.3067 & 0.2504 & 1.1415 \\ & & & & \\ 1 & 0.5654 & 0.2649 & 0.2120 & 1.0422 \\ 2 & 0.5752 & 0.3735 & 0.2125 & 1.1612 \\ 3 & 0.5331 & 0.2773 & 0.2120 & 1.0225 \\ \end{array}\)

Average = 1.10s (all data in seconds)

As you can see, the average latency of 1.10s with Llama 3.1 on Cerebras is about 26% lower than the 1.49s with gpt-5-mini. Notice that the LLM TTFT for gpt-5-mini is mostly in the 500-800ms range while for Llama it is mostly in the 250-350ms range. That’s a big difference! I actually decided to use Cerebras because the latency with gpt-5-mini was so high that the conversation didn’t feel natural at all!
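(That 26% is just \((1.49 - 1.10) / 1.49 \approx 0.26\).)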

BUT - if you listen to both conversations, you will notice that there is an (uncomfortable) gap while waiting for the agent to respond - in both cases. So, even with the performance gains from Cerebras + Llama, it still feels…off!

The generally accepted benchmark is that for a conversation to feel natural, the latency should be within 500ms; both of these choices fall short of that target. While latency isn’t the only metric you would compare when choosing between 2 LLMs for your voice experience, it is one of the most important inputs in making that choice.

I eventually decided to move to gpt-realtime since my use case is simple and the experience felt more “natural” - I’ll post about that soon!
