Grok Speech to Text and Text to Speech APIs


Today, we are excited to announce two powerful standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support.

These standalone endpoints make it straightforward for developers to integrate high-quality speech features into any application, whether you're creating voice agents, real-time transcription tools, accessibility solutions, podcasts, or interactive audio experiences.

Speech to Text

High accuracy, low latency.

  • Generate transcripts from large audio files in milliseconds via our REST API
  • Transcribe speech in real time with our lowest-latency WebSocket API
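For the real-time path, a client typically sends audio in small fixed-duration frames rather than as one large upload. The sketch below shows only that framing step. The 16 kHz sample rate, 16-bit mono PCM format, and 100 ms frame size are assumptions chosen for illustration, not documented requirements of the Grok WebSocket API, and the helper name `pcm_frames` is hypothetical.

```python
from typing import Iterator


def pcm_frames(audio: bytes, sample_rate: int = 16_000,
               bytes_per_sample: int = 2, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration frames of mono PCM audio (the last one may be shorter)."""
    # Bytes per frame = samples per second * bytes per sample * frame duration.
    frame_bytes = sample_rate * bytes_per_sample * frame_ms // 1000
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]


# One second of 16 kHz, 16-bit silence (assumed format, see above).
one_second = bytes(16_000 * 2)
frames = list(pcm_frames(one_second))
print(len(frames), len(frames[0]))  # 10 3200
```

Each frame would then be sent as one binary WebSocket message. Smaller frames lower latency at the cost of more messages.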

We’ve added powerful features like word-level timestamps, speaker diarization, and multichannel support. The API also includes intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more.
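One common use of word-level timestamps is generating captions. The following is a minimal sketch that groups timed words into SubRip (SRT) cues. The response shape, a list of words with `text`, `start`, and `end` fields in seconds, is an assumption made for illustration and may not match the actual API response.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = srt_time(group[0]["start"])
        end = srt_time(group[-1]["end"])
        text = " ".join(w["text"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level output; the field names are assumptions.
words = [
    {"text": "Thank", "start": 0.0, "end": 0.3},
    {"text": "you", "start": 0.3, "end": 0.7},
    {"text": "for", "start": 0.7, "end": 1.1},
]
print(words_to_srt(words, max_words=2))
```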

Grok Speech to Text:

Thank you for holding, Anghared Llewelyn Bowen. I see here your mortgage rate lock is set at 3.75% and is valid until March 10th, 2024. Oisin MacGiolla Phadraigh, once we receive your signed documents by February 15th, we can aim for a closing date on March 20th. If you have any concerns, please feel free to email me at a.bowen@bestbank.com.

Other Models:

Thank you for holding, Anherd Lualin Bowen. I see here your mortgage rate lock is set at 3.75% and is valid until 03/10/2024. Oysen Magilla Fadrig, once we receive your signed documents by February, 15, we can aim for a closing date on March 20. If you have any concerns, please feel free to email me at a dot bowen at bestbank dot com.

Pricing

We keep pricing straightforward and predictable: Speech to Text is $0.10 per hour for batch and $0.20 per hour for streaming. Full details and current rate limits are available in the xAI API console.
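As a quick worked example of the rates above, the sketch below estimates a transcription bill. It assumes usage is prorated by the second and rounded to the nearest cent, which is an assumption and not something stated here. The function name `stt_cost` is hypothetical.

```python
from decimal import Decimal

# Rates quoted above, in US dollars per hour of audio.
STT_RATE_PER_HOUR = {"batch": Decimal("0.10"), "streaming": Decimal("0.20")}


def stt_cost(audio_seconds: float, mode: str) -> Decimal:
    """Estimate cost in dollars, assuming per-second proration (an assumption)."""
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * STT_RATE_PER_HOUR[mode]).quantize(Decimal("0.01"))


print(stt_cost(500 * 3600, "batch"))      # 500 hours of batch audio: 50.00
print(stt_cost(500 * 3600, "streaming"))  # 500 hours of streaming audio: 100.00
```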

[Charts: cost per hour for batch and streaming transcription]

Enterprise-Grade Transcription

Grok STT is evaluated against the top commercial models on phone calls, meetings, video/podcasts, and telephony. It excels at entity recognition and in business domains such as medical, legal, and financial.

Domain (Word Error Rate)   Grok STT   ElevenLabs   Deepgram   AssemblyAI
Phone Call Entities        5.0%       12.0%        13.5%      21.3%
Video/Podcasts             2.4%       2.4%         3.0%       3.2%
Meetings                   10.9%      12.2%        16.3%      15.7%
Telephone                  9.3%       9.4%         11.0%      11.2%
Overall                    6.9%       9.0%         11.0%      12.9%
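For readers unfamiliar with the metric, word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. The sketch below is a standard edit-distance implementation. The text normalization used here (lowercasing and whitespace splitting) is an assumption; the exact evaluation procedure behind the figures above is not described.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```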

Most transcription models give you raw spoken words. Grok Speech to Text goes further.

When you enable formatting, the API performs advanced Inverse Text Normalization that intelligently converts spoken language into proper structured output:

My name is John Smith and my phone number is 4145551234. I saw a transaction for 6.99 on my account.
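To make the idea concrete, here is a deliberately simple toy that handles just one Inverse Text Normalization case: collapsing runs of spoken digit words into a digit string. It is an illustration of the concept only and is not how Grok Speech to Text implements formatting. A production system also has to handle dates, currencies, and context (for example, "one of them" should not become "1 of them").

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize_digit_runs(raw: str) -> str:
    """Replace each run of consecutive spoken digit words with a digit string."""
    out: list[str] = []
    run = ""
    for word in raw.split():
        if word in DIGIT_WORDS:
            run += DIGIT_WORDS[word]
        else:
            if run:  # a digit run just ended, so emit it before the next word
                out.append(run)
                run = ""
            out.append(word)
    if run:
        out.append(run)
    return " ".join(out)


raw = "my phone number is four one four five five five one two three four"
print(normalize_digit_runs(raw))  # my phone number is 4145551234
```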

Multilingual fluency

The Grok Speech to Text API offers strong multilingual support across 25+ languages, so you can switch languages seamlessly.

Multichannel & Diarization (Speaker Identification)

Transcribe multichannel audio files for perfect speaker separation with the same API.

Detect speakers in both pre-recorded and real-time streaming with word-level speaker IDs using Diarization.

Hello thanks for calling how can I help you today?

I just signed up for an account and cannot login.

I am sorry to hear that, what is your email address so I can check on that for you?

It's john.smith@gmail.com

Thanks and can you confirm your date of birth so I can validate the account please?

Sure, it's March 16th 1985
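Word-level speaker IDs can be folded back into conversational turns like the ones above. The sketch below assumes each word arrives as a dictionary with `text` and `speaker` fields. Those field names and the integer speaker IDs are hypothetical, chosen for illustration rather than taken from the API reference.

```python
from itertools import groupby


def speaker_turns(words: list[dict]) -> list[tuple[int, str]]:
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    return [
        (speaker, " ".join(w["text"] for w in group))
        for speaker, group in groupby(words, key=lambda w: w["speaker"])
    ]


# Hypothetical diarized output; the structure is an assumption.
words = [
    {"text": "How", "speaker": 0},
    {"text": "can", "speaker": 0},
    {"text": "I", "speaker": 0},
    {"text": "help?", "speaker": 0},
    {"text": "I", "speaker": 1},
    {"text": "cannot", "speaker": 1},
    {"text": "login.", "speaker": 1},
    {"text": "Sorry", "speaker": 0},
]
for speaker, text in speaker_turns(words):
    print(f"Speaker {speaker}: {text}")
```

Because `groupby` only merges adjacent items, a speaker who talks again later starts a new turn, which preserves the order of the conversation.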

Text to Speech

Fast, natural, and expressive voices with Speech Tags.

Fine-Grained Control

Add natural prosody and emotion using simple inline and wrapping speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>, and many more. These controls let you create engaging, lifelike delivery without complex markup.

Have you heard the new Grok Voice?

[whisper] Let me tell you a secret... I am the smartest and best AI.

[laugh] Give it a go! Ask me anything.

I'll be your trusted personal assistant and closest companion.
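When scripts are assembled programmatically, small helpers keep the tag syntax consistent. The sketch below builds tagged text using the tag names listed above. The bracket form for inline tags follows this article, while the closing-tag form for wrapping tags (`</emphasis>`) is an assumption. The helpers `inline` and `wrap` are hypothetical and not part of any SDK.

```python
def inline(tag: str) -> str:
    """Render an inline speech tag such as [laugh] or [sigh]."""
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a speech tag; the closing-tag syntax is an assumption."""
    return f"<{tag}>{text}</{tag}>"


script = " ".join([
    "Have you heard the new Grok Voice?",
    inline("laugh"),
    "Give it a go!",
    wrap("emphasis", "Ask me anything."),
])
print(script)
```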

Pricing

Text to Speech is priced at $15.00 per 1 million characters, with straightforward usage-based billing and no hidden fees.
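As a rough budgeting aid, the rate above translates directly into a per-request estimate. The sketch assumes every character in the input counts toward billing, including whitespace and speech tags. That is an assumption, since the billing rules for tags are not stated here. The function name `tts_cost` is hypothetical.

```python
from decimal import Decimal

# Rate quoted above, in US dollars per 1 million characters.
TTS_RATE_PER_MILLION_CHARS = Decimal("15.00")


def tts_cost(text: str) -> Decimal:
    """Estimate cost in dollars, assuming all characters are billable."""
    return Decimal(len(text)) * TTS_RATE_PER_MILLION_CHARS / Decimal(1_000_000)


# A 2,000-character script costs about three cents under these assumptions.
print(tts_cost("a" * 2_000))
```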

[Chart: cost per million characters]