🦜Parakeet-unified-en-0.6b: Unified ASR model for offline and streaming inference
Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer (RNN-T) architecture that combines offline and streaming inference (with latency as low as 160ms) in a single model. It is trained mostly on the English portion of the Granary dataset [3], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech using the English alphabet, spaces, and apostrophes, with punctuation and capitalization support.
Why Choose nvidia/parakeet-unified-en-0.6b?
- One model for both tasks: A single unified model handles both offline and streaming inference, with latency as low as 160ms.
- Better accuracy: The unified model achieves better accuracy on the HF ASR Leaderboard datasets than the previous transducer-based offline-only and streaming-only models.
- Streaming chunk size flexibility: Lets you choose the optimal streaming latency (chunk + right context) from 2080ms down to 160ms in 80ms steps.
- Punctuation & Capitalization: Built-in support for punctuation and capitalization in the output text.
This model consists of a 🦜 Parakeet (FastConformer) encoder (jointly trained in offline and streaming modes) with an RNN-T decoder. It is designed for offline and streaming speech-to-text applications where latency can be as low as 160ms, such as voice assistants, live captioning, and conversational AI systems. The current inference pipeline supports only buffered streaming (the left context is recomputed for each chunk), which can be slower than cache-aware streaming.
This model is ready for commercial/non-commercial use.
License/Terms of Use:
Governing Terms: Use of the model is governed by the NVIDIA Open Model License Agreement.
Deployment Geography:
Global
Use Case:
This model is for transcription of English audio in offline and streaming modes.
Release Date:
- Hugging Face [04/07/2026] via https://huggingface.co/nvidia/parakeet-unified-en-0.6b
Model Architecture
Architecture Type: Unified-FastConformer-RNNT
The model is based on the FastConformer encoder architecture [1] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In the offline mode we used standard offline training with full-context self-attention and non-causal convolutions. In the streaming mode we applied chunked self-attention masks (including left, middle/chunk, and right context) together with Dynamic Chunk Convolutions inside each FastConformer layer [2] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All model parameters are shared between the offline and streaming modes (encoder, predictor, and joint networks), including the initial x8 subsampling with non-causal convolutions.
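To make the chunked self-attention masking concrete, here is an illustrative sketch (not the NeMo implementation) of how a mask with left, chunk (middle), and right context can be built over encoder frames; the frame counts are hypothetical examples:

```python
# Illustrative sketch (not the NeMo implementation): a chunked self-attention
# mask with left, chunk (middle), and right context over encoder frames.

def chunked_attention_mask(num_frames, left, chunk, right):
    """Return a boolean mask where mask[i][j] is True if frame i may attend to frame j."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        chunk_id = i // chunk
        start = max(0, chunk_id * chunk - left)                # left context (in frames)
        end = min(num_frames, (chunk_id + 1) * chunk + right)  # chunk + right context
        for j in range(start, end):
            mask[i][j] = True
    return mask

mask = chunked_attention_mask(num_frames=8, left=2, chunk=2, right=2)
# Frames in the first chunk see their own chunk plus 2 frames of right context.
print([j for j in range(8) if mask[0][j]])  # [0, 1, 2, 3]
```

During training, every frame within a chunk shares the same visible window, which is what lets the same encoder run with full context in offline mode and restricted context in streaming mode.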
The paper with the details of the model architecture and training will be released soon.
Network Architecture:
- Encoder: Unified FastConformer with 24 layers
- Decoder: RNNT (Recurrent Neural Network Transducer)
- Parameters: 600M
NVIDIA NeMo
How to Use this Model
For now, we provide only inference support for the unified model. We will release the unified training pipeline soon.
Loading the Model
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-unified-en-0.6b")
```
Offline Inference
```python
output = asr_model.transcribe([wav_file_path])
print(output[0].text)
```
Streaming Inference
For streaming inference you can use the stateful chunked RNN-T decoding script from NeMo: `examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py`
```bash
cd NeMo
# left_context_secs: left context in seconds (5.6s by default)
# chunk_secs: chunk size in seconds (0.56s by default)
# right_context_secs: right context in seconds (0.56s by default)
# att_context_size_as_chunk=true enables chunked self-attention masks
python examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py \
    model_path=<model_path> \
    dataset_manifest=<dataset_manifest> \
    output_filename=<output_json_file> \
    left_context_secs=<left_context_secs> \
    chunk_secs=<chunk_secs> \
    right_context_secs=<right_context_secs> \
    att_context_size_as_chunk=true \
    batch_size=<batch_size>
```
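With x8 subsampling of 10ms features, the encoder produces one frame per 80ms, which is why the chunk and context sizes should be multiples of 0.08s. A small hypothetical helper (not part of NeMo) to convert and validate these values:

```python
# Hypothetical helper, not part of NeMo: convert context sizes in seconds to
# encoder frames, assuming 10 ms features with x8 subsampling (80 ms per frame).
ENCODER_FRAME_SECS = 0.08

def secs_to_encoder_frames(secs, name="value"):
    """Convert a context size in seconds to encoder frames, requiring an 80 ms multiple."""
    frames = round(secs / ENCODER_FRAME_SECS)
    if abs(frames * ENCODER_FRAME_SECS - secs) > 1e-9:
        raise ValueError(f"{name}={secs}s is not a multiple of {ENCODER_FRAME_SECS}s")
    return frames

print(secs_to_encoder_frames(5.6))   # 70 left-context frames
print(secs_to_encoder_frames(0.56))  # 7 chunk frames
```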
You can also run streaming inference through the pipeline method, which uses the NeMo/examples/asr/conf/asr_streaming_inference/buffered_rnnt.yaml configuration file to build end-to-end workflows with punctuation and capitalization (PnC), inverse text normalization (ITN), and translation support.
```python
from omegaconf import OmegaConf

from nemo.collections.asr.inference.factory.pipeline_builder import PipelineBuilder

# Path to the buffered RNN-T config file (buffered_rnnt.yaml from the NeMo repo)
cfg_path = 'buffered_rnnt.yaml'
cfg = OmegaConf.load(cfg_path)

# Paths of all the audio files to run inference on
audios = ['/path/to/your/audio.wav']

# Create the pipeline object and run inference
pipeline = PipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audios)

# Print the output
for entry in output:
    print(entry['text'])
```
Setting up Streaming Configuration
Latency is defined as the sum of the chunk size (middle context) and the right context. For the left context we use 5.6s by default (the value used during model training), but you can tune it for a better accuracy/speed trade-off.
We recommend the following context parameters for different latencies:
| Left, s | Chunk, s | Right, s | Latency (C+R), s |
|---|---|---|---|
| 5.6 | 1.04 | 1.04 | 2.08 |
| 5.6 | 0.56 | 0.56 | 1.12 |
| 5.6 | 0.16 | 0.40 | 0.56 |
| 5.6 | 0.08 | 0.24 | 0.32 |
| 5.6 | 0.08 | 0.16 | 0.24 |
| 5.6 | 0.08 | 0.08 | 0.16 |
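The table above can be captured as a small lookup, handy when wiring the streaming script's arguments; the dictionary below is an illustrative convenience, not a NeMo API, and uses the parameter names of the streaming inference script:

```python
# Recommended streaming contexts from the table above, keyed by target latency
# (chunk + right context) in seconds. Left context is 5.6 s in every row.
# Illustrative helper, not a NeMo API.
RECOMMENDED_CONFIGS = {
    2.08: {"left_context_secs": 5.6, "chunk_secs": 1.04, "right_context_secs": 1.04},
    1.12: {"left_context_secs": 5.6, "chunk_secs": 0.56, "right_context_secs": 0.56},
    0.56: {"left_context_secs": 5.6, "chunk_secs": 0.16, "right_context_secs": 0.40},
    0.32: {"left_context_secs": 5.6, "chunk_secs": 0.08, "right_context_secs": 0.24},
    0.24: {"left_context_secs": 5.6, "chunk_secs": 0.08, "right_context_secs": 0.16},
    0.16: {"left_context_secs": 5.6, "chunk_secs": 0.08, "right_context_secs": 0.08},
}

def config_for_latency(latency_secs):
    cfg = RECOMMENDED_CONFIGS[latency_secs]
    # Sanity check: latency is defined as chunk + right context.
    assert abs(cfg["chunk_secs"] + cfg["right_context_secs"] - latency_secs) < 1e-9
    return cfg

print(config_for_latency(0.56))
```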
Input
- Input Type(s): Audio
- Input Format(s): wav
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Maximum audio length depends on available GPU memory; no pre-processing needed; mono channel required. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Output
- Output Type(s): Text String in English
- Output Format(s): String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: No maximum character length; output includes punctuation and capitalization. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Datasets
Training Datasets
The majority of the training data comes from the English portion of the Granary dataset [3]:
- YouTube-Commons (YTC) (109.5k hours)
- YODAS2 (102k hours)
- Mosel (14k hours)
- LibriLight (49.5k hours)
In addition, the following datasets were used:
- Librispeech 960 hours
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN)
- Mozilla Common Voice (v11.0)
- Mozilla Common Voice (v7.0)
- Mozilla Common Voice (v4.0)
- People Speech
- AMI
Data Modality: Audio and text
Audio Training Data Size: 530k hours
Data Collection Method: Human - All audio is human-recorded
Labeling Method: Hybrid (Human, Synthetic) - Some transcripts are generated by ASR models, while some are manually labeled
Evaluation Datasets
The model was evaluated on the HuggingFace ASR Leaderboard datasets:
- AMI
- Earnings22
- Gigaspeech
- LibriSpeech test-clean
- LibriSpeech test-other
- SPGI Speech
- TEDLIUM
- VoxPopuli
Performance
ASR Performance (w/o PnC)
ASR performance is measured using Word Error Rate (WER). Both ground-truth and predicted texts are processed using whisper-normalizer version 0.1.12. Results for other models may differ slightly from their official HF model cards because of differences in evaluation hardware.
The following table shows WER on the HuggingFace OpenASR leaderboard datasets for offline inference and for streaming inference at different latency values:
| Model setup | Offline | 2.08s | 1.12s | 0.56s | 0.40s | 0.32s | 0.24s | 0.16s | 0.08s |
|---|---|---|---|---|---|---|---|---|---|
| nvidia/parakeet-tdt-0.6b-v2 | 6.04 | 7.99 | 22.83 | 69.55 | 95.12 | — | — | — | — |
| nvidia/nemotron-speech-streaming-en-0.6b | 6.92 | 7.46 | 6.92 | 7.09 | 9.52 | 7.64 | 8.01 | 7.84 | 8.70 |
| nvidia/parakeet-unified-en-0.6b | 5.91 | 6.14 | 6.29 | 6.52 | 6.70 | 6.92 | 7.35 | 8.44 | 15.63 |
The Parakeet-unified-en-0.6b model outperforms previous NVIDIA transducer-based models in offline inference and in streaming inference down to 240ms latency. At 160ms latency, the unified model starts to degrade because of the absence of sufficient right context, falling slightly behind the strong streaming baseline. For 80ms latency we recommend using the nemotron-speech-streaming-en-0.6b model instead.
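As a rough sketch of how the WER numbers above are computed: texts are normalized, then word-level edit distance is divided by the reference length. The leaderboard uses whisper-normalizer 0.1.12; the `normalize` function below is a simplified stand-in (lowercasing and punctuation stripping) for illustration only:

```python
import string

def normalize(text):
    # Simplified stand-in for whisper-normalizer: lowercase and strip punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("Hello, world!", "hello word"))  # 0.5 (one substitution over two words)
```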
Software Integration
Runtime Engine: NeMo 2.7.3
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Volta
Test Hardware:
- NVIDIA V100
- NVIDIA A100
- NVIDIA A6000
- DGX Spark
Preferred/Supported Operating System(s): Linux
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
References
[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR
[3] NVIDIA Granary