GitHub - egorsmkv/speech-recognition-uk: 🇺🇦 Speech Recognition & Synthesis for Ukrainian

`Squeezeformer`

Model	WER	CER	Accuracy (words)
theodotus/stt_uk_squeezeformer_ctc_xs	10.78%	2.29%	89.22%
theodotus/stt_uk_squeezeformer_ctc_sm	8.2%	1.75%	91.8%
theodotus/stt_uk_squeezeformer_ctc_ml	5.91%	1.26%	94.09%

`Conformer-CTC`

Model	WER	CER	Accuracy (words)
taras-sereda/uk-pods-conformer	6.75%	1.41%	93.25%

`Whisper`

Model	WER	CER	Accuracy (words)
tiny	63.08%	18.59%	36.92%
base	52.1%	14.08%	47.9%
small	30.57%	7.64%	69.43%
medium	18.73%	4.4%	81.27%
large (v1)	16.42%	3.93%	83.58%
large (v2)	13.72%	3.18%	86.28%
large (v3)	20.53%	5.28%	79.47%
turbo	22.83%	7.05%	77.17%

Quantized versions:

Model	WER	CER	Accuracy (words)
Yehor/whisper-large-v2-quantized-uk	14.95%	4.23%	85.05%
Yehor/whisper-large-v3-turbo-quantized-uk	12.75%	3.25%	87.25%
efficient-speech/lite-whisper-large-v3-turbo	42.89%	12.59%	57.11%
efficient-speech/lite-whisper-large-v3-turbo-acc	17.79%	4.34%	82.21%

To fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian

`Flashlight`

Model	WER	CER	Accuracy (words)
Flashlight Conformer	19.15%	2.44%	80.85%

`data2vec`

Model	WER	CER	Accuracy (words)
robinhad/data2vec-large-uk	31.17%	7.31%	68.83%

`VOSK`

Model	WER	CER	Accuracy (words)
v3	53.25%	38.78%	46.75%

`m-ctc-t`

Model	WER	CER	Accuracy (words)
speechbrain/m-ctc-t-large	57%	10.94%	43%

`DeepSpeech`

Model	WER	CER	Accuracy (words)
v0.5	70.25%	20.09%	29.75%

`moonshine-tiny-uk`

Model	WER	CER	Accuracy (words)
UsefulSensors/moonshine-tiny-uk	24.54%	7.58%	75.46%
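
In every benchmark table above, the Accuracy (words) column equals 100% minus WER. The sketch below shows one common way to compute WER and CER, as a Levenshtein edit distance over words or characters divided by the reference length. It is a minimal illustration, not the repository's evaluation script: the function names are hypothetical, and the text normalization applied before scoring (casing, punctuation) is not stated in this document, so none is applied here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_accuracy(wer_value):
    """Word accuracy as reported in the tables: 1 - WER."""
    return 1.0 - wer_value


# Hypothetical example: the recognizer dropped the last word of five.
reference = "місто в хмельницькій області україни"
hypothesis = "місто в хмельницькій області"

print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 20.00%
print(f"CER: {cer(reference, hypothesis):.2%}")
print(f"Accuracy (words): {word_accuracy(wer(reference, hypothesis)):.2%}")
```

Because WER counts insertions as well as substitutions and deletions, it can exceed 100% on very poor output, in which case the derived accuracy becomes negative.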

📖 Development

How to train your own model using Kaldi
How to train a KenLM model based on Ukrainian Wikipedia data: https://github.com/egorsmkv/ukwiki-kenlm
Export a traced JIT version of wav2vec2 models: https://github.com/egorsmkv/wav2vec2-jit

📚 Datasets

Compiled dataset: ~1200 hours

Dataset: https://nx16725.your-storageshare.de/s/cAbcBeXtdz7znDN. Download it with Wget or the torrent file; downloads through a browser are speed-limited.

⭐ Related works

Language models

Ukrainian LMs: https://huggingface.co/Yehor/kenlm-uk

Inverse Text Normalization

WFST for Ukrainian Inverse Text Normalization: https://github.com/lociko/ukraine_itn_wfst

Text Enhancement

Punctuation and capitalization model: https://huggingface.co/dchaplinsky/punctuation_uk_bert (demo: https://huggingface.co/spaces/Yehor/punctuation-uk)

Aligners

NeMo Forced Aligner: https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner
Aligner for wav2vec2-bert models: https://github.com/egorsmkv/w2v2-bert-aligner
Aligner based on FasterWhisper (mostly for TTS): https://github.com/patriotyk/narizaka
Aligner based on Kaldi: https://github.com/proger/uk

Other

A space to calculate ASR metrics: https://huggingface.co/spaces/Yehor/evaluate-asr-outputs
A space to see ASR outputs: https://huggingface.co/spaces/Yehor/see-asr-outputs

📢 Text-to-Speech

Test sentence with stresses:

К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.

Without stresses:

Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
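
In the stressed test sentence, a `+` is placed directly before each stressed vowel. The sketch below, with hypothetical helper names, shows how the two forms relate: removing the marks yields the unstressed sentence, and the marks can also be rewritten as a combining acute accent (U+0301) after the vowel. The acute-accent form is an assumption about an alternative notation, not something the models listed here are documented to accept.

```python
import re

STRESS_MARK = "+"
COMBINING_ACUTE = "\u0301"


def strip_stress(text):
    """Remove every '+' stress mark, giving the plain (unstressed) text."""
    return text.replace(STRESS_MARK, "")


def plus_to_acute(text):
    """Rewrite '+X' as 'X' followed by a combining acute accent."""
    return re.sub(r"\+(\w)", lambda m: m.group(1) + COMBINING_ACUTE, text)


stressed = (
    "К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, "
    "ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної "
    "гром+ади +і Кам'ян+ець-Под+ільського рай+ону."
)
plain = (
    "Кам'янець-Подільський - місто в Хмельницькій області України, "
    "центр Кам'янець-Подільської міської об'єднаної територіальної "
    "громади і Кам'янець-Подільського району."
)

# The stressed sentence minus its marks is exactly the unstressed sentence.
print(strip_stress(stressed) == plain)
print(plus_to_acute("м+істо"))
```

Keeping both forms side by side makes it easy to check whether a given TTS model benefits from explicit stress marks or handles plain text on its own.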

📦 Implementations

StyleTTS2

StyleTTS2 demo & the code

P-Flow TTS

P-Flow TTS

RAD-TTS

RAD-TTS, the voice "Lada"
RAD-TTS with three voices: Lada, Tetiana, and Mykyta

Coqui TTS

v1.0.0 using M-AILABS dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v1.0.0 (200,000 steps)
v2.0.0 using Mykyta/Olena dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v2.0.0 (140,000 steps)

Neon TTS

Coqui TTS model implemented in the Neon Coqui TTS Python Plugin. An interactive demo is available on Hugging Face. This model and others can be downloaded from Hugging Face, and more information can be found at neon.ai.

FastPitch

NVIDIA FastPitch: https://huggingface.co/theodotus/tts_uk_fastpitch

Balacoon TTS

Balacoon TTS, voices of Lada, Tetiana and Mykyta. Blog post on model release.

MMS

https://huggingface.co/facebook/mms-tts-ukr

📚 Datasets

Open Text-to-Speech voices for 🇺🇦 Ukrainian: https://huggingface.co/datasets/Yehor/opentts-uk
- Voice LADA, female
- Voice TETIANA, female
- Voice KATERYNA, female
- Voice MYKYTA, male
- Voice OLEKSA, male

⭐ Related works

Accentors

Grapheme-to-Phoneme

ipa-uk:

Charsiu G2P:

Other:

Misc

Tool for building a high-quality text-to-speech (TTS) corpus from audio and text books: https://github.com/patriotyk/narizaka
Text Normalization:
- https://huggingface.co/skypro1111/m2m100-ukr-verbalization (see the demo)
- https://huggingface.co/skypro1111/mbart-large-50-verbalization
Audio Aesthetics for opentts-uk: https://huggingface.co/datasets/Yehor/opentts-uk-aesthetics
