A lightweight library for normalizing speech transcripts before computing WER.
Quick Start · Step Reference · Contributing
## Why normalization matters
Word Error Rate (WER) is the standard metric for evaluating speech-to-text systems. But WER operates on raw strings — it has no notion of meaning. Two transcriptions that say the same thing in different surface forms get penalized as errors:
| Ground truth | STT output | Match without normalization |
|---|---|---|
| It's $50 | it is fifty dollars | 0/3 words match |
| 3:00 PM | 3 pm | 0/2 words match |
| Mr. Smith | mister smith | 0/2 words match |
These aren't transcription errors — they're formatting differences. Without normalization, WER scores become unreliable and comparisons across engines are meaningless.
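To make this concrete, here is a minimal, self-contained sketch of word-level WER (Levenshtein distance over words divided by reference length) applied to the "Mr. Smith" row above. The `normalize` lambda is a deliberately tiny stand-in for a real normalization pipeline, not this library's API:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# Toy stand-in for a normalization pipeline (illustrative only):
normalize = lambda s: s.lower().replace("mr.", "mister")

print(wer("Mr. Smith", "mister smith"))                          # 1.0 — every word "wrong"
print(wer(normalize("Mr. Smith"), normalize("mister smith")))    # 0.0 — no real errors
```

Without normalization, two perfectly correct words score as two substitutions; after normalization the WER drops to zero, reflecting actual recognition quality.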
gladia-normalization solves this by reducing both the ground truth and the STT output to a shared canonical form before WER is computed, so that only genuine recognition errors affect the score.
## What it does
The library runs your text through a configurable pipeline of normalization steps — expanding contractions, converting symbols to words, removing fillers, casefolding, and more — to produce a clean, canonical output.
```
Input:  "It's $50.9 at 3:00PM — y'know, roughly."
Output: "it is 50 point 9 dollars at 3 pm you know roughly"
```
The pipeline is deterministic, language-aware, and fully defined in YAML — run the same preset and get the same output every time.
## Quick start
### Installation
```shell
pip install gladia-normalization
```
### Install from source

```shell
git clone https://github.com/gladiaio/normalization.git
cd normalization
uv sync
```

### Usage
```python
from normalization import load_pipeline

# Load a built-in preset by name
pipeline = load_pipeline("gladia-3", language="en")
pipeline.normalize("It's $50 at 3:00PM")  # => "it is 50 dollars at 3 pm"
```
## How it works
Every pipeline runs exactly three stages, always in this order:
- Stage 1 — Text pre-processing: full-text transforms (protect symbols, expand contractions, convert numbers, casefold, remove symbols)
- Stage 2 — Word processing: per-token transforms (word replacements, filler removal)
- Stage 3 — Text post-processing: full-text cleanup (restore placeholders, collapse digits, format time patterns, normalize whitespace)
This ordering is a hard constraint. Some steps depend on earlier steps having run (e.g. a placeholder protecting a decimal point in Stage 1 must be restored in Stage 3, so that remove_symbols doesn't destroy it in between).
Pipelines are defined declaratively in YAML presets. Each preset lists the steps that run in each stage and the order they run in. See the full step reference for every available step.
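The protect/restore constraint can be illustrated with a self-contained sketch. These three functions are hypothetical stand-ins for pipeline steps, not the library's actual implementations, but they show why `remove_symbols` must run between a Stage 1 protect and a Stage 3 restore:

```python
import re

def protect_decimal_points(text: str) -> str:
    # Stage 1: swap the dot in "50.9" for a placeholder so symbol
    # removal cannot destroy it.
    return re.sub(r"(\d)\.(\d)", r"\1<DOT>\2", text)

def remove_symbols(text: str) -> str:
    # Stage 1 (later): strip punctuation. The placeholder survives
    # because '<' and '>' are kept.
    return re.sub(r"[^\w\s<>]", "", text)

def restore_placeholders(text: str) -> str:
    # Stage 3: turn the placeholder back into words.
    return text.replace("<DOT>", " point ")

text = "$50.9, roughly."
text = protect_decimal_points(text)   # "$50<DOT>9, roughly."
text = remove_symbols(text)           # "50<DOT>9 roughly"
text = restore_placeholders(text)     # "50 point 9 roughly"
print(text)
```

Run the same three functions in any other order and the decimal point is lost, which is exactly why the stage ordering is enforced.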
## Supported languages
| Code | Language |
|---|---|
| en | English |
| fr | French (alpha) |
Unsupported language codes fall back to a safe default that applies language-independent normalization only.
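The fallback pattern can be sketched in a few lines. The registry name, its contents, and the rule shapes below are illustrative assumptions, not the library's real internals:

```python
# Hypothetical per-language rule registry (illustrative only).
LANGUAGE_RULES = {
    "en": {"contractions": True, "fillers": ["uh", "um", "y'know"]},
    "fr": {"contractions": False, "fillers": ["euh", "ben"]},
}

# Safe default: language-independent normalization only.
DEFAULT_RULES = {"contractions": False, "fillers": []}

def rules_for(code: str) -> dict:
    # Unknown codes fall back to the default instead of raising.
    return LANGUAGE_RULES.get(code, DEFAULT_RULES)

print(rules_for("en"))  # English-specific rules
print(rules_for("xx"))  # falls back to DEFAULT_RULES
```

The key design point is that an unsupported code degrades gracefully rather than failing, so benchmarks over mixed-language datasets never crash mid-run.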
Adding a new language is self-contained — create a folder, register it with a decorator, done. See Contributing.
## Custom presets
A preset is a YAML file that declares which steps run in each stage and in what order.
```yaml
name: my-preset-v1
stages:
  text_pre:
    - protect_email_symbols
    - expand_contractions
    - casefold_text
    - remove_symbols
    - remove_diacritics
    - normalize_whitespace
  word:
    - apply_word_replacements
  text_post:
    - restore_email_at_symbol_with_word
    - restore_email_dot_symbol_with_word
    - normalize_whitespace
```
Load from your custom configuration:
```python
from normalization import load_pipeline

pipeline = load_pipeline("path/to/my-custom-configuration.yaml", language="en")
result = pipeline.normalize("some transcription text")
```
Inspect a loaded pipeline:
```python
pipeline.describe()
# {'name': 'my-preset-v1', 'language': 'en', 'text_pre_steps': [...], ...}
```
Preset rules:
- Step names must match the `name` attribute of a registered step class.
- Every `protect_*` step in `text_pre` requires a matching `restore_*` in `text_post`. The pipeline validates this at load time.
- List order is execution order.
- Published presets are immutable — new behavior means a new file.
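One way the protect/restore check could be sketched is below. This is a hypothetical validator, not the library's actual one: the real pairing logic is likely richer (for example, `protect_email_symbols` is restored by two email-specific steps), so this version only checks that each protected subject has at least one restore step mentioning it:

```python
def validate_preset(text_pre: list[str], text_post: list[str]) -> None:
    """Raise if a protect_* step has no restore_* counterpart (sketch)."""
    restores = [s for s in text_post if s.startswith("restore_")]
    for step in text_pre:
        if step.startswith("protect_"):
            # e.g. "protect_email_symbols" -> subject "email"
            subject = step.removeprefix("protect_").split("_")[0]
            if not any(subject in r for r in restores):
                raise ValueError(f"{step!r} has no matching restore_* step")

# Passes: "email" appears in a restore step.
validate_preset(
    ["protect_email_symbols", "casefold_text"],
    ["restore_email_at_symbol_with_word", "normalize_whitespace"],
)
```

Failing the check at load time, rather than producing silently corrupted output at run time, is the point of the rule.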
## Contributing
Bug reports, new steps, and new language support are all welcome. See CONTRIBUTING.md for the full guide — including how to add steps, add languages, write tests, and the commit style we follow.
### Development
```shell
uv run pre-commit install --install-hooks  # install hooks once after cloning
uv run pytest         # run tests
uv run ruff check .   # lint
uv run ruff format .  # format
uv run ty check       # type-check
```
## About
gladia-normalization grew out of internal tooling at Gladia, where we are building an audio intelligence platform powered by speech recognition. When benchmarking ASR systems, we kept hitting the same problem: computing WER from raw transcripts penalizes formatting differences that have nothing to do with transcription quality. We built this library to solve it for ourselves, then open-sourced it so the broader speech community doesn't have to solve it again.
Sharing it felt like the right next step: the problem is universal, and community contributions are the best way to make reliable normalization available for every language, not just the ones we support today.