Idea: Using AI as a pre-processor to improve traditional TTS

2 points by phr4ts a month ago · 0 comments · 2 min read

I’ve been thinking about a way to make non-neural / traditional TTS sound much better without replacing the TTS engine itself.

The core idea is to insert an AI text pre-processor before TTS synthesis.

Instead of feeding raw text directly into TTS, an AI model parses and rewrites the text to optimize it for speech, handling things that current TTS pipelines do poorly unless the user is an SSML expert.
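To make the shape of this concrete, here is a minimal sketch of the pipeline, assuming the pre-processor and the TTS engine can both be treated as plain functions. All names (`speak`, `fake_preprocess`, `fake_synthesize`) are hypothetical stand-ins for illustration, not the API of any real engine or model.

```python
from typing import Callable

# Hypothetical interfaces: the pre-processor maps raw text to speech-ready text
# (plain text or SSML), and the synthesizer maps that text to audio bytes.
Preprocessor = Callable[[str], str]
Synthesizer = Callable[[str], bytes]


def speak(raw_text: str, preprocess: Preprocessor, synthesize: Synthesizer) -> bytes:
    """Run the pre-processing step, then hand the result to the unchanged TTS engine."""
    prepared = preprocess(raw_text)
    return synthesize(prepared)


# Stand-ins only: a real setup would call an AI model and a real TTS engine here.
def fake_preprocess(text: str) -> str:
    return text.replace("10 000", "ten thousand")


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


audio = speak("It costs 10 000 yen.", fake_preprocess, fake_synthesize)
```

The point of keeping the two steps separate is that the TTS engine stays untouched; only the text it receives changes.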

What the pre-processor would do:

1. Control pacing, rhythm and pitch: Automatically infer pauses, emphasis, and sentence flow. Most users don’t know SSML, but good pacing alone significantly improves perceived quality.

2. Context-aware pronunciation: For example, in “I want US to eat together,” “US” should be pronounced as “us,” not “U.S.”

3. Rewrite text for pronunciation clarity.

Normalize numbers: 10 000 → 10,000 or “ten thousand”

Adjust foreign names or ambiguous words

Add phonetic hints when needed (e.g., sake → “sayk”)

Make small rewrites that preserve meaning but improve speech output
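As a rough illustration of the kind of output such a pre-processor might produce, here is a rule-based stand-in for two of the steps above: number normalization and pause insertion. It is only a sketch. The function names and the 200 ms pause are assumptions, the regex rules are a crude substitute for what an AI model would infer from context, and it assumes the target engine accepts basic SSML (`<speak>`, `<s>`, `<break>`). Context-dependent cases like “US” vs. “us” are deliberately left out, since that is the part that needs a model.

```python
import re
from xml.sax.saxutils import escape

# Digits grouped by single spaces, e.g. "10 000" or "1 250 000".
GROUPED_NUMBER = re.compile(r"\b\d{1,3}(?: \d{3})+\b")


def normalize_numbers(text: str) -> str:
    """Rewrite space-grouped numbers with comma separators, e.g. "10 000" -> "10,000"."""
    return GROUPED_NUMBER.sub(lambda m: m.group(0).replace(" ", ","), text)


def to_ssml(text: str, comma_pause_ms: int = 200) -> str:
    """Wrap each sentence in <s> and add a short <break> after clause commas."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    parts = []
    for sentence in sentences:
        if not sentence:
            continue
        body = escape(sentence)  # escape &, <, > so the SSML stays well-formed
        # Only commas followed by whitespace get a pause, so "10,000" is left alone.
        body = re.sub(r",(?=\s)", f',<break time="{comma_pause_ms}ms"/>', body)
        parts.append(f"<s>{body}</s>")
    return "<speak>" + "".join(parts) + "</speak>"


def preprocess(text: str) -> str:
    """Hypothetical pre-processor: normalize numbers, then mark up pacing."""
    return to_ssml(normalize_numbers(text))


print(preprocess("Wait, it costs 10 000 yen. Really?"))
```

An AI-based version would replace the regex rules with model output but could keep the same contract: raw text in, engine-ready text or SSML out.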

This wouldn’t reach the quality of full neural TTS, but it could dramatically narrow the gap, especially for:

low-resource environments

embedded systems

legacy TTS engines

cost-sensitive use cases

Curious if anyone has seen similar approaches in production, or if this is already being done quietly somewhere.

