Crafting authentic vocal performances via interpolable in-context cues
Paper [TBD]
Sara Weisweaver, PhD; Rhyan Johnson; Sarah Fairweather; Alison Ma; Jordan Hoskins; Michael Petrochuk [1]
WellSaid Labs Research
Abstract. Despite recent progress in generative speech modeling, generating high-quality, diverse samples from complex datasets remains an elusive goal. This work introduces HINTS, a novel generative model architecture combining a state-of-the-art neural text-to-speech (TTS) model with contextual annotations. We learn a separate mapping network that accepts any manner of supervised annotations for controlling the generator, allowing for scale-specific modulation and interpolation operations such as loudness and tempo adjustments. This setup ensures that our annotations are consistent, interpretable, and context-aware. Audio samples are available below. A beta model built on the HINTS architecture is available on wellsaidlabs.com.
Introduction
In recent years, generative models have ushered in a paradigm shift in content production. Despite their transformative capabilities, ensuring these models adhere to specific creative preferences remains challenging. The prevailing method for controlling generative models is natural language description (i.e., prompting). However, many artistic preferences are nuanced and difficult to put into words.
The method introduced in StyleGAN [2] and its related models offers an alternative approach. StyleGAN decouples latent spaces, enabling precise manipulations that range from high-level attributes to finer details. These controls are not only precise and interpolable but also interpretable and context-aware.
Today, we announce a breakthrough in generative modeling for speech synthesis: HINTS (Highly Intuitive Naturally Tailored Speech).
Our flagship text-to-speech model learns a separate mapping network that maps from contextual annotations (cues) to a latent space 𝒲 that controls the generator. This allows for generating high-quality and diverse performances of the same script and speaker through a consistent, interpretable, and context-aware mechanism.
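To make the architecture concrete, here is a minimal sketch of such a mapping network in PyTorch. Everything below is illustrative: the module names, dimensions, and cue encoding are hypothetical, not our production implementation.

```python
import torch
import torch.nn as nn

class CueMappingNetwork(nn.Module):
    """Maps a supervised annotation (cue) to a style vector w in W.

    A minimal sketch: cue_type is a categorical ID (e.g., loudness,
    tempo) and cue_value is its scalar setting (e.g., -20..16).
    """

    def __init__(self, num_cue_types: int = 8, w_dim: int = 256):
        super().__init__()
        self.type_embedding = nn.Embedding(num_cue_types, w_dim)
        self.mlp = nn.Sequential(
            nn.Linear(w_dim + 1, w_dim),
            nn.ReLU(),
            nn.Linear(w_dim, w_dim),
            nn.ReLU(),
            nn.Linear(w_dim, w_dim),
        )

    def forward(self, cue_type: torch.Tensor, cue_value: torch.Tensor) -> torch.Tensor:
        # Concatenate the learned cue-type embedding with the scalar value,
        # then project into W, where it modulates the generator's layers.
        h = torch.cat([self.type_embedding(cue_type), cue_value.unsqueeze(-1)], dim=-1)
        return self.mlp(h)

# Example: a loudness cue (hypothetical type ID 0) at value 4.0.
mapper = CueMappingNetwork()
w = mapper(torch.tensor([0]), torch.tensor([4.0]))
print(w.shape)  # torch.Size([1, 256])
```

Because the mapping network is separate from the generator, any manner of supervised annotation can be routed through it without retraining the synthesis backbone from scratch.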
Initially, we studied loudness and tempo cues, addressing their historical challenges within this framework. Where traditional loudness controls simply scale decibel output, our loudness cue yields a range of performances that also vary in timbre, which is important for natural prosody. Similarly, our tempo cue changes speaking rate without shifting pitch, sidestepping the inverse relationship between frequency and time that complicates naive time-stretching. Both cues, applied individually or nested, allow for an expansive range of realistic, expressive, and performative synthesized audio.
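As a concrete picture of what nesting means, consider annotation markup in the spirit of the respell tags shown later in this paper. The tag names and value scales below are ours for illustration only and may not match the production syntax:

```python
# Hypothetical cue markup: a tempo cue nested inside a loudness cue.
# The loudness scale mirrors the values used in this paper (roughly
# -20 to 16); tempo is a rate multiplier (roughly 0.3 to 2.7).
script = (
    '<loudness value="4">'
    "Join thousands of happy customers. "
    '<tempo value="0.7">Sign up today.</tempo>'
    "</loudness>"
)
```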
This general framework supports many types of cues, and we will be releasing more soon. Audio samples are included below.
Please email our CTO, Michael Petrochuk (michael [at] wellsaidlabs.com), with any questions.
Range and Diversity
In a novel approach, we use annotations alone to guide the model to widely varied points in the solution space for a single target speaker reading the same sample script.
Sample set 1
We use annotations to craft three distinct listener-friendly versions of the same script. Sample 1D includes
a tempo annotation, a loudness annotation, and a tempo annotation nested inside a loudness annotation.
Speaker: Ben D.
Style: Narration
Source speaker language & location: English, South Africa
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 1A | No annotations | | |
| 1B | Call to action slowed | | |
| 1C | Activity name made louder and slower | | |
| 1D | Focus on eliciting user response | | |
Sample set 2
We show how a large area of the solution space can be represented by applying maximum and minimum
value annotations. Sample 2F shows how cues can be used to emphasize and slow down the key
technical information in this passage.
Speaker: Terra G.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 2A | No annotations | | |
| 2B | Louder | | |
| 2C | Quieter | | |
| 2D | Slower | | |
| 2E | Faster | | |
| 2F | Key information emphasized | | |
Sample set 3
Our catalog of avatars is responsive to cues. Cues can be nested, even at maximum
levels.
Speaker: Alan T.
Style: Narration
Source speaker language & location: English, United Kingdom
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 3A | No annotations | | |
| 3B | Faster | | |
| 3C | Slower | | |
| 3D | Quieter | | |
| 3E | Louder | | |
| 3F | Louder and Faster | | |
| 3G | Louder and Slower | | |
| 3H | Slower and Quieter | | |
| 3I | Quieter and Faster | | |
Sample set 4
Combinations of annotations, particularly on texts open to nuanced actor delivery, produce audio clips with diverse emotional tones.
Speaker: Jordan T.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 4A | No annotations | | |
| 4B | Casual, off-hand effect | | |
| 4C | Measured, emotive effect | | |
Robustness
The model responds intuitively to a wide variety of inputs: different cue and text lengths, annotation combinations, and nesting patterns.
Sample set 5
Annotations allow users to direct the AI to see their artistic vision through. Avatars respond to cues in ways consistent with their own individual styles.
Speaker: Paige L.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 5A | Editor's choice: entire clip slowed, with loudness increased on key phrases | | |
Speaker: Paul B.
Style: Promo
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 5B | Paced (pauses lengthened) and directed for a short ad to appear on social media | | |
Speaker: Ramona J.
Style: Promo
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 5C | Paced (pauses lengthened) and directed for a short ad to appear on social media | | |
Sample set 6
Annotations can be applied to very long passages with no degradation. An annotation applied to the penultimate paragraph produces the expected effect, while the final paragraph is still delivered in the default, non-annotated style.
Speaker: Lulu G.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 6A | 4239 characters generated in a single take with no annotations and no degradation [3] | | |
| 6B | 4239 characters generated in a single take with the second-to-last paragraph annotated. The model shows no degradation in the annotated portion, and resumes normal loudness and tempo for the final paragraph. | | |
| 6C | 4239 characters generated in a single take with the entire passage annotated and no degradation | | |
Interpolation
The model can generate audio across a range of annotation values, allowing for precise control. In the following sample sets, we illustrate the model's capacity for incremental, scaled increases or decreases of specific audio attributes. We show the control sentence, which is unannotated, alongside incremental increases or decreases in loudness and tempo. Our examples match what users would actually want to do (see the sketch after this list), such as:
- increasing the loudness of dialogue in an audiobook clip;
- decreasing the loudness of a confession;
- increasing the pace for a legal disclaimer; and
- decreasing the pace for a technical definition to aid comprehension.
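Under the hood, this kind of sweep can be pictured as interpolation in the latent space 𝒲. The sketch below reuses the hypothetical CueMappingNetwork from the introduction and is illustrative only; the model's actual interpolation mechanism may differ:

```python
import torch

def sweep_cue(mapper, cue_type: int, v_lo: float, v_hi: float, steps: int = 5):
    """Yield style vectors blended linearly between two cue settings."""
    t = torch.tensor([cue_type])
    w_lo = mapper(t, torch.tensor([v_lo]))
    w_hi = mapper(t, torch.tensor([v_hi]))
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Each blend is an intermediate point in W, i.e., an
        # intermediate loudness or tempo setting.
        yield (1 - alpha) * w_lo + alpha * w_hi

# Example: five loudness settings spanning -4 to 8 (hypothetical type ID 0).
# for w in sweep_cue(mapper, cue_type=0, v_lo=-4.0, v_hi=8.0): ...
```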
Sample set 7
Dialogue from this passage in Kafka's Metamorphosis is made gradually louder.
Speaker: Garry J.
Style: Narration
Source speaker language & location: English, Canada
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 7A | No annotations | | |
| 7B | Dialogue louder at 2 | | |
| 7C | Dialogue louder at 4 | | |
| 7D | Dialogue louder at 6 | | |
| 7E | Dialogue louder at 8 | | |
| 7F | Dialogue at full user-facing loudness, 10 | | |
Sample set 8
The middle sentence of this invented customer dialogue is made gradually quieter.
Speaker: Zach E.
Style: Promo
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 8A | No annotations | | |
| 8B | Middle sentence quieter at -2 | | |
| 8C | Middle sentence quieter at -4 | | |
| 8D | Middle sentence quieter at -8 | | |
| 8E | Middle sentence quieter at -12 | | |
| 8F | Middle sentence at maximum user-facing quiet, -20 | | |
Sample set 9
A content warning is delivered at an incrementally increased pace.
Speaker: Sofia H.
Style: Conversational
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 9A | No annotations | | |
| 9B | Faster at 1.3 | | |
| 9C | Faster at 1.6 | | |
| 9D | Faster at 1.9 | | |
| 9E | Faster at 2.2 | | |
| 9F | Fastest user-facing pace, 2.5 | | |
Sample set 10
In this definition of Boyle's Law, taken from Wikipedia, the key defining phrase is delivered at an incrementally decreased pace. The respelling nested inside the slowed passage responds as expected, with no pronunciation degradation.
Speaker: Michael V.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 10A | No speed annotations, one respelling cue | | |
| 10B | Definition slower at 0.9 | | |
| 10C | Definition slower at 0.8 | | |
| 10D | Definition slower at 0.7 | | |
| 10E | Definition slower at 0.6 | | |
| 10F | Definition at slowest user-facing pace, 0.5 | | |
Annotating Silences
Cues can be effectively applied to spaces and punctuation marks to customize pausing and spacing.
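For example (hypothetical tag syntax again, following the illustrative markup shown in the introduction), lengthening the pause at a comma amounts to wrapping the punctuation mark itself in a tempo cue:

```python
# Hypothetical markup: slow only the comma, lengthening its pause
# while leaving the surrounding words untouched.
script = 'Take a breath<tempo value="0.4">,</tempo> and begin.'
```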
Sample set 11
Periods, commas, ellipses, and colons are slowed to create a moment of pause while preserving
each text's prosody.
Speaker: Cameron S.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 11A | No annotations | | |
| 11B | Pause lengthened on a comma | | |
Speaker: Ali P.
Style: Narration
Source speaker language & location: English, Australia
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 11C | No annotations | | |
| 11D | Pause lengthened on a period | | |
Speaker: Joe F.
Style: Promo
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 11E | No annotations | | |
| 11F | Pauses lengthened on three periods | | |
| 11G | Pauses slightly lengthened on three periods; final phrase slowed and quieted for dramatic effect | | |
Speaker: Lulu G.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 11H | No annotations | | |
| 11I | Pause lengthened on a colon | | |
Annotation Generalization Samples
Within cued performances, the model can push a target speaker’s performance range beyond what is present in the source speaker’s training data.
Sample set 12
In the following audio samples, we include the maximum and minimum portions of the original gold dataset for loudness (measured in LUFS, loudness units relative to full scale) and tempo (measured in CPS, characters per second). These are presented alongside the synthetic voice outputs' maximum and minimum performances for loudness and tempo.
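Both measurements can be reproduced offline. Here is a minimal sketch using the pyloudnorm package for integrated LUFS and a simple character count for CPS; the file name and transcript below are hypothetical:

```python
import soundfile as sf
import pyloudnorm as pyln

# Load a recording and measure integrated loudness (LUFS) per
# ITU-R BS.1770, the standard pyloudnorm implements.
data, rate = sf.read("segment.wav")  # hypothetical file
meter = pyln.Meter(rate)
lufs = meter.integrated_loudness(data)

# Tempo as characters per second (CPS) over the segment.
transcript = "The 8-track tape is a magnetic-tape sound-recording technology."
cps = len(transcript) / (len(data) / rate)

print(f"loudness: {lufs:.1f} LUFS, tempo: {cps:.1f} CPS")
```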
Speaker: Lee M.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 12A | Source recording (gold): loudest voiced speech segment | | |
| 12B | Synthetic voice at loudness 16 | | |
| 12C | Source recording (gold): slowest voiced speech segment | | |
| 12D | Synthetic voice at pace 0.3 | | |
Speaker: Jude D.
Style: Narration
Source speaker language & location: English, South Africa
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 12E | Source recording (gold): fastest voiced speech segment | | |
| 12F | Synthetic voice at tempo 2.7 | | |
| 12G | Source recording (gold): quietest voiced speech segment | | |
| 12H | Synthetic voice at loudness -50 | | |
Emergent Behaviors
The model learns annotations in context, such that cues affect pronunciation and prosody. This demonstrates the model's capacity to generalize, and we expect further emergent behaviors to appear with subsequent training.
Sample set 13
Extreme tempo and loudness annotations, when nested, can prompt a dramatic performance that
impacts syllabic stress. Extreme slow annotations can prompt the model to spell the annotated word.
Speaker: Damian P.
Style: Promo
Source speaker language & location: English, Canada
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 13A | Extreme slow + loud cues impact word-level prosody | | |
Speaker: Fiona H.
Style: Narration
Source speaker language & location: English, United Kingdom
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 13B | Extreme slow cue prompts word spelling | | |
Speaker: Se’Von M.
Style: Narration
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 13C | No annotations | | |
| 13D | Extreme slow cue prompts word spelling | | |
| 13E | No annotations | | |
| 13F | Extreme slow cue prompts word spelling | | |
Speaker: Genevieve M.
Style: Promo
Source speaker language & location: English, United States
| Sample | Descriptor | Text Input | Audio Output |
|---|---|---|---|
| 13G | Fast cue within user range prompts fast word delivery | | |
| 13H | Extreme slow cue prompts word spelling | | |
Future State
We have shown that the annotation mapping network generalizes well and is context sensitive, supporting diverse input variability and an expansive annotation range. This framework can be quickly expanded in near-future releases with additional annotations such as pitch, brightness, fullness, range, and breath control.
We have already prototyped a pitch annotation:
Speaker: Charlie Z.
Style: Narration
Source speaker language & location: English, Canada
Script: A new art exhibit is drawing crowds at the city’s museum.
| Sample | Audio Type | Pitch Annotation Value | Audio Output |
|---|---|---|---|
| A | Griffin-Lim | -200 Hz | |
| B | Griffin-Lim | +300 Hz | |
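"Griffin-Lim" above names the phase-reconstruction algorithm used to vocode these prototype samples. As a minimal sketch of that inversion step in isolation (using librosa; the file names are hypothetical, and this is only the vocoder, not the HINTS model itself):

```python
import librosa
import soundfile as sf

# Compute a mel spectrogram from a reference clip, then invert it
# back to a waveform with Griffin-Lim phase reconstruction.
y, sr = librosa.load("prototype.wav", sr=22050)  # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=60)
sf.write("griffin_lim_out.wav", y_hat, sr)
```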
With additional training, more data, and subsequent model improvements, we are excited for the creative applications of this approach.
Ethical Responsibility
Our commitment to the principles of Responsible AI informed the practices and approaches we used for this work. Specifically, we leaned on the principles of accountability, transparency, and privacy & security.
The tenets of accountability and transparency are reflected in our requirement that we have the explicit, informed consent of any individual who records voice datasets for WellSaid Labs. End users can only access target speakers built from datasets recorded by voice talent who have provided consent for end-user access. Additionally, annotations do not provide a means for end users to guide the model into a different target speaker’s solution space. We limit the collection of data from other providers to those that are open source (e.g., LibriSpeech), and voices we create using that open source data are not available to our users.
The principle of privacy & security drives us to design our systems so that we can protect the privacy of our users and participants who provide us with their voice datasets, and reduce opportunities for data or voice avatars to be misused. Our Trust & Safety team ensures that all users undergo identity verification when creating an account, and content created on our platform is subject to robust content moderation, limiting the creation and release of content that does not align with our Terms of Service.
Acknowledgments
Contributors. Michael Petrochuk; Sara Weisweaver; Rhyan Johnson; Jordan Hoskins; Sarah Fairweather; Courtney Mathy; Alecia Murray; Alison Ma; Daniel “Dandie” Swain, Jr.; Jon Delgado; Jessica Petrochuk
A special thank you to the voice talent who make our avatars possible, especially those featured in this paper: Alan T., Ali P., Ben D., Cameron S., Damian P., Fiona H., Garry J., Genevieve M., Joe F., Jordan T., Jude D., Lee M., Lulu G., Michael V., Paige L., Paul B., Ramona J., Se’Von M., Sofia H., Terra G., and Zach E.
Footnotes
2. A Style-Based Generator Architecture for Generative Adversarial Networks, arxiv.org/abs/1812.04948
3. Complete text without tempo and loudness cues is as follows:
8-track cartridge.
From Wikipedia, the free encyclopedia.
The 8-track tape is a magnetic-tape sound-recording technology that was popular from the mid-1960s to the early 1980s, when the compact <respell value="kuh=SEHT">cassette</respell>, which pre-dated the 8-track system, surpassed it in popularity for pre-recorded music.
The format was commonly used in cars and was most popular in the United States and Canada and, to a lesser extent, in the United Kingdom. One advantage of the 8-track tape cartridge was that it could play continuously in an endless loop, and did not have to be "flipped over" to play the entire tape. After about 80 minutes of playing time, the tape would start again at the beginning. Because of the loop, there is no rewind. The only options the consumer has are "play", "fast forward", "record", and "program change".
The Stereo 8 Cartridge was created in 1964 by a consortium led by Bill Lear, of Lear Jet Corporation, along with Ampex, Ford Motor Company, General Motors, Motorola, and RCA Victor Records.
The 8-track tape format is now considered obsolete, although there are collectors who refurbish these tapes and players as well as some bands that issue these tapes as a novelty. Cheap Trick's “The Latest” in 2009 was issued on 8-track, as was Dolly Parton's “A Holly Dolly Christmas” in 2020, the latter with an exclusive bonus track.
.
Technology.
The cartridge's dimensions are approximately 5.25, by 4, by 0.8 inches. The magnetic tape is played at three and three-fourths inches per second, is <respell value="WOWND">wound</respell> around a single spool, is about 0.25 inches wide, and contains 8 parallel tracks. The player's head reads two of these tracks at a time for stereo sound. After completing a program, the head mechanically switches to another set of two tracks, creating a characteristic clicking noise.
.
Commercial success.
The popularity of both four-track and eight-track cartridges grew from the booming automobile industry. In September 1965, the Ford Motor Company introduced factory-installed and dealer-installed eight-track tape players as an option on three of its 1966 models (the sporty <respell value="MUH-stayng">Mustang</respell>, luxurious Thunderbird, and high-end Lincoln), and RCA Victor introduced 175 Stereo-8 Cartridges from its RCA Victor and RCA Camden labels of recording artists catalogs. By the 1967 model year, all of Ford's vehicles offered this tape player upgrade option. Most of the initial factory installations were separate players from the radio, but dashboard-mounted 8-track units were offered in combination with an AM radio, as well as with AM-FM receivers.
The 8-track format gained steadily in popularity because of its convenience and portability. Home players were introduced in 1966 that allowed consumers to share tapes between their homes and portable systems. By the late 1960s, the 8-track segment was the largest in the USA consumer electronics market and the popularity of 8-track systems for cars helped generate demand for home units. "Boombox" type portable players were also popular. But, eight-track player-recorders failed to gain wide popularity, and few manufacturers offered them, except for manufacturer Tandy Corporation for its Radio Shack stores. With the availability of cartridge systems for the home, consumers started thinking of eight-tracks as a viable alternative to 33 rpm album-style vinyl records, not only as a convenience for the car. Also by the late 1960s, prerecorded releases on the 8-track tape format began to arrive within a month of the vinyl release. The 8-track format became by far the most popular and offered the largest music library of all the tape systems in the USA.
.
Decline.
1978 was the peak year for 8-track sales in the United States, with sales declining rapidly from then on. Eight-track players became less common in homes and vehicles in the late 1970s. The compact <respell value="kuh=SEHT">cassette</respell> had arrived in 1963.
In the U.S., eight-track cartridges were phased out of retail stores in late 1982 and early 1983. However, some titles were still available as eight-track tapes through Columbia House and RCA Music Service Record Clubs until late 1988. Until 1990, Radio Shack (Tandy Corporation) continued to sell blank eight-track cartridges and players for home recording use under its Realistic brand.