Deep Voice: Real-Time Neural Text-To-Speech
research.baidu.com
Hey there! I'm one of the authors of the paper and I'm happy to answer any questions anyone may have!
Make sure to check out the paper on arxiv as well.
Hi,
That is some very nice and interesting work! In fact, I have also worked on exactly the same thing, so I'm impressed by your accomplishments.
How much have you played around with different local condition features, i.e. the phoneme signal? Was it always with 256 Hz? Have you always used nearest-neighbor for upsampling to 16 kHz? Have you always used those 2 + (1 + 2 + 2) * (40 + 5) = 227 dimensions? We tried just with 39 dimensional phonemes, which also worked but the quality was not so nice and it sounded very robotic, probably due to missing F0. We also only had 100 Hz, but we tried some variants to upscale it to 16 kHz, like linear interpolation or deconv or combinations of them.
In the local conditioning network, you used QRNNs. Did you also try simpler methods, like just pure convolution? (And then the upsampling like you did, by nearest neighbor.)
You are predicting phone duration + F0. Have you also tried an encoder-decoder approach instead, like in Char2Wav? I.e. instead of the duration prediction, you let the decoder unroll it. Then, also like Char2Wav, you can also combine that directly with your Grapheme-to-Phoneme model. Have you tried that?
Did you also try some global condition, like speaker identity?
We also tried all the sampling methods you are listing and observed the same behavior, i.e. only the direct sampling really works. I tried many more deterministic variants (like taking mean) but none of them worked. This is a bit strange. Also the quality can vary depending on the random seed.
Thanks, Albert
Feel free to get in touch for more Q/A, my email is in my profile.
We've experimented a bunch with many of these hyperparameters. Our phoneme signal has mostly stayed 256 Hz, but we've done a few experiments with lower-frequency signals that indicate it's probably possible to reduce it.
We have used many types of upsampling, and find that the upsampling and conditioning procedure does not affect the quality of the audio itself, but does affect the frequency of pronunciation mistakes. We used bicubic and bilinear interpolation-based upsampling, as well as transposed convolutions and a variety of other simpler convolutions (for example, per-channel transposed convolutions). These tend to work and converge, but then generate pronunciation mistakes on difficult phonemes. A full transposed convolution upsampling (two transposed convolution layers with stride 8 each) works almost as well as our bidirectional QRNNs, but it's much, much more expensive in terms of compute and parameters, and takes longer to train as well.
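To make the repetition ("nearest-neighbor") upsampling concrete, here is a minimal NumPy sketch of that step in isolation, assuming conditioning features at a low frame rate (256 Hz, as in the question above) being stretched to the 16 kHz audio rate. The function name, frame rate, and feature dimension are illustrative, not taken from the Deep Voice code.

    import numpy as np

    def upsample_nearest(cond, cond_hz=256.0, audio_hz=16000.0):
        """Nearest-neighbor ("repetition") upsampling of conditioning features.

        cond: (n_frames, n_features) array at cond_hz frames per second.
        Returns a (n_samples, n_features) array aligned with the audio rate,
        where each audio sample reuses the closest conditioning frame.
        """
        n_frames = cond.shape[0]
        n_samples = int(round(n_frames * audio_hz / cond_hz))
        # Map each audio-rate index back to the nearest conditioning frame.
        frame_idx = np.round(np.arange(n_samples) * cond_hz / audio_hz).astype(int)
        frame_idx = np.minimum(frame_idx, n_frames - 1)
        return cond[frame_idx]

    # Example: one second of 227-dimensional features at 256 Hz -> 16 kHz.
    cond = np.random.randn(256, 227).astype(np.float32)
    print(upsample_nearest(cond).shape)  # (16000, 227)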
As noted in the paper, we used many of the original features used for WaveNet before reducing our feature set. F0 is definitely important for proper intonation. We find that including the surrounding phonemes is quite important; with the bidirectional QRNN upsampling, leaving those out still works, but not nearly as well. It seems likely that a different conditioning network would remove the need for those "context" phonemes.
We have not yet used an encoder-decoder approach for duration or F0. Char2Wav has a bunch of interesting ideas, and it may be a direction for our future work. However, we do not plan on folding the grapheme-to-phoneme model into our main model, because it's crucial that we can easily override the pronunciation of words with a phoneme dictionary; by having an explicit grapheme-to-phoneme step, we can easily set the pronunciation for unseen words (like "P!nk" or "Worcestershire"; an integrated grapheme-to-phoneme model would not be able to handle those, and even humans usually cannot!).
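As a toy illustration of that override argument (not the authors' code, and the phoneme strings are rough ARPAbet guesses), an explicit G2P stage makes a pronunciation fix a one-line dictionary edit, with the learned model only handling misses:

    # Hypothetical sketch: explicit pronunciation dictionary with a learned
    # grapheme-to-phoneme fallback. `g2p_model` stands in for any trained model.
    PRONUNCIATION_OVERRIDES = {
        "p!nk": ["P", "IH1", "NG", "K"],   # unpronounceable spelling
        "worcestershire": ["W", "UH1", "S", "T", "ER0", "SH", "ER0"],
    }

    def to_phonemes(word, g2p_model):
        key = word.lower()
        if key in PRONUNCIATION_OVERRIDES:   # hand-edited dictionary wins
            return PRONUNCIATION_OVERRIDES[key]
        return g2p_model(word)               # learned model handles everything else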
We have not yet worked with speaker global conditioning, but it is likely that the results from the WaveNet paper apply to our WaveNet implementation as well.
Finally, as for sampling, we have not seen much variation due to random seed for a fully converged model. However, our intuition for why sampling is important is that the speech distribution is (a) multimodal and (b) biased towards silence. If you are interested, you can gain a little bit of intuition about what the distribution actually looks like by just plotting a color map across time, with high-probability values being bright and low probability values being dark; it generates a pretty plot, and you can see that some areas are clearly stochastic (especially fricatives) and some areas are multimodal (vowel wave peaks).
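For completeness, here is a minimal sketch of that sampling/plotting idea, assuming a 256-way softmax over quantized sample values per timestep as in the original WaveNet. The `probs` array below is random filler so the snippet runs on its own; it stands in for distributions collected while running the vocoder.

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in for per-timestep output distributions of shape (T, 256).
    probs = np.random.dirichlet(np.ones(256) * 0.05, size=2000)
    rng = np.random.default_rng(0)

    # Direct (ancestral) sampling: draw one quantization bin per timestep.
    sampled = np.array([rng.choice(256, p=p) for p in probs])

    # Deterministic alternatives that tend to fail on multimodal frames.
    argmax_decode = probs.argmax(axis=1)
    mean_decode = (probs * np.arange(256)).sum(axis=1)

    # The color map mentioned above: time on x, quantization bin on y,
    # brightness = probability.
    plt.imshow(probs.T, aspect="auto", origin="lower", cmap="magma")
    plt.xlabel("timestep")
    plt.ylabel("quantized sample value")
    plt.show()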
How close (# years?) are we to being able to replicate the voices of any given individual with sufficient samples of their voiceprint?
It's hard to say! We don't quite know exactly how many parameters or minutes of audio are needed to fully describe someone's voice and speaking patterns. Maybe one or two, maybe much more.
Do you expect derivatives of this to surpass the effort by Adobe with VoCo? From my untrained perspective they appear quite similar in functionality
I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned a huge amount. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. But, even so, parametric models tend to be much smaller and more flexible, so there may be applications where WaveNet-style systems can be used in ways concatenative systems can't handle (high-quality on-device TTS, emotive TTS, speaker synthesis for new, unheard speakers, etc.).
A simpler problem could be to identify someone based on voice. Is that problem already solved? And can we use this to solve the problem of generating someone's voice?
That has been possible for years, and is even a typical student assignment in speech processing courses. A quick search gave this example course at Cornell
http://people.ece.cornell.edu/land/courses/ece5760/FinalProj...
"My voice is my passport."
Sneakers (1992): My Voice Is My Passport
Verify me.
aahahahaha !
AFAIK VoCo isn't creating anything from thin air; instead it scans the available voice data (it reportedly needs a sample of about 20 minutes of a person speaking) and copies fragments of it in a specific order to create a sentence.
Hi Andrew, congratulations on your result! A few questions, feel free to answer one or any. How close do you think you are to having fully end-to-end models for speech? Are you optimistic we can get speech synthesis to run on mobile devices in the near future? Do the inference optimizations (particularly sample embedding and layer inference) generalize well to other architectures, like speech recognition? It seems that if these models are going to run offline in realtime on mobile devices, we will need to have specialized hardware, but maybe we can squeeze enough performance out of mobile CPUs to get a highly optimized version to work. Thanks!
Thank you!
For fully end-to-end models, it's hard to say exactly. The Char2Wav paper demonstrates that there is hypothetically an architecture and a set of weights that can do synthesis end-to-end, but we cannot yet train such a system. On Reddit, one of the Char2Wav authors comments that they tried training it directly and didn't get great results, and at SVAIL we've also had some trouble doing so. I think it is very likely going to happen in the next several months or year, but we don't yet know exactly what needs to happen in order to get it to work.
As for inference, some of the inference optimizations do generalize. In fact, the GPU optimizations (persistent kernels) were originally developed by our systems team, and published in the Persistent RNN [0] paper. (This is a really powerful technique that CUDA makes very hard to implement, and I have a massive amount of respect for the folks who managed to make it work!) Persistent RNNs make training at close-to-peak-FLOPs with very low batch sizes plausible, and make GPU WaveNet inference plausible. At the moment, our CPU kernels are much more promising, but we don't know whether that will stay the case. For mobile, I think it is possible to get the current systems to work on fairly powerful mobile CPUs with a bunch more work into optimization and low-level assembly, but we haven't done it yet so time will tell.
[0] https://svail.github.io/persistent_rnns/ and http://jmlr.org/proceedings/papers/v48/diamos16.pdf
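To give a feel for why the inference side needs this level of optimization, here is a back-of-the-envelope sketch (purely illustrative, not the paper's implementation): WaveNet generation is strictly sequential, one sample per step, so real time at 16 kHz leaves roughly 62.5 microseconds per sample for the whole network.

    # Illustrative only: the per-sample time budget and the shape of the
    # autoregressive loop; `step_fn` is a hypothetical single-step network call.
    SAMPLE_RATE = 16_000
    budget_us = 1e6 / SAMPLE_RATE
    print(f"per-sample budget: {budget_us:.1f} us")   # 62.5 us

    def generate(step_fn, conditioning, state):
        sample, audio = 0, []
        for cond_frame in conditioning:               # one network step per sample
            sample, state = step_fn(sample, cond_frame, state)
            audio.append(sample)
        return audio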
>> Are you optimistic we can get speech synthesis to run on mobile devices in the near future?
You mean high quality, right? I mean, speech synthesis that is understandable and can run on cheap hardware has been around for decades. Speech recognition has also been around for a long time, but there's a huge difference in usability between "pretty good recognition" and "pretty good synthesis". One is useful, the other not so much.
Nice job! The samples sound good.
Is there an implementation of this to check out? It seems like you needed to write some custom, low-level code to implement this in real time. Which libraries did you use to build the ANNs and run the inference?
We are not currently releasing any code, but hopefully the paper on arxiv is enough to make it easy to reproduce the result.
We use TensorFlow for writing and training the model, and C++ with a lot of hand optimization for inference, with assembly kernels written with PeachPy (which is an awesome piece of software!).
I didn't know Baidu uses TensorFlow! Is that the main framework there now or are others in use?
Baidu as a company doesn't use TensorFlow (as far as I know). We have our own high-performance and easy to use open source framework called PaddlePaddle [0], which is quite powerful and flexible.
However, the Baidu Research Silicon Valley AI Lab (SVAIL) allows researchers and research teams to use whatever frameworks they want to, and we have projects using TensorFlow, Torch, our own SVAIL-internal high-performance RNN framework, and PaddlePaddle. Using our own framework sometimes allows us to work on very high-performance implementations of various primitives and techniques that would be harder to do without complete control over the source code.
Hey, one quick question, did the QRNNs work better and faster than LSTMs out of the box, or did you guys have to tune hyperparameters?
We didn't actually try LSTMs, because we train on 1.25-second chunks, so running an LSTM for several hundred timesteps would drastically slow down training. Our per-iteration time was in the 200-500 millisecond range, and using an LSTM or GRU would likely bump that into the 1-3 second range, maybe more, whereas the QRNN conditioning actually makes it 20-40% cheaper than the transposed convolution conditioning.
The upsampling procedure is quite finicky, so we went through quite a few iterations there, but we didn't have to tune the hyperparameters of the QRNN itself very much. Once we implemented the QRNN in CUDA for TensorFlow and got it to train, it worked without too much trouble.
Our collaborators in Beijing mentioned that bidirectional LSTMs also worked in a similar way, though.
Ah, thanks.
For those of us interested in this area of research what are the best papers and other resources for us to read? Has there been any success with deep approaches that do not have the WaveNet architecture?
Check out Char2Wav (recent) and SampleRNN (the RNN-based audio synthesis architecture). The related work section of the Deep Voice paper mentions a bunch of relevant papers as well!
This sounds cool. What would it take for me to build an Android app with this technology? Are there Android/Java libraries?
How much computing power does this take? When do you see open source implementations running on mobile devices offline?
We take several days (2-3) on 8 Titan X GPUs to train our models, which is quite a lot of compute. Running on mobile devices is quite challenging: the inference is not yet fast enough to support that, and it has only been optimized for x86 AVX2 CPUs. It may be possible with a fair amount of future work!
That's pretty quick. Most people take longer to learn how to read that well.
Interesting. They are not doing TTS as we are accustomed to; they are replicating a specific person's voice with TTS. Listen to the ground-truth recordings at the bottom and then the synthesized versions above. "Fake News" is about to get a lot more compelling when you can make anyone say anything as long as you have some previous recordings of their voice.
> you can make anyone say anything as long as you have some previous recordings of their voice.
That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.
The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good. The fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of the voice synthesis possible with their system given perfect inflection as input.
To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.
Thank you for clarifying this! We tried fairly hard to make this clear, because as you say, the hard part is generating inflection and duration that sound natural. There's still a ton of work left to do in this direction; we're clearly nowhere near being able to generate human-level speech.
Our work is meant to make working with TTS easier for deep learning researchers by describing a complete system that can be trained entirely from data, and to demonstrate that neural vocoder substitutes can actually be deployed to streaming production servers. Future work (both by us and hopefully other groups) will make further progress on inflection synthesis!
My "Fake News" comment aside, I think what y'all are doing could be transformational for many reasons. Imagine a scenario where a person loses a loved one, and similar technology is able to allow them to "have conversations" with the deceased as a form of healing and closure. Not to mention, this could add a personal touch to assistant bots that will make them a pleasure to use.
>The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing.
Gotcha, now I understand.
>> They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection.
Yes, but imagine being able to take the sound of one person's voice and the inflection of another. If you want to fake someone saying something, you don't need to do pure TTS; a human can be used to fake another person's inflection.
Based upon what little is posted there, I thought they were taking the original recording, then training the model on that recording against the text of the recording... reproducing the recording. I would think the next step is to sample enough audio and text to be able to produce new outputs entirely. In theory it should even be able to learn when/where/how to use inflection.
> "Fake News" is about to get a lot more compelling hen you can make anyone say anything as long as you have some previous recordings of their voice.
Adobe has already developed that technology:
https://arstechnica.co.uk/information-technology/2016/11/ado...
Now imagine combining it with this:
Face2Face: Real-time Face Capture and Reenactment of RGB Videos https://www.youtube.com/watch?v=ohmajJTcpNk
Perhaps using the intonation from the face-actor's voice to guide the speech synthesis.
I agree and I've upvoted you, but I feel it's worth pointing out that Adobe's claim about their own progress in this field was fake news.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=2m34s
"Wife" sounds exactly the same in both places. All they did was copy the exact waveform from one point to another. Nothing is being synthesized.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=3m54s
The word "Jordan" is not being synthesized. The speaker was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though it was synthesized on the fly. This is a scripted performance and Jordan is feigning surprise.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=4m40s
The phrase "three times" here was prerecorded.
This was a phony demonstration of a nonexistent product. Reporters parroted the claims and none questioned what they witnessed. Adobe falsely took credit and received endless free publicity for a breakthrough they had no hand in by staging this fake demo right on the heels of the genuine interest generated by Google WaveNet. I suppose they're hoping they'll have a real product ready by whatever deadline they've set for themselves.
To be clear, I like Adobe and I think it's a cunning move on their part.
Thanks for the detailed breakdown. The irony is not lost!
How soon before you make an API available? In other words, how do I make use of Deep Voice for my own applications?
Right now, we do not have plans to make an API available. This paper and blog post are mostly meant to describe our techniques to other deep learning researchers and spur innovation in the field. However, we hope that these techniques will be available eventually, and we'll provide more information when that happens.
In order to not miss this announcement, do you have a mailing list we could sign up for to notify us when this becomes available? You have a LOT of people interested.
That's great! When will the code/service be available to the public?
Semi-related to the Baidu speech research: http://chrislord.net/index.php/2017/02/23/machine-learning-s...
The work is done by Mozilla
does anyone know of good ways to do the opposite, speech to text?
Not really. I keep my eye on this area as I generally transcribe my podcasts. But compared to ~$1.50/minute for human transcriptions that require minimal touchup for even fairly tech-heavy topics, nothing I've seen that's purely ML/computer-based comes close to being worth my time to deal with.
Depends on how good you need it to be. Chrome supports the SpeechRecognition API.
https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...
i'm more thinking of ways to programmatically turn long audio files into indexable text.
Julius[1] can do this. But the accuracy depends on the language model you are using, and unfortunately the free English language model (VoxForge) is not the best.
I'm unaware of an ML-based solution, but GCE has an endpoint that _can_ do this, though it is better at short sentences.
Google Cloud has a speech API and supports 80 languages. There is a demo: https://cloud.google.com/speech/
Lex by AWS. It's the same deep learning tech as used by Alexa.
Does Lex actually translate speech to text for you? I was under the impression that it was for conversational bots.
Yeah, you are right. I just assumed it would fit any use case where speech to text is needed, which clearly is wrong.
Doesn't the Google API work for you? I thought it worked perfectly.
> "We conclude that the main barrier to progress towards natural TTS lies with duration and fundamental frequency prediction, and our systems have not meaningfully progressed past the state of the art in that regard."
Who is working on this problem, and how?
We're working on this. Here is a very early demo of Julian. Don't be surprised that he sounds like a teenager with a high-pitched voice, recorded in his bedroom; that's how the sample library was recorded. https://soundcloud.com/komponant/julian-speech-demo NB: the expressions (durations, F0) are manually adjusted, not predicted by a NN. We've built a fully flexible text-to-voice engine, not the brain that goes with it. But we're looking for people with experience in ML to work on this, so feel free to contact us.
I have A LOT of PDFs I'd much rather listen to than read. Can't wait for this!
I hacked a script on top of PDF.js to make it read the text by TTS while highlighting the words on page. I'm a big fan of having the computer speak to me.
Unless you are doing some manual work at the same time, like ironing or something else that requires very little mental focus, I can't really see why?
You'd better try getting EPUBs. Ivona Amy is currently the best I know; I use it a lot on my phone.
OK, that went from uncanny valley to flipping amazing. I could picture the person speaking. An old lady. A young woman. It was hard to picture an algorithm in a machine.
It's amazing that it all boils down to 1s and 0s and some Boolean logic.
You've misunderstood what you're listening to, I suggest reading the post again.
The recordings at the bottom are just recordings of an old lady and a young woman.
Yeah, I understood that. The ones in the middle are generated using their voices. You don't find that amazing?
I mean, it's sort of amazing, but it wasn't completely generated by machine. The sound clips in the middle were generated by copying the inflections from actual recordings, not by generating the inflections from scratch. The current system they have sounds like the robotic voices at the very top.
It's not TEXT to speech, it's speech to speech. I think it will be amazing when we have TTS of that quality.
Has anyone seen this yet? https://www.youtube.com/watch?v=XfcqBElF0ZI
So many innovations are happening with voice-related technology...
It says they trained on a 20-hour subset of a speech corpus. Will larger datasets influence the future of TTS?
How does this stack up against WaveNet?
It's in the abstract. "... For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original ..."[1]
Disclosure: I'm one of the co-authors of the QRNN paper (James Bradbury, Stephen Merity, Caiming Xiong, Richard Socher) produced by Salesforce Research.
There are many interesting advances that the Deep Voice paper and implementation make, but the part I'm excited by (and which might be transferable to other tasks that use RNNs) is showing that QRNNs are indeed generalizable to speech too, in this case replacing the transposed-convolution conditioning in a WaveNet-style model.
"WaveNet uses transposed convolutions for upsampling and conditioning. We find that our models perform better, train faster, and require fewer parameters if we instead first encode the inputs with a stack of bidirectional quasi-RNN (QRNN) layers (Bradbury et al., 2016) and then perform upsampling by repetition to the desired frequency."
QRNNs are a variant of recurrent neural networks. They're up to 16 times faster than even Nvidia's highly optimized cuDNN LSTM implementation and give comparable or better accuracy on many tasks. This is the first time they have been tried in speech; to see the authors note that the advantages hold across the board (better, faster, smaller) is brilliant!
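For anyone unfamiliar, here is a minimal NumPy sketch of the fo-pooling recurrence at the heart of a QRNN layer (a simplification of the paper, not the optimized CUDA kernel): the gates and candidates come from convolutions computed in parallel over the whole sequence, so the only sequential work left is elementwise.

    import numpy as np

    def qrnn_fo_pool(Z, F, O, c0=None):
        # Z, F, O: (T, hidden) candidate, forget-gate, and output-gate
        # activations from masked 1-D convolutions over the input (not shown).
        T, hidden = Z.shape
        c = np.zeros(hidden) if c0 is None else c0
        H = np.empty_like(Z)
        for t in range(T):
            c = F[t] * c + (1.0 - F[t]) * Z[t]   # gated running average
            H[t] = O[t] * c                      # hidden state exposed per step
        return H

    # Tiny usage example with random activations.
    T, hidden = 100, 64
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    Z = np.tanh(np.random.randn(T, hidden))
    F, O = sigmoid(np.random.randn(T, hidden)), sigmoid(np.random.randn(T, hidden))
    print(qrnn_fo_pool(Z, F, O).shape)           # (100, 64)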
If you're interested in technical details, our blog post[1] provides a broader overview and our paper is available for deeper detail[2].
[1]: https://metamind.io/research/new-neural-network-building-blo...
Very nice paper; one of my colleagues discovered it. I have been trying to understand the details, but I do not see how your stacked dilated layers are arranged. "d" is mentioned once, but no description is given.
it's awesome~
Great work!
If I understand this correctly, it's a pretty big achievement on the way to being able to replicate any person's voice in the future, given enough audio samples. Amazing. Similarly, I have seen lip movement (talking) replicated using machine learning. Having completely artificial (or even real) identities say whatever you want them to on video is not that far off, I guess (simpler than general AI or even fully self-driving cars), which is both amazing and terrifying.