Deep-learning text-to-speech tool for generating voices of various characters

283 points by clxxx 5 years ago · 87 comments

From the about section:

> How much does maintaining the servers cost?
>
> It depends on the amount of traffic, but the minimum baseline is around several thousands of US dollars every month. This is expected as inference is very GPU intensive and a sufficient number of instances need to be spun up to handle thousands of requests coming in every minute. Everything is paid out of pocket.

Wow, impressive commitment for something that's free.

calebkaiser 5 years ago

The price of GPU inference can be brutal, but there's a lot you can do on the infra side to improve it:
- Spot instances
- Aggressive autoscaling
- Micro batching
Together, these can reduce inference compute spend by huge amounts (90% is not uncommon). ML, especially anything involving realtime inference, is an area where effective platform engineering makes a ridiculous difference even in the earliest days.
Source: I help maintain open source ML infra for GPU inference and think about compute spend way too much https://github.com/cortexlabs/cortex
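
Of the three items above, micro batching is the least self-explanatory. A minimal sketch of the idea, assuming an asyncio-based server: requests that arrive within a short window are grouped into a single model call, so the GPU processes them together instead of one at a time. `MicroBatcher`, `fake_tts_batch`, and the window and batch-size values are hypothetical names and numbers chosen for illustration, not anything taken from Cortex or 15.ai.

```python
import asyncio
from typing import Callable, List


class MicroBatcher:
    """Collects requests for a short window, then runs them as one batch."""

    def __init__(self, predict_batch: Callable[[List[str]], List[str]],
                 max_batch_size: int = 8, max_wait_s: float = 0.01):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded only so the demo can inspect it

    async def submit(self, text: str) -> str:
        # Each caller gets a future that resolves when its batch has run.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request exists, then open the window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One model call for the whole batch.
            outputs = self.predict_batch([text for text, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


def fake_tts_batch(texts: List[str]) -> List[str]:
    # Hypothetical stand-in for one batched GPU forward pass.
    return [f"audio({t})" for t in texts]


async def demo():
    # Build the batcher inside the running loop so the queue binds to it.
    batcher = MicroBatcher(fake_tts_batch, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"line {i}") for i in range(6)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is a small amount of added latency (at most `max_wait_s`) in exchange for fewer, fuller model calls. A real server would also run the model call off the event loop and handle errors per batch.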
vsupalov 5 years ago

Yeah, running anything related to AI involves GPU instances. An alternative is to point people to using Google Colab where you can get access to a GPU for free, but that's not a smooth end user experience for most folks.
- aisofteng 5 years ago
  
  > running anything related to AI involves GPU instances
  This is not true. A _lot_ of AI applications use algorithms such as logistic regression or random forests and don’t need GPUs - partly, of course, because GPUs are so expensive and these approaches are good enough (or more than good enough) for many applications.
  - vsupalov 5 years ago
    
    Whoops, sloppy generalization on my part. You're completely right of course, thanks! I've been focusing on deep learning a lot lately, to the point where AI has become an alias for those exciting new GPU-heavy techniques.
Nican 5 years ago

Out of curiosity, as I have no visibility into the infra actually required: at that cost, would it not be easier to just have a machine under a desk somewhere?
- calebkaiser 5 years ago
  
  Not for the kind of inference running here, I'd imagine.
  There are a few key reasons why most realtime inference is done in the cloud:
  - Scale. Deep learning models tend to have poor latency, especially as they grow in size. As a result, you need to scale up replicas to meet demand at a way lower level of traffic than you do for a normal web app. At one point, AI Dungeon needed over 700 servers to support just thousands of concurrent players.
  - Cost. Related to the above, GPUs are really expensive to buy. A g4dn.xlarge instance (the most popular AWS EC2 instance for GPU inference) is $0.526/hour on demand. To hit $3,000 per month in spend, you'd need to be running ~8 of them 24/7. Prices vary when purchasing GPUs, but you could expect 8 NVIDIA T4s to run around $20,000 at minimum, plus the cost of other components and maintenance. To be clear, that's very conservative: it's unlikely you'll get consistent traffic. What's more likely is you'll have some periods of very little traffic where you need one or two GPUs, and other high load periods where you'll need 10+.
  - Chip access. Less universal of an issue, but the cloud gives you much better access to chips at lower switching costs. If NVIDIA releases a new GPU that's even better for inference, switching to it (once it's available on your cloud) will be a tweak in your YAML. If you ever switch to ASICs like AWS's Inferentia or GCP's TPUs, which in many cases give way better performance and economics than GPUs, you'll also naturally have to be on their cloud.
  However, there is a lot that can be done to lower the cost of inference even in the cloud. I listed some things in a comment higher up, but basically, there are some assumptions you can make with inference that allow you to optimize pretty hard on instance price and autoscaling behavior.
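
The arithmetic in the comment above can be checked with a short sketch. The hourly rate, the $3,000 budget, and the $20,000 hardware figure are the ones quoted in the comment; the 730-hour month and the traffic profile (70% of the month on 2 GPUs, 30% on 10) are assumptions added here for illustration only.

```python
HOURS_PER_MONTH = 730           # assumption: average month, 8760 h / 12
G4DN_XLARGE_ON_DEMAND = 0.526   # USD per hour, as quoted in the comment


def monthly_cost(instances: float, hourly_rate: float = G4DN_XLARGE_ON_DEMAND) -> float:
    """Cost of running `instances` machines 24/7 for one month."""
    return instances * hourly_rate * HOURS_PER_MONTH


def instances_for_budget(budget: float, hourly_rate: float = G4DN_XLARGE_ON_DEMAND) -> float:
    """How many always-on instances a monthly budget pays for."""
    return budget / (hourly_rate * HOURS_PER_MONTH)


def autoscaled_cost(profile, hourly_rate: float = G4DN_XLARGE_ON_DEMAND) -> float:
    """profile: list of (fraction_of_month, instance_count) pairs."""
    average_instances = sum(fraction * count for fraction, count in profile)
    return monthly_cost(average_instances, hourly_rate)


if __name__ == "__main__":
    print(round(instances_for_budget(3000), 1))    # 7.8, i.e. "~8 of them"
    print(round(monthly_cost(8), 2))               # 3071.84

    # Months of cloud spend that equal the quoted $20,000 for 8 T4s,
    # ignoring other components, power, and maintenance.
    print(round(20000 / monthly_cost(8), 1))       # 6.5

    # Hypothetical bursty traffic: provisioning for the peak vs. autoscaling.
    bursty = [(0.7, 2), (0.3, 10)]
    print(round(monthly_cost(10), 2))              # 3839.8
    print(round(autoscaled_cost(bursty), 2))       # 1689.51
```

Under these assumptions, owned hardware sized for the average load falls short during peaks, while hardware sized for the peak sits idle most of the month; autoscaling pays only for the average.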
mickof 5 years ago

You just sort of assume that this is correct? The person[1] running this comes across as a severely unstable character, that number is probably hyperbole.
[1] https://twitter.com/fifteenai
- 15ai 5 years ago
  
  Not a hyperbole – I can provide proof if you'd like.
  - nmfisher 5 years ago
    
    Separate question - is this English only? It looks like you can feed in phonemes but I assume this has been trained with English audio.
  - skavi 5 years ago
    
    Would you be willing to explain how you can justify offering this for free? I’ve subbed to the patreon, but that’s less than a drop in the bucket compared to the ~$10k you say this month will cost.
- nmfisher 5 years ago
  
  I’ve worked with deep learning models enough to know the cost of running GPU inference, and if the live queue stats published on the website are accurate, then thousands of dollars per month is certainly plausible.
  I have no reason to disbelieve it.
- hooloovoo_zoo 5 years ago
  
  It seems like one could get to those numbers pretty easily given the prices for GPU instances on AWS. Even just one decent-sized instance would be thousands of dollars per month.

danShumway 5 years ago

I don't usually expect much from demos like this, but I'm kind of surprised how impressive the results currently are. They're definitely not perfect, you're definitely getting some odd clipping and noise, but this shows a large amount of promise.

Being able to generate voices for games would enable a lot of interesting indie projects. IMO people should be paying more attention to the market implications of products like this than to the social implications. There are a lot of projects that just aren't really feasible right now that could be if this kind of technology were more polished and generally available for commercial/self-hosted use. And in those cases, you don't even need to do inference, makers will likely be willing to mark up their scripts themselves.

Anyway I digress. Congrats, this is really cool!

Pfhreak 5 years ago

> people should be paying more attention to the market implications of products like this than to the social implications.
People will absolutely suffer harm from this tech, but hey, think about the dollars that could be made! No, we should absolutely be paying more attention to the social implications.
- danShumway 5 years ago
  
  Eh, this technology currently falls very squarely into the category of "almost good enough that I could use it for a creative project, but not nearly good enough that you're going to be able to convince me that the results aren't generated."
  I'm not primarily interested in the dollars, I'm interested in allowing communities to do creative things. I think people are looking at this tech like it's only going to be used for deepfakes, and they're underestimating the extent it's going to be used to create voice-acted game mods, animations, anonymization tools, and other creative/helpful projects.
  If you're really worried about this stuff though, you can take some comfort in the fact that by far the worst examples on the site are of real-world voices. This is currently technology that as far as I can see is far more suited for generating new voices or voicing cartoon characters with well-defined patterns/inflections than it is for imitating the president.
  - bawolff 5 years ago
    
    It really doesn't have to be perfect to trick someone. You're expecting this site to be fake so you're listening carefully. If you weren't expecting anything and you were in the middle of a busy day at work, you are much much less likely to notice any discrepancies.
    We already have stories like https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice...
    That said, as far as harms go, I don't think this is so bad that it should preclude creative uses of this technology.
  - Pfhreak 5 years ago
    
    You are looking at the current implementation and not thinking about the implication.
    One, this tech absolutely could be used to fool someone. Not everyone will be listening with a critical ear. Played back over a phone or injecting a phrase or two in otherwise spoken samples will fool many people.
    I guarantee you someone will be using this to make their own MLP episodes on YouTube specifically designed to scare children or get them to do awful things.
    Models presumably get better over time. It really won't be too much longer until people will be able to fake celebrities, politicians, exes, authority figures, etc. As a fairly benign example, if I had this in high school you better believe I could have called to excuse some of my absences.
    I agree, I love the idea of generating some decent voice lines for my own games projects, but this also introduces issues of the rights of the original voice actors.
    If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic? (Also, it draws parallels to the Luddites who were not anti technology, but wanted to ensure that technology wasn't used in a way that reduced worker quality of life.)
    And yes, I think there are helpful ways this could be deployed. I'm gender fluid, and I'd love to be able to adjust my voice digitally, but we need to be thinking about how this could cause harm first.
    
    danShumway 5 years ago
    
    > One, this tech absolutely could be used to fool someone.
    The problem I have here is that it's already not hard to fool people. I don't think it's feasible for us to say that we're going to put something that could be highly beneficial on hold just because we don't want to deal with social education efforts that we kind of already need to tackle anyway. Per your example, if we get rid of deepfakes, it's not clear to me that Youtube is going to be any more safe. I already would not allow a child to browse Youtube unattended, people already generate the videos you're talking about.
    And I know that people are putting this in a different category than general CGI, voice modulation, or consumer-grade apps like Photoshop. I'm not going to argue that it's necessarily wrong for people to be worried, but no matter how many times people tell me that this is fundamentally different, I still have not seen any serious evidence that this technology is going to be more dangerous than Photoshop, and I think it's going to be way easier to detect than a decent Photoshop job is. Photoshop's content-aware paste/fill tools are better than this example, and they arguably require less work to use.
    And again... I'm sympathetic to concerns about moving too fast, but I just don't think there's any world, even if you could get rid of deepfakes entirely, where we don't need to be worried about media literacy and general skepticism. If people today don't realize that voices can already be convincingly faked, then that's a really serious problem, and if democratizing that ability causes society in general to become more aware of the potential of disinformation, then honestly that might even be a good thing that we should be encouraging.
    So sure, concerns, but in my mind people are focusing on one particular implication that I don't think is particularly likely, and ignoring that responding to that concern is probably going to look the same no matter what our position on deepfakes is.
    > If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic?
    I think that's a very complicated question. I would not assume that the loss of work for voice actors, who can shift into voice generation roles, is going to be a big enough downside that it overrules the upside of allowing ordinary people to start generating their own vtube avatars or commenting on and building on top of existing culture.
    
    Ajedi32 5 years ago
    
    > If people today don't realize that voices can already be convincingly faked, then that's a really serious problem, and if democratizing that ability causes society in general to become more aware of the potential of disinformation, then honestly that might even be a good thing that we should be encouraging.
    I've wondered about that angle as well. You can't put the genie back in the bottle, so maybe the best way to combat the threat of deepfaked misinformation is actually to take the opposite approach and make it as easy as possible for normal people to generate their own deepfakes; that way it becomes common knowledge that such things are possible (similar to how photoshop is common knowledge today).
    
    Erlich_Bachman 5 years ago
    
    > If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic?
    And if you have to keep getting a person paid for something that a machine could do with (assuming, as per your post) 100% equal performance, that is not problematic? When the voice becomes as good as real actors, then yes of course they should become out of a job. Just like progress has been going on for thousands of years.
    
    visarga 5 years ago
    
    I am thinking it could be used to impersonate someone in a phone call to a family member for conning.
  - significant5 5 years ago
    
    I might be misunderstanding you, but there are no real-world voices on the site? All of them are of characters.
    
    danShumway 5 years ago
    
    I see a pretty linear drop in quality from Glados to Spongebob to Twilight Sparkle to the narrator from Stanley Parable to the 10th Doctor.
    It seems to struggle more and more as the voices get less cartoony/exaggerated.
    
    significant5 5 years ago
    
    I'm not too sure about that. From my testing, Fluttershy, Applejack, Twilight, Chrysalis, Rise, and Kyu (and a bunch of other characters that I'm surely forgetting) seem to perform phenomenally well. Especially Chrysalis, her emotions are extremely believable, and Fluttershy/Applejack/Rise/Kyu have almost zero noise for every generation. This might be the most impressive site I've ever seen.
    Oh, I somehow forgot all of the TF2 characters. Some of them do struggle (Medic the most, I think) but everyone else seems incredibly good.
    And the Daria characters, too. Honestly, the vast majority of characters are already near-perfect.
    
    danShumway 5 years ago
    
    Hrm. Well, I can't really argue with that beyond that my standards on perfect might be different.
    I think some of the best voices they have are characters like Twilight, she shows a ton of promise. But as it stands right now, I would still at least hesitate to use Twilight's voice in a project unless I didn't have other options. Chrysalis's voice is good, but again, is an exaggerated cartoon character with a large amount of inflection. I would not use her voice in her current state without a lot of post-processing. Someone like the Spy I would consider to be unusable, it sounds to me like the character needs to clear their throat or something, it's got a lot of strange artifacts. I definitely would consider the 10th Doctor unusable, even for just a hobby project or a voice assistant.
    But... I don't know, maybe this is subjective. I can't just tell you that what you're hearing is wrong, if you like the results then you like the results :)
    And again, I don't want to detract from how impressive they are. They are incredibly impressive, particularly because of how characters like Chrysalis emote. Extremely promising. But I still think there's a difference between 'impressive' and 'believable deepfake'.
    
    significant5 5 years ago
    
    Yeah, that's fair. I dunno, I can't really hear anything wrong with Fluttershy or Applejack no matter how hard I try, but your ears are probably much better than mine :p
    I've been seeing quite a few skits being posted on /r/tf2 (https://www.reddit.com/r/tf2/comments/kr374q/honestly_idk_i_...) and all of the voices sound pretty much perfect to me. But as you said, it's subjective.
- C19is20 5 years ago
  
  Musicians Union?
Ajedi32 5 years ago

I wonder if there are any legal concerns with using the voices of well known characters/actors like this in a commercial context.
- danShumway 5 years ago
  
  I don't think a voice can be copyrighted, but IANAL so you shouldn't bank on that.
  If a voice could be copyrighted, or if this was a trademark issue or something, I strongly suspect that this site would not fall under fair use regardless of whether or not it was commercial. But again, IANAL, so I don't feel confident making any kind of strong claim about that either.
  - dragonwriter 5 years ago
    
    > I don't think a voice can be copyrighted, but IANAL so you shouldn't bank on that.
    The audio content (which includes voices) of the source work is copyrighted, and a mechanical transform of that work (which deep learning to mimic the voices clearly is) would seem to be a derivative in at least the literal sense.
    
    thrill 5 years ago
    
    IANAL and I would say no. Anyone is free to imitate anyone else. A machine doesn't make that different. It would be a violation to claim you were someone else while doing the imitation.

vsupalov 5 years ago

The results are really impressive. At the moment I'm considering spending a low 3-figure amount for a professionally spoken intro for a new podcast. Some of the lines I generated are in my top 5 easily, human speakers don't have a lot of edge for short generic blurbs of text anymore it seems.

kebman 5 years ago

Pretty cool! I tried it with this small dialogue, and then edited together two voices in Reaper from the downloads:

Bob: “Hello, John.”

John: “Oh, hello there, Bob.”

Bob: “Yes, hello. It's what I said. Why do you keep repeating what I say, John?”

John: “I didn't repeat you! I merely said hello, you dimwit!”

Bob: “There you go, being condescending again. Fuck you!”

John: “What? You're the one who started it!”

Try it yourself, or write something different. Either way, good fun!

demonictoaster 5 years ago

The security implications of this kind of tech are scary. Going forward it will become really easy to reproduce the voice of anyone! It seems not a lot of training data is required to achieve reasonable results (e.g. SpongeBob is just 27 min of voice, Half-Life Black Mesa Announcer is just 1.9 min!!). This stuff could be easily leveraged for scams and deep fakes (along with deep learning models that could also tweak lip movements to match the voice for example). Thankfully, there is also a very active area of research that leverages similar tech to detect deep fakes.

dschooh 5 years ago

These kinds of discussions are common with articles about deep fake video and audio. While I do not disagree with your point, here are two quick thoughts:
- We have had perfect image manipulation capabilities for quite some time now. We have had written text manipulation capabilities for hundreds of years.
- People will continue to believe what they believe, whether there is deep fake video and audio or not.
- demonictoaster 5 years ago
  
  Agree with you. Hopefully people are more and more aware that they cannot trust anything out there. We are soon reaching a point where we can make anyone say anything we want, including in audio and video format.
spyder 5 years ago

It's already happening:
A Voice Deepfake Was Used To Scam A CEO Out Of $243,000:
https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice...

Baeocystin 5 years ago

The fact that you included Chell as a voice choice (and 'generated' a null audio clip to boot) earns a chuckle. The quality of the voices across the board earns wide eyes and an eyebrow raise. Thanks for sharing this, it's remarkable work.

high_byte 5 years ago

GLaDOS hahahaha this is just... perfect. Stanley Parable Narrator funny you should mention this.

SommaRaikkonen 5 years ago

Welp, after messing around with a few voices I was completely impressed with Glados's. This is really cool because I have no idea how the character's voice was synthesized, but apparently ML can do it for me so props to that.

smrq 5 years ago

I'm pretty sure the real Glados voice effect is mostly pitch correction and formant shifting. You can do it with Melodyne at least (which, to be fair, is also computer magic-- just a different kind than this one!)
I just found a video on YT with an example of recreating this in Melodyne: https://youtu.be/1oQn66gvwKA
giantrobot 5 years ago

GLaDOS was voiced by a real person [0]. Her voice had some effects added but mostly just her trying to sound like a computer.
[0] http://ellenmclain.net/
aksss 5 years ago

My favorite is Carl Butananadilewski, but I just ended up making him say actual phrases from ATHF in the end. Was hoping to see Meatwad as a character option.

SV_BubbleTime 5 years ago

Is the author being cute putting Chell from Portal and Freeman from Half-life in there, and then there is no audio? It would be a weird oversight if not intentional because the author is clearly familiar with Valve games.

twangist 5 years ago

I get nothing but "Error code 422: Server error", even on input "Hello", in FF, Safari and Chrome.

durdn 5 years ago

You may need to choose a "Source" in the top left. I got the same error before choosing a character.
terrycody 5 years ago

and now it gives 400 error

mensetmanusman 5 years ago

As Alexa and Siri have improved over the last couple years and gotten a more human voice, it has been interesting observing my young children (1-4) interact with such devices.

There is definitely a sense of ‘who is that’ coming from their little minds that they are sometimes quite perplexed about. ‘It’s a computer’ is starting to feel like a cop-out answer as these things improve...

st1x7 5 years ago

You should really see what happens when you click reject on their terms and conditions prompt.

rkagerer 5 years ago

One of the voice actors is John de Lancie!

https://soundcloud.com/user-860705643/q-pandemic-rant-no-mus...

code51 5 years ago

I'm fearing this will end up with a massive debt on their part.

Roritharr 5 years ago

I'm pretty happy with the results I get. I've toyed around with a similar goal, but with the idea of approaching voice actors to give them a powerful tool to sell a "low quality" version of their voice in bulk. That way an up and coming author could use a tool like this and some elbow grease to create an Audiobook with famous voices.

superasn 5 years ago

Really impressive. Do you plan to implement an API like Amazon's or Google's that lets you generate TTS for a price?

wongarsu 5 years ago

I too think that this has potential as a cloud TTS service. However that does open up all the moral and legal cans of worms around this. I could imagine some of the voice actresses not being very happy about somebody else commercializing their voice without their consent.
The obvious way to get around this is to keep this as the showcase and to pay some people to add their voices to the paid version. I imagine this would sell just based on being decent TTS with a wide range of voices, even when people don't know the voices offered.

hmate9 5 years ago

It’s incredible how little data is required for amazing output! Only a couple of minutes of talking needed.

You can find a couple of minutes of talking of anyone, so the security implications are huge!

EugeneOZ 5 years ago

Please give me a hint how to control the speed - Portal: Wheatley is too fast for me.

Amazing toy! Thanks for "download" link, I'm creating a collection of GlaDOS phrases now.

atum47 5 years ago

While you're typing a word, the text box doesn't show it; only when you complete the word does it show up in the text box. Brave, Android.

Besides that, amazing results. Congratulations.

junon 5 years ago

I'm rarely impressed by demos like this. This is a clear exception.

Not only that, but the creator seems cool and down to earth. Thanks for sharing, this is incredible work.

suyash 5 years ago

Fun, but what are the legal implications of using these voices for projects? Does the license cover the use of these voices?

trowngon 5 years ago

Are there open source projects like this?

CookieAnon 5 years ago

I have CookieTTS where I research lots of experimental stuff. (You can see my credits in the 'Thanks' section of 15.ai)
I can get about 90% of the quality of 15.ai currently. I think I could surpass 15.ai but not without some help.
nmstoker 5 years ago

There's Mozilla TTS https://github.com/mozilla/TTS
Here's a sample from a TTS model + vocoder I released for it. I've no wish to deter the motivated, but it'd take a bit of figuring out how to set things up and you'd need to read the docs and code to get oriented :)
https://m.soundcloud.com/user-726556259/sherlock-wavegrad-sa...
Links to the models are here: https://discourse.mozilla.org/t/creating-a-github-page-for-h...
It was originally trained on two novels read by the same narrator on LibriVox (i.e. in the public domain).
- nmfisher 5 years ago
  
  This is actually quite impressive too, significantly better than the last time I looked into Mozilla TTS. Roughly how much audio does "two novels" equate to?
  - nmstoker 5 years ago
    
    Here's another sample with the same model+vocoder, this time reading from a Wikipedia article: https://m.soundcloud.com/user-726556259/q-learning-wavegrad-...
  - nmstoker 5 years ago
    
    It's about 32 hours of audio.
    As some of the audio is read in different accents to the main accent used, ideally the different accent audio would have been removed. Doing so would be expected to help with voice quality, reducing the overall amount used and, as a bonus, cutting training time too.
- terrycody 5 years ago
  
  Is there a simple interface for Mozilla TTS yet, like the example in this thread, that a non-developer could use? I can't find one...
  - nmstoker 5 years ago
    
    There's the demo server which has a simple web UI where you can input text to be spoken, but in regards to setting it up locally it's not that suited for a non developer
    https://github.com/mozilla/TTS/tree/master/TTS/server
    https://github.com/mozilla/TTS/wiki/Build-instructions-for-s...
    There's also a version in docker: https://github.com/synesthesiam/docker-mozillatts
    And various Colabs too, which are fairly easy to get going with: https://github.com/mozilla/TTS/wiki/TTS-Notebooks-and-Tutori...

dnsiseuzb 5 years ago

How does this compare to wellsaid labs?

bravura 5 years ago

15ai, do you mind talking a bit about the methods you are using?

MartinoPalmitos 5 years ago

Half-Life's Gordon Freeman voice is really spot-on!

mvts 5 years ago

Nice work on the Gordon Freeman Voice :D

pure-struggle 5 years ago

will this be open source eventually?

pure-struggle 5 years ago

https://twitter.com/fifteenai/status/1342304487474606081
found an answer.
"There's no point in releasing a poorly done model, and to do so for the sake of popularity would be despicable. My goal is to achieve indistinguishability, which I certainly know is possible. Anything short of near-perfection is unacceptable. "
- 15ai 5 years ago
  
  I'm afraid this tweet is taken out of context. I had written this in response to complaints about the release date being delayed because I wanted to make sure that the released model (that is currently on the site) was the best it could be.
  I do plan to compile and publish my findings in the future, but nothing is set in stone yet. I know that the model can be improved even further, and I'd prefer to be as comprehensive as possible.
- whatshisface 5 years ago
  
  Releasing a poorly done intermediate result would give either competitors or colleagues a leg up in the race, depending on whether one sees them as competitors or colleagues.
- scrollaway 5 years ago
  
  Megalomania, always a great excuse.
  AI and ML users are massively benefiting from open source but too often refuse to release their data. It's like we're back in the middle ages and alchemy is back in style.
  - hooloovoo_zoo 5 years ago
    
    Judging by how the model and site are put together, I think this is some software engineer's hobby project. Not wanting to spill their secrets doesn't make them a megalomaniac for the same reason being a magician doesn't make one a megalomaniac.
    
    scrollaway 5 years ago
    
    Except magicians do actually share their secrets; there is an active trade around it, conferences, discussions and lots of reading material available. The barrier of entry is higher than any old open source project but it's not inaccessible and comparable to alchemy.
    I was talking about ML in general, not just this project. See OpenAI and their latest release for example: no public product, no trained model. Just alchemy.

uberman 5 years ago

This was amazing!

centimeter 5 years ago

This is extremely impressive.

I wonder if this will lead to a resurgence of "moon man" style videos with well-known characters rapping extremely offensive lyrics.

Meph504 5 years ago

seriously fuck anyone that is putting in forced time delays on their terms, how about you let me read what it is you are doing before requiring shit like this.

duckmysick 5 years ago

If you don't agree with the terms, including how they are presented to you, you can always reject them and leave the site.
