PlayAI's new Dialog model achieves 3:1 preference in human evals
play.ht
I tried it with a paragraph of English taken from a formal speech, and it sounded quite good. I would not have been able to distinguish it from a skilled human narrator.
But then I tried a paragraph of Japanese text, also from a formal speech, with the language set to Japanese and the narrator set to Yumiko Narrative. The result was a weird mixture of Korean, Chinese, and Japanese readings for the kanji and kana, all with a Korean accent, and numbers read in English with an American accent. I regenerated the output twice, and the results were similar. Completely unusable.
I tried the same paragraph on ElevenLabs. The output was all in Japanese and had natural intonation, but there were two or three misreadings per sentence that would render it unusable for any practical purpose. Examples: 私の生の声 was read as watashi no koe no koe when it should have been watashi no nama no koe. 公開形式 was read as kōkai keiji instead of kōkai keishiki. Neither kanji misreading would be correct in any context. Even weirder, the year 2020 was read as 2021. Such misreadings would confuse and mislead any listeners.
I know that Japanese text-to-speech is especially challenging because kanji can often be read many different ways depending on the context, the specific referent, and other factors. But based on these tests, neither PlayAI nor ElevenLabs should be offering Japanese TTS services commercially yet.
So kind of unrelated, but the reading/singing of arbitrary custom lyrics on suno.com's v4 model has blown me away.
Suno is uncomfortably good. I run a group for helping founders, and sometimes I make little Suno songs to accompany the classes for fun; I'm always impressed by what it spits out. (prompt: song for founder who have happy ears bringing them tears > 30 seconds gen >) https://s.h4x.club/p9u4ezl2 / https://s.h4x.club/mXuND7Eb / https://s.h4x.club/L1u2DYzW
Suno songs always have way too much treble or reverb, or something I can't quite put my finger on. They're very bright sounding.
I don't think it's a fatal flaw, but I hope future versions improve on this, or Suno starts doing some more post-processing to address it. I know there's a new "remaster" feature, but I'm not sure if it does anything there either.
Yeah, they're way too wide and not muddy enough; if you're going to be as wide as they often are, you need to fill it, or else they always just sound overproduced and weirdly bright. I was thinking earlier about why they sound really good but not actually good, and then I realized that's how I feel about most modern pop music anyway. I think the main thing you're hearing, or at least the thing I find annoying, is in how the AI does harmony: if you listen closely, it seems to almost clone the original vocal line and pitch it up and out so it feels slightly offset, giving the appearance of a second vocalist. It's a trick I do in Ableton to hear what things might sound like built differently with vocals, and it feels very much like the Suno fake backing singers. I do think they're about six months away from nailing a lot of this given how quickly they've been moving; I follow them closely and it's been impressive. (You can also pull meatier stuff out of it if you work it a bit: https://s.h4x.club/04uz6klg - I don't think this sounds particularly "AI" at all. Edit: turns out if you play up chamber orchestra and choir in the prompting you can get some much better stuff out of it: https://s.h4x.club/eDubr9xJ)
AI generated music is like AI Art.
It feels really generic. But to be fair a lot of art is just like that.
How many animated shows use the lazy, lowest-common-denominator Family Guy style? Storylines written to be easy to follow and mundane.
Ask ChatGPT to write a complex story about divorce and trauma. It'll either refuse to do it or come up with a Hallmark ending.
I mean yes and no. If you just let it generate lazily, then yes. However, if you work on the lyrics and generate a bunch of samples... no, it can be very powerful and artistic.
Alternatively, text that is input to these services should be passed through a normalization process, e.g. use LLaMA to convert kanji to hiragana or a romanization. The TTS output is then much better.
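For illustration, here's a minimal sketch of that normalization idea using pykakasi, a dictionary-based converter, as a stand-in for the LLM step (assumes pykakasi is installed; an LLM would do the same job with a prompt instead, and the input sentence here is just made up from the kanji discussed above):

    import pykakasi

    kks = pykakasi.kakasi()
    text = "私の生の声を公開形式で届けたい"
    # Convert each segment to hiragana and Hepburn romanization before TTS
    for item in kks.convert(text):
        print(item["orig"], "->", item["hira"], "/", item["hepburn"])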
Unfortunately, a simple normalization of kanji --> hiragana throws away pronunciation information.
You could just as easily use the LLM to convert the kanji into phonemes.
You can't afford to lose word boundaries, and phonemes don't tell you which part of the word is emphasized.
Modern TTS engines use tokenizers to convert words to phonemes. See: https://github.com/FunAudioLLM/CosyVoice/issues/202
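For Japanese specifically, a grapheme-to-phoneme front end looks roughly like the sketch below; pyopenjtalk is my assumption for illustration here, not necessarily what CosyVoice uses:

    import pyopenjtalk

    # Grapheme-to-phoneme: returns a space-separated phoneme string
    print(pyopenjtalk.g2p("私の生の声"))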
Yeah, there are no good options for Japanese yet (except maybe from within Japan, but I haven't heard of good local AI models for speech).
Anecdotally, gpt-sovits is quite good at Japanese; I can't evaluate it firsthand as my Japanese is trash.
For some reason, most of these (and other narration AIs) sound like someone reading off a teleprompter, rather than natural speaking voices. I'm not sure what exactly it is, but I'm left feeling like the speaker isn't really sure of what the next words are, and the stresses between the words are all over the place. It's like the emphasis over a sentence doesn't really match how humans sound.
Yup, and that's going to be the case until AIs can really model human psychology.
Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it.
If you try to reproduce all the normal speech prosody, it'll be all over the place and SoUnD bIzArRe and won't make any sense, and be incredibly distracting, because there's no coherent psychology behind it.
So "reading off a teleprompter" is really the best we can do for now -- not necessary affectless, but with a kind of "constant affect", that varies with grammatical structures and other language patterns, but no real human psychology.
It's a gigantic difference from text, which encodes vastly less information.
(And this is one of the reasons I don't see AI replacing actors for a looong time, not even voice actors. You can map a voice onto someone else's voice preserving their prosody, but you still need a skilled human being producing the prosody in the first place.)
What if you have it read the script, then say, “hey, at this point, what is the character feeling? What are they trying to accomplish? What is their relationship to each person in the scene?”
And then you get that and prompt the model to add inflection and pacing and whatever to the text to reflect that. You feed that into the speech model.
It seems like it could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; the third part about speech seems doable already based on another comment.
So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
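A rough sketch of that pipeline, assuming the OpenAI Python client for the annotation steps and a hypothetical synthesize() helper standing in for whatever speech model you'd use (the line and scene are made-up examples):

    from openai import OpenAI

    client = OpenAI()
    line = "I told you I'd be home by six."
    scene = "Parent arriving late to an empty kitchen; the family has already left."

    # Step 1: ask the model what the character is feeling at this point
    feeling = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Scene: {scene}\nLine: {line}\n"
            "In one sentence, what is this character feeling and trying to accomplish?"}],
    ).choices[0].message.content

    # Step 2: have it mark up the line with pacing/emphasis cues for the TTS model
    marked_up = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Given that {feeling}, rewrite this line with pauses and emphasis cues "
            f"suitable for a TTS engine: {line}"}],
    ).choices[0].message.content

    # Step 3: feed the annotated line to the speech model (hypothetical helper)
    # audio = synthesize(marked_up, voice="narrator")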
> What if you have it read the script, then say, “hey, at this point, what is the character feeling?...
Sure, but now how do you make sure all the answers to those questions are consistent? Across clauses, sentences, paragraphs? To do that, you need to have an entire understanding of human psychology.
And I haven't seen any evidence that LLMs possess that kind of knowledge at all, except at the most rudimentary level of narrative.
Just think of how even professional directors struggle to communicate to an actor the emotional and psychological feeling they're looking for. We don't even have words or labels for most of these things, and we say "you know how you feel in a situation when <a> and <b> but <c>? You know that thing? No, not that, but when <d>. Yeah, that." Most of these things operate on an intuitive, pre-verbal level of thinking in our brain. I don't think LLMs are anywhere close to being able to capture that stuff yet.
You have to shape the voices in the tools. If you just spit them out they're junk, but if you take the time to shape the voice a bit it gets better quickly. This is a cheap 11labs voice with 30 seconds spent on some basic shaping: https://s.h4x.club/bLuNlJWx
Still a bit teleprompter-ish, but there are tools to go in and adjust pace and style throughout, and you probably hear a lot of stuff from people not using those creative features. 11labs might very well be one of the best bits of software I've used; it's a great deal of fun to play with, and if you're willing to spend the time the results are superb. I don't even have a use case, I just like making them because they're fun to listen to, ha!
What does shaping the voice mean?
You---can, really--slow, speed up or change, how, things sound, by, -- using cues like this, to control how the voice,,, - tells the story {{3sec}} - once you find a voice you like, you can go in and {{1sec}}
//
control how,
it goes about {{1sec}} story telling.
PlayHT's voices are nowhere near as good as ElevenLabs'. These self-reported studies are marketing.
In any case, voice is such a thin vertical that I half expect a Chinese lab to release an open-source TTS model that outperforms everything on the market. Tencent probably has one cooking right now.
Oof. I've heard recent AI generated narrators, and they were OK (much better than a few years ago, much worse than professional humans), but something about the digital postprocessing in this article's youtube video reminded me of fingernails on a chalkboard.
I couldn't get halfway through.
This is an excellent point: there's some sort of oddity where it's hard to say it's definitively AI, but I can definitively say it's a... low-quality human?
I really like "off a teleprompter", it accurately characterizes the subtle dissonances where it sounds like someone who is reading something they haven't read before. 0:14 "infectious (flat) beatsss (drawn out), which is near diametrically opposed to the paired snappy 0:12 "soulful (high / low) vocals (high)."
Most of these models are trained on audiobooks, which could explain the teleprompter feeling vs a natural conversational feeling.
I love the tech. I hate that it gets used to fill YouTube with zero-effort slop. I don't have a solution.
I realize it’s hard to face, but it’s ok to admit that (cool as the tech is) some things just have a net negative effect on the world. It’s just engineers and data scientists using their enormous talents to make the world a worse place instead of a better one.
It’s not going to happen, but the only solution is to just stop developing it.
The cat's out of the bag, if someone stops then someone else would start.
I would entreat people to consider the net effect of anything they create. Let it at least sway your decisions somewhat. It probably won't be enough to not do it, but I think of it more as the ratio between net positive :: net negative, and paying attention to that ratio should help swing it at least somewhat -- certainly more than giving up and ignoring the benefits :: harms.
The whole idea of developing AGI (even if LLMs are probably the wrong approach) is so strange when you think about it.
The smartest people in the world are working very hard in order to make themselves completely redundant and cheaply replaceable. If they succeed, they will turn their main skill and defining characteristic into a meaningless curiosity.
And life will not be better: the manual work of today will still need to be done, and robots are not up to the task. Even with better programming for cheap, the hardware (I mean even simply the metal) is too expensive compared to human hands.
Do you mean you think we'd all be superfluous for any thinking jobs, and would only be useful for manual labour?
Is there a chance this doom scenario is wrong and we end up with something approaching the idealists' benign AGI post-scarcity world?
The trend is that once ML models are able to compete with humans on a task, they perform it much faster and more cheaply (though very often with a much higher error rate, for now).
IMO thinking will still be useful to some extent, because the bottleneck will be getting instructions from the models to the people who carry them out. So if the model has to spend less time communicating with you (limited by human cognition, not by the model), you will be more productive.
All work that can be done remotely can also be done by AI. Currently there are limitations like you having more context from communicating with your colleagues.
I suspect at some point, companies will try to mandate that all work-related communication (even face to face) has to be made accessible to AI. They will no doubt try to go two steps forward and then back off one step as a compromise so it will be a local model in the name of privacy or something like that.
At that point, AI will have the same work-related knowledge you have. In fact it will be able to cross reference knowledge from all workers in the company. So why would you work from home if it can do everything you could but faster and cheaper?
What is your conclusion? Will it be a benign benefit, bringing us closer to post-scarcity, or will we be doomed to a dystopian existence? Can you elaborate on how you see it?
Post-scarcity can work if everyone can say, "Here are my needs; they are now fulfilled; I have a happy and satisfying life." The issue is that some people have a deep need to have more than others, so they will always strive for more, including power. AI, if it's ever created, will be just another tool they'll use to get it.
One issue is that having a good generative model is unavoidable as a component of lots of good, useful tasks, like translation, transcription, etc.
But then, of course, if you have a generative model, you can use it to generate stuff.
I wanted to use this to do temp voices for a video game project. Not realtime, just creating the audio at build time, basically. However, the pricing model is not conducive to that because you cannot pay per use, and on top of that none of the lower-cost plans support more than 10 requests per minute, so it's difficult to use for batch operations. $299/mo seemed steep for my use case of infrequent bursts, and they couldn't help me with a custom plan, so I have ended up just using Azure AI Text-to-Speech. (Which is also much faster to render.)
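For anyone considering the same route, a minimal build-time batch sketch with the Azure Speech SDK looks something like this (the voice name, keys, and line names are placeholders):

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # Render each game line to its own wav file at build time
    lines = {"intro_01": "Welcome, traveler.", "intro_02": "The gates close at dusk."}
    for name, text in lines.items():
        audio_config = speechsdk.audio.AudioOutputConfig(filename=f"{name}.wav")
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                                  audio_config=audio_config)
        synthesizer.speak_text_async(text).get()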
I've been messing with the open source side of audio generation, and expressiveness still takes work but it's getting there. Roughly summarized my findings are:
- zero shot voice cloning isn't there yet
- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad when just using zero shot, finetuning helps
- F5 and fish-speech are both good as well
- xtts for me has had the best stability (I can rely on it not to hallucinate too much; the others I have to cherry-pick more to get good outputs)
- finetuning an xtts model for a few epochs on a particular speaker does wonders; if you have a good utterance library with emotions, conditioning a finetuned xtts model with that speaker expressing a particular emotion yields something very usable
- you can do speech-to-speech on the final output of xtts to get to something that (anecdotally) fools most of the people I've tried it on
- non finetuned XTTS zero shot -> seed-vc generates something that's okay also, especially if your conditioning audio is really solid
- really creepy voice clones of arbitrary people, indistinguishable at a casual listen, are possible with as little as 30 minutes of speech; the resulting quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from YouTube videos/podcasts using de-noising/vocal-extraction neural nets
TL;DR: use XTTS and pipe it into seed-vc. End to end, that pipeline on my machine runs at something like 2x realtime and generates highly controllable, natural-sounding voices; you do have to manually condition emotive speech.
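A minimal sketch of the XTTS half using the Coqui TTS package (the speaker wav and text are placeholders); the seed-vc step is left as a comment since its exact CLI arguments depend on the repo version:

    from TTS.api import TTS

    # Zero-shot XTTS: clone the voice in reference.wav onto the given text
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text="The gates close at dusk, traveler.",
        speaker_wav="reference.wav",
        language="en",
        file_path="xtts_out.wav",
    )

    # Then run seed-vc's inference script on xtts_out.wav with your
    # conditioning audio; see the seed-vc repo for the exact arguments.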
What scenario would be considered "zero shot" voice cloning?
At the very least, wouldn't you have to provide one sample? That would make it "few-shot" (if that term even makes sense in this context).
I think the key distinction is that there is no specific training data for that speaker. You can view the input as just the input voice to clone, not training examples.
It would be more like training examples if you had to give it specific phrases.
XTTS is non-commercial use only, though.
I think XTTS is MPL now since Coqui folded, but I am not a lawyer and I am not using this for anything commercial so I haven't looked closely.
Testing with "Banana... Banana???? Banana!!!!!" Yields interesting results each time, and none so far the way a human would read it.
Do these services restrict the content that their AIs give voice to? If so, what are the typical restrictions? Like do they seek to prevent their tech being used for scamming, erotica, hate speech, etc? Or is it pretty much anything goes?
How, do you think, can they restrict that? Require that in the EULA, then sue someone who breaks the rules at a scale large enough to be worth the cost of the lawyers?
Or do you think they should analyze the text's sentiment and raise a flag if the sentiment is obviously breaking the EULA, e.g. some kind of hate speech?
How would you implement that?
Content moderation on text input is how I would assume it would be done, essentially the same way LLMs work now. But to be clear, I am not advocating for it. I’m just asking if they do it.
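For what it's worth, a minimal sketch of that kind of pre-synthesis check, assuming the OpenAI moderation endpoint as the classifier (any text classifier would do):

    from openai import OpenAI

    client = OpenAI()

    def safe_to_synthesize(text: str) -> bool:
        # Flag the text before it ever reaches the TTS model
        result = client.moderations.create(input=text).results[0]
        return not result.flagged

    if safe_to_synthesize("Read this script aloud."):
        print("ok, send to TTS")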
Maybe launched a bit too quickly? You can select "Swedish" for the non-Swedish voices, but the results are very poor. Far from usable. And there is no Swedish voice. So that language-support claim is made a bit too soon, I would say.
Also, I found no way to filter/sort the voice selection modal by language, so I have to visually search the entire list.
To the founders: I would love to share my audio files with my team before we commit to a payment plan. Is there any way to share audio files I've generated?
Our whole team is on Elevenlabs and a switch is significant work, but I think the results are worth it! Super awesome work!
I have a 3:1 preference for services that don’t unnecessarily require a mobile number to sign up.
Interesting that OpenAI is left out of the comparison. I know they don’t do cloning, but there’s definitely overlap for narration.
I wonder which comparison they hope to avoid, quality or price?
So, this is really impressive. Expressivity and pacing are wayyy better. Eleven Labs has been tops for some time, but the difference is pretty remarkable!
Thanks Peter! We think it really crushes for emotive text. Anything from storytelling to being emotionally reassuring. Still a lot of things up our sleeve too!
I have a particular use case I’m interested in using agents for - any chance you want to have a call?
In brief, I’d like to be able to generate conversations via API, choosing voices that should be unique, on the order of thousands. Essentially I’m trying to simulate conversations in a small town. Eleven is not set up for this.
Ideally I’d be able to pick a spot in latent space for a voice programmatically. But I’m open to suggestions.
Wow! The preview is amazing! I would've 100% assumed those were human narrations if I wasn't given leading context.
Awesome!
I signed up and right from the start I have -3 words left. LOL