Weird A.I. Yankovic: a cursed deep dive into the world of voice cloning

waxy.org

328 points by waxpancake 2 years ago · 206 comments

mecredis 2 years ago

It's kind of wild that these tools just transfer a copy of these models every time they're spun up (whether it's to a Google Colab notebook or a local machine).

This must mean Hugging Face's bandwidth bill is crazy, or am I missing something (maybe they have a peering agreement? heavy caching?)

  • satertek 2 years ago

    Their Python module caches the downloads and checks the cache before downloading them again...but you're probably not wrong about the crazy bandwidth bill. Looks like they have crazy VC money though, considering the current climate.
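
    (For context, this is roughly how the Python-side caching works; a minimal sketch using huggingface_hub, with "gpt2"/"config.json" as arbitrary example arguments:)

      from huggingface_hub import hf_hub_download

      # First call downloads the file; later calls return the local cached copy
      # (default cache is ~/.cache/huggingface/hub, overridable via cache_dir or HF_HOME).
      path = hf_hub_download(repo_id="gpt2", filename="config.json")
      print(path)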

  • civilitty 2 years ago

    Unmetered 10+ gigabit connections were on the order of $1/Mbit/mo wholesale over a decade ago when I priced out a custom CDN, so for the cost of 100 TB of data transfer out of AWS you could get a 24/7 sustained 10 Gbit/s (>3 PB per month at 100% utilization).

    Bandwidth has always been crazy cheap.

    • hotnfresh 2 years ago

      Not all connections are created equal. Even some big providers clearly have iffy peering agreements upstream that’ll manifest as terrible performance if you have a widely-geographically-distributed bandwidth-heavy load.

    • colechristensen 2 years ago

      Indeed. If you're not using a cloud provider bandwidth is extremely cheap.

      In fact locally I can get a 10 gbps home internet unmetered connection for $300/mo.

      I'm not sure how they'd react if I transferred 1 PB/mo though :)

      • azinman2 2 years ago

        That’s pretty expensive. Sonic offers 1-10 Gbps (depending on where you live) unmetered symmetric connections for $60/mo in the Bay Area… they’re also the only ISP that petitioned the FCC in favor of net neutrality.

        For work I end up transferring 50-150 gigs often, sometimes daily. Never heard a word from them that this has been a problem.

      • tomrod 2 years ago

        Is my math wrong here? 10 gbps -> 8s per 10 GB -> 800s per 1TB -> 80,000s per 1PB -> 22.3 hrs at full speed for 1 PB?
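
        (Checking: 1 TB at 10 Gbit/s is ~800 s, so 1 PB is ~800,000 s, which is about 222 hours, or roughly 9.3 days of sustained transfer; the 80,000 s / 22.3 h figures are off by a factor of ten. A quick check:)

          # time to move 1 PB over a sustained 10 Gbit/s link (decimal units)
          seconds = (1e15 * 8) / 10e9      # 800,000 s
          print(seconds / 3600)            # ~222 hours (~9.3 days)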

    • morkalork 2 years ago

      If you host copies of your data with a few big providers could you do something smart like detect and redirect requests from AWS to an S3 bucket and not pay for bandwidth leaving the provider?

  • anonylizard 2 years ago

    Huggingface has a strategic partnership with AWS.

    1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.

    2. Huggingface probably does face insane bills compared to github. But AWS can probably develop some optimizations to save bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.

    • fomine3 2 years ago

      AWS egress traffic charges are so outrageous that they can easily offer a huge discount without any technical improvement.

    • jandrese 2 years ago

      One doesn't usually opt for AWS when their goal is to reduce transfer costs.

      • fnordpiglet 2 years ago

        Unless AWS makes an agreement not to charge you transfer costs, which they often do for various open source and software projects like this.

  • toddmorey 2 years ago

    Is Hugging Face just a model repository like GitHub is a code repository? It seems you can rent compute, both CPU & GPU, but you are right that most models seem to be run elsewhere.

  • pdntspa 2 years ago

    I really wish I could configure this crap to cache somewhere other than my C: drive

    Or better yet, how about asking me where I want to store my models?
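
    (For what it's worth, the Hugging Face cache location can usually be moved; a minimal sketch assuming the standard huggingface_hub conventions, with D:\hf-cache as an arbitrary example path:)

      import os
      # Set this before importing transformers / huggingface_hub so the whole cache moves off C:
      os.environ["HF_HOME"] = r"D:\hf-cache"

      # Many download helpers also accept an explicit cache_dir argument, e.g.:
      from huggingface_hub import snapshot_download
      snapshot_download("gpt2", cache_dir=r"D:\hf-cache\hub")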

  • jonluca 2 years ago

    You can do a lot of these fully locally with things like RVC web ui or https://tryreplay.io/

minimaxir 2 years ago

This article only covers the musical aspects of AI voice cloning, but there's another dynamic to AI voice cloning that's more complicated: replacing general voice actors in movies/video games/anime (example: https://www.axios.com/2023/07/24/ai-voice-actors-victoria-at... )

Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:

- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)

- Are already underpaid for their work as-is

- Have no legal ownership of their voice since they are contractors, and their voicework is owned by their clients, who may not be as incentivised to protect the VO.

I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.

  • sumtechguy 2 years ago

    What I find interesting is the aspect that eventually, these companies will hire some college kids who need a couple thousand bucks and a free pizza. Have them read the right scripts, sign the right 'give everything away' contract, and just forever use their voice. Or do it sneakily: have a voice assistant and put 'we can use a copy of your voice for anything' in your ToS.

    The existing voice actors will just be out of work. There will be a small cadre of groups that want a real voice, but for some projects that will not be that important.

    It's going to get crazy.

    • Legend2440 2 years ago

      They don't need that - they already have enough data to generate plausibly human voices that don't sound like anyone in particular.

      Voice cloning is a special case; these models are equally good at making new voices.

      • techdragon 2 years ago

        I’ve found it’s not actually as easy as that to get this stuff to sound different from the specific person it’s trained on.

        • gwern 2 years ago

          Don't expect that to last more than a year or two, assuming it's even still a problem for the best voice-generation AIs. Generating high-quality samples at all is the hard problem; generating specific high-quality samples is, by comparison, a lot easier.

          Remember when Stable Diffusion was released a year ago and one of the big artist copes was "sure, it can generate random images, but it'll never be able to generate the same character repeatedly!" They were already wrong, because Textual Inversion and DreamBooth had already been published; soon enough they were ported to SD, and now people could (and did) dump out thousands of images of the same character in the same consistent style, etc.

          • techdragon 2 years ago

            The issue is more that I can’t get the equivalent of a slider control to adjust one or more properties of the voice from the AI in real time… like a vocal fry slider, to use an example of something most people are capable of doing deliberately when they want to. The currently available models are pre-trained to sound like the average/median of one specific person (or character), and while I imagine tools will improve to control and customise the training of the models and their vocal output, I don’t see a clear path from the current model architecture to one where I can freely control the stylistic expression of the vocal output without loading in a completely different set of model data trained for that new desired output.

            • gwern 2 years ago

              No, that's easy. We had the equivalent of that in GANs many years ago. If you've never seen GAN editing, here's a quick video: https://www.youtube.com/watch?v=Z1-3JKDh0nI (Background: https://gwern.net/face#reversing-stylegan-to-control-modify-... ) You just classify the latents and then you can edit it. These days, with pretrained models like CLIP, you don't necessarily even need a latent space: you can take a model which has been trained on sound/text descriptions, like AudioCLIP, prompt it with a text like "vocal fry", and then the generated samples are subtly skewed to try to maximize similarity with "vocal fry". You put a slider on that for how much weight/skewing it does, and now you have a slider control to adjust properties of the voice from the AI. If something like this doesn't exist, it's obvious how to do it. (Even the realtime problem is being solved by figuring out how to train diffusion models to do a GAN-like single pass: https://arxiv.org/abs/2309.06380 )
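
              (A minimal toy sketch of that slider idea: small placeholder networks stand in for an AudioCLIP-like audio/text model and for the decoder. None of this is a real API; it only shows the shape of text-guided steering with a weight.)

                import torch

                # Placeholder modules standing in for an AudioCLIP-like model and a latent decoder.
                text_proj  = torch.nn.Linear(16, 64)
                audio_proj = torch.nn.Linear(1024, 64)
                decoder    = torch.nn.Linear(128, 1024)        # voice latent -> "waveform"

                def embed_text(prompt: str) -> torch.Tensor:
                    torch.manual_seed(abs(hash(prompt)) % (2**31))   # deterministic fake text embedding
                    return text_proj(torch.randn(16))

                def steer(latent, prompt, slider, steps=30, lr=0.05):
                    """Nudge a voice latent toward a text attribute (e.g. 'vocal fry'), weighted by slider in [0, 1]."""
                    target = embed_text(prompt).detach()
                    z = latent.clone().requires_grad_(True)
                    opt = torch.optim.Adam([z], lr=lr)
                    for _ in range(steps):
                        sim  = torch.cosine_similarity(audio_proj(decoder(z)), target, dim=-1)
                        loss = -slider * sim + (1.0 - slider) * (z - latent).pow(2).mean()
                        opt.zero_grad(); loss.backward(); opt.step()
                    return decoder(z).detach()

                voice = torch.randn(128)
                more_fry = steer(voice, "vocal fry", slider=0.8)   # the slider sets how hard to push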

              • techdragon 2 years ago

                I didn’t really get to explore the GAN generation of ML work since I had no supported hardware (no desire to support the nVidia monopoly on ML work) and refused to blow money on cloud instances I’d probably forget about at some point and wind up with a giant bill.

                It’s a really different world now that I’ve got massive models running on my laptop thanks to Apple Silicon and the unified memory architecture, and the C++ ports of various diffusion image models and several families of large language models work well on my AMD GPU too… it’s so much easier to participate in the current generation of applied ML work without having to go out of my way to get specific ML-supported hardware.

    • HappyDaoDude 2 years ago

      I have said this will initially be sold as a feature on things like Audiobooks.

      Pick your book, pick your reader and away it goes. The Diary of Anne Frank read by Gilbert Gottfried.

      • trafficante 2 years ago

        Not sure if your hypothetical was meant to be a reference to the absolutely hilarious classic “Gilbert Gottfried reads Fifty Shades of Grey”, but it has me wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines.

        https://youtu.be/XkLqAlIETkA (Extremely NSFW without headphones)

        • KineticLensman 2 years ago

          > wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines

          For me it came from the voice; I hadn't heard of Gilbert Gottfried as a specific person until I read this discussion. The reaction faces of the women listeners were also amusing.

      • evilduck 2 years ago

        I still like getting surprised when a new or unorthodox narrator knocks it out of the park, but I’d really enjoy a “salvage this purchase” exit hatch with an AI voice alternative. I’d even pay a buck or two on top of an existing purchase to automatically fix a bad narration.

        Head over to Audible reviews, some books are widely considered to be great books as written but the audiobook is reviewed as one to be avoided because it was recorded poorly, the narrator paced it wrong, they had an annoying voice, they couldn’t do a voice of the opposite gender, whatever.

        Plus it seems like a great accessibility feature. Many books are recorded for the vision impaired community by volunteers and that’s admirable, but some of the AI today does a much better job.

        • HappyDaoDude 2 years ago

          These are some very fair points. There was one book 'Electron Fire' all about the creation of the transistor, I think. I say that because never have I heard a more unenthused narrator. Makes Henry Kissinger sound like a dramatic actor.

          Any AI voice could save that one. Any of them! Heck, the original voice on the 1984 Macintosh could do better.

    • minimaxir 2 years ago

      Recent voice models by OpenAI, Meta, and ElevenLabs all state upfront that they work with paid professional voice actors, so this space will get interesting fast.

    • hiccuphippo 2 years ago

      Mozilla has a voice data project where people already do it for free(dom) ;)

      https://commonvoice.mozilla.org/en

  • supriyo-biswas 2 years ago

    HN isn't the only community to write for. While most people here seem to be unsympathetic to such job concerns, unconventional articles do hit the front page from time to time.

    I'd like to read it, in any case.

    • pixl97 2 years ago

      The get-rich-at-any-cost types like to post on these articles at a higher rate, I think. When you read a larger and broader range of HN posts you see that a substantial part of the population here has concerns about this.

    • rcarr 2 years ago

      +1, I would also like to read it

      • galangalalgol 2 years ago

        I would as well. It isn't that I'm unsympathetic, it is just that we haven't outlawed technology that put others out of work, and I'm curious why we would decide as a society this time should be different. If there are good reasons I want to know.

        • mschuster91 2 years ago

          Putting people out of work is one thing, that's bad enough and societies should take care to guide change and support those affected.

          The danger behind AI and other manipulative technology is that it erodes trust. We already have serious issues with trust in media, and not just the obvious cases of Russian/Chinese propaganda, but also stuff like kids getting anorexia from extremely photoshopped advertising.

          Add AI on top and no one can be certain about anything anymore. Say someone distributes a fake "recording" of the US President calling for glassing Moscow, or the Serbian President declaring war on Kosovo? That has the potential to actually cost lives on a massive scale.

          • galangalalgol 2 years ago

            Yeah, all that is bad, but those consequences are already here aren't they? Restricting further research just means it will be done in clandestine government labs like chemical or biological weapons except with equipment costs orders of magnitude lower. I can imagine policies that would save the jobs of voice actors, but none that would prevent the wave of deepfake propaganda that is coming.

  • ImprobableTruth 2 years ago

    Voices are uncopyrightable, but impersonation isn't legal (see Midler v. Ford, for a notable case), so I don't think the situation is totally clear.

    • deepsun 2 years ago

      > voice actors are fearing that the ability for generative AI to replicate their voices may cost them work

      I'm not sure how to feel about that. I'm against the idea that some people "deserve" being paid for being lucky enough to be born with an interesting voice.

      On the other hand, the world has always worked like that. And, say, a hard-working farmer or doctor was also lucky to be born with the traits necessary to make their living, while others weren't.

      • minimaxir 2 years ago

        Voice acting is more than just talking into a microphone. It's a skill not limited to the quality of voice.

        • deepsun 2 years ago

          A lot of skills are not simple, but computers have taken them over anyway. For example, financial bookkeeping is not just writing and storing the books; it's a professional skill with many tricks to learn. However, databases and spreadsheets have taken over the major part of those jobs. The same could be said about programmers who learned the skill of programming in assembly language. Or performing -- vinyl records and CDs have largely displaced orchestras and traveling musicians.

          I would vote for it only if it somehow encouraged voice actors to experiment and create new interesting styles. Kinda like patents were designed to do -- encourage inventors (although recently they have become controversial in the IT world).

        • bugglebeetle 2 years ago

          Yes, everyone has a voice. The number of people who can convincingly act with said voice is remarkably small, and doing so requires a good deal of innate ability or training, generally both.

        • vunderba 2 years ago

          You could have made that argument more effectively in the past when voice actors had to be able to mimic multiple voices (Dan Castellaneta, Mel Blanc, etc.). Nowadays, we're seeing more and more shows where the voices of the characters are just... the normal voice of the voice actor.

          Of course it's not totally devoid of skill, you need to be able to emote, inflect, and convey emotion, but the bar is far far lower.

          • fomine3 2 years ago

            I'd argue that emoting is a more important skill than switching between multiple character voices, though the latter contributes to getting more jobs.

      • vasco 2 years ago

        > that some people "deserve" being paid for being lucky born with an interesting voice

        The majority of success is attained like this, though. Athletes are paid for being born strong, tall, and fast; models for being pretty; rich families for being born rich; smart people for being born smart, or hardworking, etc. It's the most dominant factor everywhere.

    • sofixa 2 years ago

      It's always funny to me when people cite old American case law and try to wrap their heads around how it can apply to a situation the case's participants couldn't possibly have imagined. Shouldn't the correct way to do this be new legislation, created after consulting interest groups, to answer the modern problems which exist due to modern realities, like what the EU is doing? It seems a much more sensible approach than wondering how the author of a 15th-century ruling would have applied his thinking to something he couldn't even dream of.

      • lazide 2 years ago

        Interest groups == lobbyists in this case. Which might explain some of the American hesitation.

        • sofixa 2 years ago

          Well yes, you need to ask representatives of the people that will be impacted by a law what the impact will be, assess expert opinions, etc. Lobbying isn't only the American political bribery system; there are legitimate reasons behind it.

          • lazide 2 years ago

            Of course! And that those with the deepest pockets can afford to have the most convincing folks spend the most time waiting for an opening in the various Representatives' calendars is not surprising, and only natural.

            That it often results in them getting an equivalent mindshare (or more) of the Representatives' views is also not surprising, and only natural.

            It doesn't inspire warm fuzzies in those too busy working to survive though.

      • flextheruler 2 years ago

        Your government class didn’t cover common law versus case law?

        • sofixa 2 years ago

          You probably mean common law, also sometimes known as case law, vs civil law, which traces its origins to the Napoleonic civil code, and which is used in all of the world outside of the former British colonies.

          My law classes did cover common law, yes, but not favourably (can you guess I come from a civil law country?). It sounds like a system that made sense in 15th-century Britain, but is quite the complex beast with many issues nowadays, when it doesn't need to be.

          However, that still doesn't answer my original question: why is there no new legislation to cover the new scenarios being talked about? It seems to me that even the UK does that, at least for some things, and they're the original common law country.

    • lazide 2 years ago

      As long as they don’t claim the voice is the original actor (misspell the name perhaps, or the Hollywood classic ‘based on’), they won’t be impersonating no?

      • gs17 2 years ago

        The Ford ad didn't say it was Midler, they just implied it by using her song with a soundalike. There was another similar case with a parody ruled as impersonation. I don't think there's good precedent for exactly where that line is drawn.

  • GuB-42 2 years ago

    Interesting note: many Vocaloids (most notably Hatsune Miku) are sampled from voice actors rather than singers.

    Singers didn't want software clones, but voice actors are fair game.

  • sublinear 2 years ago

    I have a different take on this.

    AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.

    The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.

    • chefandy 2 years ago

      This is the case for all generative "art." The people at the high end will still get paid well. The people who specialize in more utilitarian or low budget tasks in higher volume will take the biggest hit. Nobody who'd planned on hiring Morgan Freeman to do a voice over will be tempted to use AI Morgan Freeman instead.

    • aussieguy1234 2 years ago

      The MVP might have the free "good enough" AI voiceover; it takes less money to bootstrap a new product that way.

      The real product would have a real voice over actor paid for with VC money.

    • matteoraso 2 years ago

      >There is zero progress made towards any sort of creative AI that produces good unique work.

      It's only been a year. Give it some time and I'm sure AI will have much better results. Right now, you can get some of that unique work by finetuning the AI off of a person's existing portfolio.

  • zerojames 2 years ago

    I am interested! You should write about what you find interesting; never worry if it will interest a particular group.

  • foobarian 2 years ago

    It saddens me because of how much impact they had on my family as we played through the story line in Genshin and immersed in the world. At some point we met a few of the voice actors at a convention and they were like stars to us, while I'm sure their circumstances are as you describe.

  • raytopia 2 years ago

    I'd be interested.

    Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.

    Random aside: someone really needs to make a Hacker News that focuses more on game development and other arts, so blog posts like the one you're talking about would have a proper community to discuss them in.

    • Legend2440 2 years ago

      Replacing voice actors with text-to-speech is good because it lets you do things voice actors can't:

      * Create dynamic new voice lines at runtime, for example game characters reacting to new situations.

      * Operate at a scale that's infeasible for humans, for example turning every ebook into an audiobook.

      • JohnFen 2 years ago

        Which are, in my view, really minor advantages when compared to the disadvantages. Not only in terms of putting people out of work, but in terms of increasing the artifice of the world around us and decreasing its humanity.

        • Legend2440 2 years ago

          "putting people out of work" by automating jobs is also a good thing.

          The amount of stuff humans can accomplish is strongly limited by the supply of workers. Automating one job frees them up to do other things.

          • JohnFen 2 years ago

            > "putting people out of work" by automating jobs is also a good thing

            Unless you're one of the people out of work. And even if you don't care anything about them, if there's enough of them then the resulting unrest will be your problem anyway.

        • cm2012 2 years ago

          There's little, if anything, more important to the happiness of humanity than increased productivity per capita. That sounds crazy, but when you think about it it's true.

          • Koffiepoeder 2 years ago

            Well, this is a very one-sided view of the world, I'd say. From personal experience, I can surely tell you that I was much happier in countries where productivity was lower. The people there are just so much more pure of heart.

            • cm2012 2 years ago

              It's fine to visit, but in every measure of happiness people in poor countries are more lonely and report worse life satisfaction.

          • JohnFen 2 years ago

            > That sounds crazy but when you think about it it's true.

            I've thought a lot about it, and I don't think it's true.

  • dylan604 2 years ago

    > and their voicework is owned by their clients who may not be as incentivised in protecting the VO.

    The work product produced by their voice for fulfilling the contract is owned. No corp owns someone else's voice.

    • Jeff_Brown 2 years ago

      Property is a bundle of rights, and often hard to pin down. In the case of voices, if a company owns enough of your data to train a good simulacrum, and they have the right to do it, then they kind of do own your voice -- or more precisely, a damn good substitute.

      • minimaxir 2 years ago

        Case in point, Luke Skywalker / Darth Vader in the D+ series: https://www.vanityfair.com/hollywood/2022/09/darth-vaders-vo...

        > Belyaev is a 29-year-old synthetic-speech artist at the Ukrainian start-up Respeecher, which uses archival recordings and a proprietary A.I. algorithm to create new dialogue with the voices of performers from long ago. The company worked with Lucasfilm to generate the voice of a young Luke Skywalker for Disney+’s The Book of Boba Fett, and the recent Obi-Wan Kenobi series tasked them with making Darth Vader sound like James Earl Jones’s dark side villain from 45 years ago, now that Jones’s voice has altered with age and he has stepped back from the role.

      • b112 2 years ago

        Copyright is complex. And artists' rights are outside of copyright, in some respects. An example: in the past, painters have had their works bought and then hung in unfavourable conditions, or in places/locations that reflect poorly upon the work of art.

        Artists have sued, and won, to have artwork moved, shown differently, or force-sold back to the artist.

        Now, everything you say is copyright... you. At least in my legal jurisdiction! Even my image is, in Quebec! Yes, that includes if you take my picture outside.

        So what of one's voice? What if you don't have a real agreement to use that voice in any way desired, and then you use that voice to... I don't know, advocate for terrorists or something weird?

        What then?

        I don't think it's completely clear-cut, and I think there will be changes and decisions on this down the road.

        • dylan604 2 years ago

          We've seen plenty of examples of famous people suing companies for using their likeness in ads as if they are promoting a product. Tom Hanks' name is currently in the news for this.

          If a company uses an actor's previously recorded dialog, edited in a way that makes them sound in favor of terrorism, in an attempt to have people think the actor said the words, we have issues on so many levels. If the dialog is chopped/re-edited to use as dialog for the same character in later works, then I really don't have issues with it.

        • dylan604 2 years ago

          I pay little attention to SAG contracts, but after the Writers Guild strike, I'd be expecting SAG to follow suit with major asks to protect its members from AI if they have not already covered it.

        • autoexec 2 years ago

          > Artists have sued, and won, to have artwork moved, shown differently, or force-sold back to the artist.

          That seems insane to me. Do you have specific examples?

          • rendx 2 years ago

            https://en.wikipedia.org/wiki/Moral_rights

            "Independent of the author's economic rights, and even after the transfer of the said rights, the author shall have the right to claim authorship of the work and to object to any distortion, modification of, or other derogatory action in relation to the said work, which would be prejudicial to the author's honor or reputation."

            https://en.wikipedia.org/wiki/Authors%27_rights

            "The authors of dramatic works (plays, etc.) also have the right to authorize the public performance of their works (Article 11, Berne Convention)."

            "The protection of the moral rights of an author is based on the view that a creative work is in some way an expression of the author's personality: the moral rights are therefore personal to the author and cannot be transferred to another person except by testament when the author dies."

            "“Author” is used in a very wide sense, and includes composers, artists, sculptors and even architects"

            Architects can deny changes in interior design: Lighting, artwork, etc., long after the building is finished. Just a few days ago I talked with a theater director: The author of the original work has the right to deny a production, for whatever reason, e.g. if they don't like the nose of an actor.

            I bet my voice is mine under most jurisdictions (and I mean most; the Berne convention has been signed by 181 countries), even if I signed a contract that gives you wide permission to use it. And if I didn't, you can't use it outside of the very narrow scope of the work I produced for you. Even if you simply want to reuse an existing recording in another context.

    • minimaxir 2 years ago

      They don't own the voice, but they own the vocal performance, which ends up being a meaningless legal distinction in practice.

      It's one reason why VAs rarely take fan requests for a character they voice.

      • dylan604 2 years ago

        If they are using their real voice, then they kind of screwed themselves. If they are performing a character voice, then at least they only lose out on that kind of work.

        I'm guessing contracts will need to be updated to say that an AI version of a character's voice can't be used, so a completely different production cannot claim they have the actor attached for publicity purposes.

    • rockemsockem 2 years ago

      No one owns a voice at the moment. There is no mechanism in the US to own a voice, even your own.

      • leni536 2 years ago

        A person's voice is effectively owned by the corresponding person through right of publicity, which includes voice depending on jurisdiction.

        California, for example:

        "Any person who knowingly uses another’s name, voice, signature, photograph, or likeness, in any manner, on or in products, merchandise, or goods, or for purposes of advertising or selling, or soliciting purchases of, products, merchandise, goods or services, without such person’s prior consent, or, in the case of a minor, the prior consent of his parent or legal guardian, shall be liable for any damages sustained by the person or persons injured as a result thereof."

        https://leginfo.legislature.ca.gov/faces/codes_displaySectio....

        • rockemsockem 2 years ago

          Voices can sound very similar, they're far from unique. Clearly if you say or somehow strongly imply that a voice belongs to a specific person then that is protected. But what if you use someone's voice, someone not especially well known, and don't make any claims about where it comes from?

          • leni536 2 years ago

            It's still not necessarily legal just because you can get away with it.

            • rockemsockem 2 years ago

              I don't think it's that clear at all. You own your "likeness", but the limits of what that means are highly untested. Of the similar examples that have been tested in court thus far, the Midler v. Ford case is the closest, but the court specifically called out the fact that, as a singer, her voice is a distinctive part of her identity, and so it is protected.

              https://en.wikipedia.org/wiki/Midler_v._Ford_Motor_Co.

  • aaroninsf 2 years ago

    <raises hand> I am

  • EGreg 2 years ago

    Please do. Some of us critique capitalism

  • rcarr 2 years ago

    It's sad if the only way voice actors are going to be able to make a living is by doing stuff like Critical Role on Youtube. I love Critical Role but it likely wouldn't be the same if those guys hadn't spent years honing their craft. Watching people play RPGs online has replaced a lot of my streaming viewing now, but the market is much smaller and I imagine it can only sustain a much smaller pool of creatives than the current voice over market can.

RecycledEle 2 years ago

Wow. I just realized any one of us could redo Weird Al's songs with his lyrics, but with the original singer's voice. We could be listening to Michael Jackson singing "Just Eat It" by lunchtime.

I am constantly amazed at how the new AI tech can be used.

Of course this would be illegal under most countries' copyright laws.

  • unnah 2 years ago

    There's also a Weird Al piece "I think I'm a clone now", for which an AI clone voice performance would definitely be fitting. (The original song was "I think we're alone now" by Tommy James and the Shondells, but it seems Weird Al was parodying the cover by Tiffany in the 1980's.)

    While Weird Al himself asks for permission, it's well established that parody is not copyright infringement. There should be room for parody performances by AI voices as well, especially if argued by a good lawyer.

    • mbg721 2 years ago

      Al is very self-aware (that second character is a lower-case ell); he's less concerned with legal entities than with his relationships with musicians.

  • greenhearth 2 years ago

    How would this be amazing? It just sounds stupid and a waste of time.

  • RecycledEle 2 years ago

    And...they already did it.

mckirk 2 years ago

My absolute favorite application of this tech so far is The Beach Boys singing 'Hurt'. It's the first time I seriously didn't notice any artifacts, and it just works so well even though it really shouldn't.

Enjoy: https://youtu.be/gmNSFqyg_Z8

  • dwringer 2 years ago

    I don't know what I was expecting but that isn't Hurt, it's Surfin' USA with Hurt's lyrics that sound extremely jittery and grainy.

    I'm curious though if some AI soon could in fact synthesize the Beach Boys' style with the actual chords and melody from the NIN song, possibly with some of the pathos of Johnny Cash as well.

    • legitster 2 years ago

      I agree. The "x words over y music" can be fun, but isn't really impressive as a true genre parody.

      The one that always comes to mind for me is this video of an Eminem interview done from scratch as a Talking Heads song: https://www.youtube.com/watch?v=Kfl3N9nesRg

      This is potentially something that generative AI could be good at doing (at least recreating vocals), but this parody of the Talking Heads required a lot of very clever insight into what made a good Talking Heads song and returned a convincing and novel melody. And I think we are still a ways off.

      • adamesque 2 years ago

        Yeah, Nick Lutsko is super super funny and a very talented musician. That’s hard to replicate.

      • kristopolous 2 years ago

        More evidence that AI is an artistic tool of new media and not a replacement of the old.

    • sumtechguy 2 years ago

      The one I found fun was the Matrix / Ice Ice Baby mashup. That was sort of janky but good enough to be fun.

    • tracerbulletx 2 years ago

      The name of the channel is "There I Ruined It".. That's the point, the person who created it did it specifically to make you feel like that.

    • darkerside 2 years ago

      Yeah, I hate it to the point of being personally offended. It has nothing to do with Johnny Cash's rendition. I'd probably feel a bit better, but not much, if it were advertised as a NIN mashup.

      • mock-possum 2 years ago

        Yeah that’s kind of the theme of the YouTube channel - I think it’s hilarious honestly, but maybe you have to go into it knowing what to expect.

        • darkerside 2 years ago

          Yeah, based on the parent, and the genius of the musicians involved, I was expecting something more than the sum of its parts. Hurt is an incredibly powerful song, and the Cash rendition imbues it with another beautiful layer.

          As a joke, I can see it being funny, but it was a jarring way to experience it.

      • aidenn0 2 years ago

        NIN and Cash have equal billing on that video. Many people might only know Cash's rendition...

      • cm2012 2 years ago

        It's definitely in the realm of "soulless"

  • code_runner 2 years ago

    This account is one of the absolute top tier creators for weird music mixes. The recent deep faking stuff has been shockingly good. I think this is a good example of an "acceptable" use of AI, as long as artists/composers etc rights are all settled.

    It's always more fun when it's a real group of talented people being silly, but I'd listen to an album of weird mashups like this for sure.

  • hinkley 2 years ago

    The graininess of the recording covers over a lot of potential problems. But given that this attempt keeps the Beach Boys’ tempo and enunciation, I think this technique, whatever it is, would make a much more compelling version of Michael Jackson covering Eat It.

  • nsbk 2 years ago

    That hurt

distantsounds 2 years ago

The sampled voices sound neither like Michael Jackson nor Weird Al. A good effort, but a professional impersonator could likely do better on either front.

  • nemo44x 2 years ago

    It sounds like Weird Al trying to be Michael Jackson trying to be Weird Al.

    • Reventlov 2 years ago

      As a non native speaker, it does sound a bit like Michael Jackson imo…

      • hinkley 2 years ago

        Sometimes I’ll watch a movie with voiceover work, where some character has a very specific accent, and I’ll be watching along for twenty minutes and the VA will let slip just a couple syllables of their real voice and my ears will prick up and I’ll think, hey I know this guy. Isn’t that… oh the guy from the thing. From <wrong movie>, no wait I mean <other movie>? Yes, it is.

        That’s what this sounds like. Five syllables of Michael Jackson while he’s trying to be Action Hero or Big Villain, or Funny Sidekick (a problem Eddie Murphy has never had, all evidence from Coming to America notwithstanding).

  • hinkley 2 years ago

    The best Michael Jackson interpreter in a town of 50,000 could do better than this. It’s… this is bad.

  • code_runner 2 years ago

    I know what you mean. It's more noticeable (imo) on the Michael one... but it's definitely in there. I think the pitch correction is to blame for a bit of the weirdness.

causi 2 years ago

AI song covers are incredible, from Goku singing "Don't Stop Me Now" to the cast of Spongebob singing "Ocean Man".

simonw 2 years ago

I did not know about this: "The center of the A.I. cover songs community is a massive 500,000+ member Discord called A.I. Hub, where members trade new tips, tools, techniques, and links to their original and cover songs."

  • codetrotter 2 years ago

    Me neither. That’s what’s so weird about the internet.

    Imagine half a million people out in the streets together. You’d definitely notice that. Meanwhile, we can have these massive online communities and you’d never know unless you accidentally stumbled across it or someone told you about it.

    • evan_ 2 years ago

      more accurate to say that, while 500,000 people joined the discord by clicking a link, some much, much smaller number are actually active on any sort of a regular basis

      • LordDragonfang 2 years ago

        Yeah, one of the "worst" (good for metrics, bad for legibility) parts of the trend of moving to discord for any sort of online community is that you have to "join" the community to even view any of the resources ensconced within. Meaning it's poorly indexed (discord search is okay, but not great) and not available at all to external crawlers.

        • throwaway290 2 years ago

          If this community was available for crawling then LLM would crawl it and there would be no value in participating in the community because you can just ask the LLM about all that, no?

          • LordDragonfang 2 years ago

            If the value your community provides is low enough that it can be effectively replaced by a general purpose LLM, then it should be. The value of a community should be pushing the boundaries of knowledge, not gatekeeping it.

            C'mon, this is hacker news, what happened to "information should be free"?

            • BeFlatXIII 2 years ago

              > C'mon, this is hacker news, what happened to "information should be free"?

              We've had an infestation of "pay me or I won't share" types.

              • throwaway290 2 years ago

                More like an infestation of "open = I must be able to exploit and abuse for profit" entitled SV VC types.

            • throwaway290 2 years ago

              "Information should be free" doesn't work when you have Microsoft inserting itself as a middleman of information.

              The community is not gatekeeping knowledge, anyone can join. It merely tries to keep certain corporations out...

      • thomastjeffery 2 years ago

        So to continue the analogy, 500,000 people walked down that street at some point. Some unknown percentage of that number is made up of unrecognized duplicates (same person, new username).

      • dylan604 2 years ago

        this sounds like the description of most "new" social platforms. we see immediate interest, and then a sudden loss of that interest

    • lmm 2 years ago

      > Imagine half a million people out in the streets together. You’d definitely notice that.

      In the streets, sure. Meeting up at out of town conference centers a few times a year, probably not. Most real communities have always been "dark matter" to those outside them; Discord working the same way feels more authentic than most of the internet.

  • joenot443 2 years ago

    Something I think we're slowly coming to terms with is that the current generation of techies (the ones who can afford to spend hours upon hours tweaking models and sharing results) really prefer Discord over our Web 2.0 forum-type communities like this one. Even on Reddit, which is lagging in popularity amongst Gen Z compared to Discord or TikTok, you can immediately tell upon reading /r/LocalLLMs that a really big chunk of the community is underage. To be clear, I think this is a good thing!

    There was a generation that preferred mailing lists. There was a generation that preferred IRC and BBS, and "my" generation which likes forums and lengthy comment threads. One would be naive to think this style (the one we're engaging in here) would last forever.

    There are definitely very real criticisms of Discord, searchability and discoverability being the most common, but at this point I think the die has been cast. Young people have made their choice.

    • BandButcher 2 years ago

      Agree, I'm in my early 30s and jump through most platforms, but spend very little time with TikTok/Discord. But I have to admit a lot of newer content (and tech framework support) has migrated to Discord channels. Even some YouTube sports talk shows have their own Discord for call-ins, etc...

      These big teleconference apps are usually hit or miss, but Discord currently seems to be the winner for actual "social networking"; add in its popularity in the gaming community too.

    • tavavex 2 years ago

      I kind of disagree? I am gen Z myself, and have used reddit extensively. While I like Discord a lot, I strongly disagree with using it to host content, essentially gating non-members from getting what they want (which is what leads to these communities with ludicrously inflated member counts). And this sentiment definitely isn't just me, a lot of the techie "CS major" people I know lean towards using slightly older services - which is also probably why the aforementioned /r/localllama community still has more than 60 thousand members.

      That being said, Discord does have some advantages over older forum-type communities - it's usually way better for cultivating smaller communities, and its no-effort-required chat systems means that you can always hop on and discuss things that are on the cutting edge. This is quite important in a field like AI, where it feels like something revolutionary happens every other week.

      (Also, I don't know if that implication was intentional, but gen Z and "underaged" haven't meant the same thing for many years now)

      • joenot443 2 years ago

        That's good to know! Yeah - I shouldn't imply that these preferences are universal or absolute, just trends I've personally noticed.

        Glad to hear you and your peers are still posting on the open web!

    • ThrowawayTestr 2 years ago

      Are we so out of touch? No, it's the children who are wrong.

  • jrm4 2 years ago

    I poked around there for a while, and my takeaway was "sub-par" all around, which might be the reason for its relative obscurity? The thing is, I can't tell to what extent it's the tech, and to what extent it's just "very uninteresting source material."

    Like, there's a whole lot of "classic song done by presently popular rapper," and I'll be the first to insist that there is nearly nothing vocally interesting at all coming from today's popular hip-hop artists (and I say this as an extreme long-time hip-hop aficionado)

smath 2 years ago

Related article from 1 year ago on Darth Vader’s voice being AI generated going forward:

https://arstechnica.com/information-technology/2022/09/james...

mito88 2 years ago

"celebrity voices impersonated"

Watch Light My Fire on YouTube Music https://music.youtube.com/watch?v=lN3v3EfA6_A&si=_hcG3Wjakxd...

ddmf 2 years ago

The most recent episode of Tacoma FD covered something similar to this mixed with a messed up Christmas Carol.

dreamcompiler 2 years ago

> ... Tom Waits, LeBron James, Knuckles, and, uh, Adolf Hitler.

I can't figure out if this is an example of Godwin's Law or not.

satvikpendem 2 years ago

What's the best open source text to speech? Eleven Labs and others are interesting but closed source. I want to use them mainly for audiobooks as I have a lot of ePubs and I'm just using the basic Google text to speech voices on my Android, via Moon+ Reader. It works fine but it's still more robotic than state of the art.

  • entrepy123 2 years ago

    POST-EDIT, CORRECTED ANSWER

    I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].

    Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).

    IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.

    PRE-EDIT, ERRONEOUS ANSWER

    Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.

      [0] https://github.com/snakers4/silero-models#text-to-speech
      [1] https://silero.ai
      [2] https://github.com/snakers4/silero-models#standalone-use
      [3] https://github.com/Grumbel/ttsprech#usage
  • lhl 2 years ago

    For neutral sounding very fast/efficient voices, I find Coqui TTS VITS models to be very good. For slower, more expressive voice or voice cloning I think the Coqui TTS XTTS is good (or you can look at the mrq/tortoise-tts).
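
    (Rough usage sketch, assuming the Coqui TTS Python API; the model names and arguments below may differ slightly by version:)

      from TTS.api import TTS

      # Fast, neutral single-speaker VITS model
      TTS("tts_models/en/ljspeech/vits").tts_to_file(text="Hello there.", file_path="vits.wav")

      # Slower XTTS v2: multilingual, clones a voice from a short reference clip
      TTS("tts_models/multilingual/multi-dataset/xtts_v2").tts_to_file(
          text="Hello there.", speaker_wav="reference.wav", language="en", file_path="xtts.wav")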

    I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/

    • modeless 2 years ago

      You're in luck, the code dropped 6 hours ago :) https://github.com/yl4579/StyleTTS2

      Looks promising, I'm going to check it out too! MIT license, even! If it's fast enough for real time, it could be the new best option. The paper claims faster inference than VITS...

      • lhl 2 years ago

        Ha awesome! I just checked the repo literally before I posted and it was still empty, thanks for the heads up, will give it a spin now.

        • lhl 2 years ago

          Just a followup for those interested, inference implementation notes and comparison clip between StyleTTS2, TTS VITS, and XTTS: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2

          • modeless 2 years ago

            Wow you got it working so fast! I'm still stuck in package manager hell trying to debug a million little issues.

            • lhl 2 years ago

              In my post I link to my issue where I outline what I needed to do from a clean mamba env; that might help.

              PyTorch nightly (which I use for CUDA 12) doesn't work with Python 3.12, but if you stick with 3.11 or 3.10 you should be OK. The rest of the dependencies are listed without version numbers, so on a clean venv you should be fine; however, there's a bug in the utils lib that requires a 1-line fix if you're trying to run inference (also linked). nltk was the only dependency not listed, so not bad compared to most code drops!

              • modeless 2 years ago

                I spent a couple of hours debugging why jupyter's debugger wasn't working right, so not exactly related to the code. I did also find and fix that utils bug you mentioned. But my current issue is that phonemizer won't find espeak even though I set the environment variables that are supposed to work. I'll figure it out eventually...

                Thanks for writing up your experience! Good to know it works! And it's fast!

                • Bilal_io 2 years ago

                  Are you on Windows? I've had the issue and was able to fix it by manually adding these system variables:

                    PHONEMIZER_ESPEAK_LIBRARY = c:\Program Files\eSpeak NG\libespeak-ng.dll
                  
                    PHONEMIZER_ESPEAK_PATH = c:\Program Files\eSpeak NG
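
                  (If the system-wide variables don't get picked up, the same thing can be done from Python before phonemizer loads eSpeak; a minimal sketch:)

                    import os
                    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
                    os.environ["PHONEMIZER_ESPEAK_PATH"] = r"C:\Program Files\eSpeak NG"

                    from phonemizer import phonemize     # import after the variables are set
                    print(phonemize("hello world", language="en-us", backend="espeak"))
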
                  • modeless 2 years ago

                    Yes I'm on Windows at the moment. I did try setting those yesterday but I must have made a typo or something. I'll try again, thanks!

                    Edit: Got it working, sounds really great and is super fast as advertised. Amazing! Just tried modifying the code to make it speak more quickly and it worked first try and still sounds good too! This is way better than using Coqui TTS. Just need a few more pretrained models and the voice cloning that was in the paper and this will become super popular very quickly.

  • NoMoreNicksLeft 2 years ago

    We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

    How many audio books is 40 hours?

    Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal, then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.

    We're probably several years out from it being something people use personally for audio books.

    • echelon 2 years ago

      > We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

      All of these AI as a Service (AaaS?) API companies are going to race each other to razor thin margins. Immediately after ElevenLabs raised, five other TTS services raised nearly the same amount of money.

    • dylan604 2 years ago

      >How many audio books is 40 hours?

      Are you reading War & Peace or Cat In The Hat?

      • Jeff_Brown 2 years ago

        I always assume 200 to 250 pages per book when someone talks about large quantities of books.

        • satvikpendem 2 years ago

          That's fairly short. I read about 100 books a year and it includes thousand page tomes like The Count of Monte Cristo.

          • dylan604 2 years ago

            I always assumed that book to be rather short since it just needs to be a number of sandwiches eaten.

            100 books/year. That's an impressive feat regardless of the number of pages. Are these downloaded ebooks or physical printed copies of books?

            • satvikpendem 2 years ago

              It's mostly audiobooks, I have some ePubs that don't have audiobooks anywhere, such as many Japanese light novel fan (or official) translations into English for example. I can get through them as I can understand audio faster than I can read text, as I play back at 3 to 5x speed.

              • dylan604 2 years ago

                What's your retention/comprehension of the content at those speeds? I find that those speeds allow me to understand the concept as it's whizzing by, but my retention of it is not good. Everything I've ever been taught, and my personal experience with long-term retention, says that speed is not the most conducive.

                • satvikpendem 2 years ago

                  Retention is pretty good, but that's because I've been training myself for the past 5 to 10 years to get to that speed. It's similar to how the TTS that blind people use is incomprehensible to most sighted people.

      • NoMoreNicksLeft 2 years ago

        I like to read with my eyes, not listen. I honestly have no idea how long an audio book is, hours-wise.

        I've seen a few for download, and they're always like hundreds of meg, if not over a gig. And that's in mp3, where it should be compressed heavily.

        • squeaky-clean 2 years ago

          In my Audible library, the shortest is the first Hitchhiker's Guide to the Galaxy at 5h51m. The longest is The Power Broker at 66h9m. Most of the books I have are in the 15-25 hour range, but I also have a lot of fantasy stuff that gets near 50 hours (Game of Thrones, Brandon Sanderson...).

          • NoMoreNicksLeft 2 years ago

            Well, then we're talking $300 to have ElevenLabs do a single GoT book, but maybe as many as 8 books for HHGTG-style stuff.

            That's just not good value. Was sort of my point.

  • modeless 2 years ago

    I've tried a few, not an expert, but I think Coqui's new XTTS models are decent performance and quality wise (just in terms of how the speech sounds, can't speak to the voice cloning fidelity as I don't care about that). Open source code but non-commercial license for the model. They also have a bunch of models with more permissive licenses that aren't as good.

    I doubt they're better than Google's TTS though.

  • ticulatedspline 2 years ago

    Bark seems pretty good

    https://github.com/suno-ai/bark Demo at https://huggingface.co/spaces/suno/bark

    In the couple samples I tried it was substantially better at picking up meaning compared to VALL-E-X
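
    (Basic usage is roughly the following, per the Bark README; the prompt text is arbitrary:)

      from bark import SAMPLE_RATE, generate_audio, preload_models
      from scipy.io.wavfile import write as write_wav

      preload_models()                                    # downloads/caches the Bark checkpoints
      audio = generate_audio("Hello, this is a test.")    # numpy array at 24 kHz
      write_wav("bark_out.wav", SAMPLE_RATE, audio)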

  • follower 2 years ago

    > What's the best open source text to speech?

    I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:

    * https://github.com/rhasspy/piper

    I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.

    The official samples are here: https://rhasspy.github.io/piper-samples/
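
    (The basic invocation is roughly the following, assuming the pip-installed `piper-tts` CLI; flags and voice names may vary by version:)

      import subprocess

      # Piper reads text on stdin and writes a wav; the model refers to a downloaded voice.
      subprocess.run(
          ["piper", "--model", "en_US-lessac-medium", "--output_file", "welcome.wav"],
          input="Welcome to the world of speech synthesis!".encode(), check=True)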

    Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/

    While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there's multiple diverse voice options with licenses that suit my purposes.

    (Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)

    [0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

    ----

    Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:

    * https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

    * https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

    * https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

  • artninja1988 2 years ago

    Would also like to know this. Can't seem to find an open source tts engine that works on mobile to read muh books

hinkley 2 years ago

> Artifacts aside, it sounds like Michael Jackson doing a Weird Al impression?! Every line has a distinctly “white and nerdy” vibe: it loses any seriousness and edge, exaggerating words for comic effect and enunciating lyrics really clearly so the punchlines can be heard.

No, it sounds like someone doing an impression of Weird Al doing an impression of Michael Jackson. Someone whose mom told them they were special and they believed it.

These examples are standing on a ridge line, surveying the uncanny valley and looking for the best way to cross.

  • blagie 2 years ago

    ... they're good enough.

    I have an accent. If not for that, I'd be a great presenter.

    If I could translate my voice into a poor Neil deGrasse Tyson, a poor Patrick Stewart, a poor Carl Sagan, a poor Morgan Freeman, etc., my presentations would be... better.

    • hinkley 2 years ago

      If it makes you more comfortable and confident, that is helping you.

      This isn't autotune for the spoken word, though. It's not fixing pacing or vocabulary, and in the audio above it isn't even fixing intonation. Yes, a thick German accent will give you away as being of German extraction. But you're also using the word 'since' when Brits and Americans would use 'for', and it's not going to fix that. Any more than it'll fix my french when I make the exact same mistake going the other direction (for=duration vs for=purpose vs for=interval). If I hear 'since one month' you're likely German or Indian. If you ask how long I've been in Marseille you'll know I'm American in about half that time.

    • totetsu 2 years ago

      Finally, a way to not have to fix society's prejudices: just give everybody the tools to emulate the ideal of perfection, no matter what color their skin is or what their accent sounds like.

Calamitous 2 years ago

Key takeaway:

> No current artificial intelligence is powerful enough to hide the weirdness of Weird Al.
