AudioPaLM: A Large Language Model That Can Speak and Listen

google-research.github.io

119 points by ml_basics 2 years ago · 36 comments

ml_basicsOP 2 years ago

> We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2.

Direct link to demo video showing speech-to-speech translation: https://google-research.github.io/seanet/audiopalm/examples/... (see website for more examples)

ot 2 years ago

Impressive that it translated "Morgenstund hat Gold im Mund" (morning hour has gold in the mouth) to the equivalent English expression "the early bird gets the worm", instead of going for a literal translation.

I wonder though how much the text in the video was editorialized. For example, I doubt that the model would have correctly capitalized PaLM.

  • famouswaffles 2 years ago

    Bilingual LLMs make less Literal Translations where appropriate.

    https://arxiv.org/abs/2305.16806

    • ksaj 2 years ago

      And this makes sense, as some sayings would be unrecognizable word salad in other languages and cultures. Early Arctic birds don't catch many worms.

      Gold in the mouth is something popularized by rappers (grills) back in the 90's, so that doesn't translate well at all for me.

  • bamboozled 2 years ago

    I actually really liked the literal translation, I thought it was cool even though I'd never heard it before, "oh well"...

    • ksaj 2 years ago

      Since it is an LLM, you should be able to ask it for a literal translation if that's what you want.

criddell 2 years ago

For some reason I’ve been getting 12-20 spam calls per day (all for the same Medicaid/Medicare scam). I’m on T-Mobile, which was one of the first carriers to roll out STIR/SHAKEN, and I have their Scam Buster app installed, yet the calls are still getting past all of that. It’s frustrating.

When I read about things like AudioPaLM, my first thought is of all the people in these call centers who seem to uniformly have pretty hard Indian accents and very American-sounding names (George Bush called me the other day!). Their days of working in a call center are numbered and their replacement is going to be a machine that is way cheaper to employ and better at the job.

  • pessimizer 2 years ago

    I'm more worried about the Philippines. Call center work is supporting the lower (younger) end of an educated bilingual middle-class there just as in India, but India has had more time to develop more options for those people than the Philippines has had.

    • ChatGTP 2 years ago

      The goal for business is to basically replace everyone with a computer, so I'd be worried about, everyone :)

      But actually, what is interesting to think about is that the desire to learn English will likely start to diminish from this. If there is little gain to learning it, like, the computer will just take your job, would you still bother?

      I mean some will remain interested, but many won't.

      • Jeff_Brown 2 years ago

        I live in Colombia and can testify that the demand to know English is enormous, and dwarfs the subset of demand for which call center work is responsible. Programming languages are documented in English. Scientific papers are written in English. Jobs in America require English. International travel outside of Latin America is much easier if you speak English. Etc.

  • zoklet-enjoyer 2 years ago

    I get those all day, every day. Fun to mess with them sometimes. I usually tell them my name is Ben Chode and my birthday is April 20, 1969.

    • criddell 2 years ago

      I just want there to be consequences for the abuse of the phone system and harassment that results. Nobody cares though.

      The phone company will change your number if you want. The FCC will let you report these - one call at a time.

      I actually thought about making an app to let me submit a report with a single click. If I started submitting 40-80 reports a week, would that get anybody’s attention? Would somebody at the FCC contact T-Mobile on my behalf and ask them to actually help me with this? Probably not.

      • zoklet-enjoyer 2 years ago

        Agreed. It basically makes my phone useless as a phone. I generally don't answer unrecognized calls unless I'm up for messing with a scammer, because that's who is usually calling. When I am expecting a phone call that's a problem because I sometimes don't recognize the number so don't answer. And my voicemail inbox is often just filled with this garbage.

    • mdaniel 2 years ago
  • robterrell 2 years ago

    Earlier this week I got a spam call that was almost certainly an AI-generated human voice.

rhogar 2 years ago

Though inference for the 8B model is almost definitely not capable of near real time inference yet, we’re approaching babelfish territory. Main difference perhaps being this is powered by burning massive amounts of carbon as opposed to a fish brain.

  • gwern 2 years ago

    > Though inference for the 8B model is almost definitely not capable of near real time inference yet

    Google previously showed you could get the fullsized 540b-parameter PaLM-1 model down to "a low-batch-size latency of 29ms per token during generation (with int8 weight quantization)" https://arxiv.org/abs/2211.05102#google . How many tokens per 1000ms do humans speak? I'm guessing fewer than 34. The real question is who wants to pay for it.
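
    A rough check of that arithmetic, as a sketch only: 29 ms per token works out to about 34 tokens per second. The speaking rate (150 words per minute) and the tokens-per-word ratio (1.3) below are illustrative assumptions, not figures from the thread or the paper, and they describe text tokens; audio tokens are produced at a different rate that is not modeled here.

```python
# Back-of-the-envelope comparison of model decoding speed and human speech.
# ASSUMED_WORDS_PER_MINUTE and ASSUMED_TOKENS_PER_WORD are illustrative
# assumptions, not measured values.

LATENCY_MS_PER_TOKEN = 29.0        # figure quoted in the comment above
ASSUMED_WORDS_PER_MINUTE = 150.0   # assumption: conversational speaking rate
ASSUMED_TOKENS_PER_WORD = 1.3      # assumption: text tokens per English word


def tokens_per_second(latency_ms_per_token: float) -> float:
    """Tokens generated per second at a fixed per-token latency."""
    return 1000.0 / latency_ms_per_token


def speech_tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Text tokens a speaker produces per second under the assumed rates."""
    return words_per_minute / 60.0 * tokens_per_word


model_rate = tokens_per_second(LATENCY_MS_PER_TOKEN)
human_rate = speech_tokens_per_second(ASSUMED_WORDS_PER_MINUTE, ASSUMED_TOKENS_PER_WORD)

print(f"model: {model_rate:.1f} tokens/s")
print(f"human (assumed): {human_rate:.2f} tokens/s")
print(f"headroom: {model_rate / human_rate:.1f}x")
```

    Under these assumptions the model decodes text tokens well faster than a person speaks them, which is consistent with the guess of "fewer than 34".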

Kinrany 2 years ago

I wonder if it can translate from English into English Spoken By Five Year Old

zb3 2 years ago

Hey Google, what about finally giving me the access to MusicLM?

villgax 2 years ago

What a joke: 8 billion parameters to gain 1 percent over the 1.5B of the largest Whisper model.

ChatGTP 2 years ago

I can't wait till everyone is using this and we have absolutely zero idea whether or not it's actually translating things correctly or using its own interpretations of things, going to be...awesomeeeeeeeee!

  • famouswaffles 2 years ago

    SOTA bi/multilingual LLMs with good enough representation of the languages (takes much less data than you'd think) are human-level translators. Hallucinations on tasks like summaries, translations, etc. are near non-existent.

    • ChatGTP 2 years ago

      Thank you for reminding me that it's going to be, awesommeeee.

      Curious, do you speak more than one language?

      Edit: I just had a look at your comment history, do you realize you're like, incredibly pro LLM? Do you just scour HN looking for LLM articles and comment on them in a positive way? Not having a poke it's just interesting how keen you are.

      • famouswaffles 2 years ago

        Yes. And although it's not the language I'm familiar with, I tested GPT and GLM-130b on Mandarin also.

      • hfhdjdks 2 years ago

        Are you American by any chance?

        Over here people speak multiple languages. I doubt we'll run out of people that speak multiple languages just because there's a language model that can do great translations.

      • famouswaffles 2 years ago

        Your comment history is fairly LLM skeptic. I'm not sure what that has to do with anything. The only difference in this instance is that I've actually tested GPT-4 on translations while you haven't.

        If you're going to rag on a product's capabilities on x, you'd think the least you could do is use it for x first.

        • ChatGTP 2 years ago

          How on earth do you know what people have or haven’t done?

          Are you spying on everyone ?

          • famouswaffles 2 years ago

            It's obvious you haven't lol. Your comment reads like someone who hasn't, and you never bothered saying you had but just didn't agree on the issue of quality. Even now, your defense isn't "but I have", it's "how do you know I haven't?", a telltale sign of someone who actually hasn't bothered.

      • blovescoffee 2 years ago

        I share the same sentiment as the original commenter and I speak more than one language. Why do you ask?

        • ChatGTP 2 years ago

          Because virtually everyone tests these things with two languages they're familiar with, else you couldn't really verify if it was correct or not. For languages you're not familiar with, you don't have the "mental model" to talk using a translator. That is, there is more to this than just "talking": there are cultural norms, local dialects, slang, etc. which are to be respected when learning and speaking languages with native speakers. When a person who speaks English and Italian tests these things, they know what they're in for and compensate a bit.

          Google Translate screws up for me really, really hard sometimes when I'm speaking Korean but I'm already a pretty strong speaker, native so I know how to work with the screw ups...and laugh about the really bad ones. I'm not going to go into a meeting and blast off with an auto-translator without understanding what I'm saying or have someone to make sure I'm saying the right thing by talking with them first.

          I personally wouldn't feel comfortable using something like this for anything of real significance, a really good translator can ensure the message gets delivered.

          • seanthemon 2 years ago

            Do you use Google Translate for anything of significance?

            • ChatGTP 2 years ago

              Only because I mostly understand the language I'm working with. This is why I know about these problems.

              I'd never just go to somewhere exotic and rely on it for anything significant based on my existing experience with these technologies.
