200ms Voice LLM
I hope this catches on and can be transplanted into any LLM soon. Instant voice integration is poised to unlock a lot of UX possibilities, like an auto-pilot in cars that makes you a genuinely powerful co-pilot, where you can suggest fine-tuned settings on the fly for the current situation, or anything else that changes lots of settings at once with visual feedback.
Nio’s built-in assistant uses ChatGPT with Whisper (I think) and is almost real-time already. It can change settings like that and rarely misses a beat.
I expect once GPT-4o becomes available it will be awesome (even more so if they “unlock” it to be asked generic questions and hold a conversation).
That was amazing and so productive. Looking at the transcript, I was able to cover so much more ground than if I had been stuck typing all my questions on a keyboard. I can imagine a future in offices where people have AI rooms that you go to not for a meeting with other people but to have a conversation with an AI.
Cannot wait for an instant translator: something that can translate synchronously (!) from a language it hears into a language that I understand. Getting closer!
I don't understand how that could possibly work. You will always need some time window, won't you? Also, sometimes the translation of the first word can only be inferred once you process the last word in a sentence (simplified example).
There are people who do this, and that is synchronous enough. So indeed, maybe no Star Trek level mouth-moves-and-translation-comes-out, but with 200ms we are getting close to being able to speak perfectly fine with someone whose language we don’t speak.
I've experienced human live translators, and more often than not they paraphrase a sentence as a whole after it is uttered. I couldn't imagine otherwise, because you can't start translating until the sentence is finished by the original speaker. With human translators there's the additional challenge of translating/speaking while listening.
Yes, exactly, and the EU holds meetings this way, so it seems good enough already. I don't think that in practice, let's say on holiday, you'd need anything faster than 200ms for the first token. Of course, more speed would probably mean a better experience, but it is already well within the realm of fast enough. Accuracy matters a lot more.
You mean the job of a translator is by definition impossible?
Edit: Yes, for sure there will be a delay of a few words or even one or two short sentences. Just like human translators. Not a problem I think.
Edit 2: Very curious why my first comment was downvoted.
You can’t translate word by word. You need entire phrases, sometimes pretty long ones, to understand the meaning.
Sometimes, but in a constrained situation, e.g. at a hotel reception desk, maybe not.
So what would you do if translating from a language that puts the main verb at the end of a sentence into a language that doesn't...
Feels like there will be plenty of cases you can't just get around.
Not to mention that sometimes you need more than one sentence to get the full meaning.
Obviously translation is possible, but not "synchronously" as you wish. From what I understand, 200ms is the time to first token, so I still wonder how that works, because, as I wrote in my comment above, there is typically no one-to-one correspondence of words/tokens from one language to another. (I didn't downvote your comment.)
Cool!
HF Transformers is great for prototyping and research, but shouldn't an interactive tool like this be based on something more speed-focused, like llama.cpp?
Any plans for languages beyond English?
We're running it on vLLM and are working with others in the community to bring it to other optimized inference frameworks.
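For anyone who wants to poke at that, a minimal sketch of vLLM's offline Python API looks like the following; the model name is a placeholder, and a real speech model would also need audio tokenization in front of the text loop, which this leaves out:

    from vllm import LLM, SamplingParams

    # Placeholder checkpoint name -- substitute whatever model you're serving.
    llm = LLM(model="my-org/voice-llm")

    # Low time-to-first-token is what matters for an interactive voice loop;
    # vLLM's continuous batching is aimed at exactly that.
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Hello, how can I help?"], params)
    for out in outputs:
        print(out.outputs[0].text)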
In the same vein, are there any good, low-latency, speech-to-text-to-speech (STTTS?) programs that make use of LLMs or AI?
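Not that I know of as a single package, but the naive chained version is easy to wire up yourself, and it shows why the latency adds up: each stage is a separate round trip, which is exactly what an end-to-end model like this avoids. A sketch using OpenAI's hosted STT/LLM/TTS endpoints (the file paths are placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # 1. Speech to text ("input.wav" is a placeholder path).
    with open("input.wav", "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        )

    # 2. Text to text: feed the transcript to the LLM.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = reply.choices[0].message.content

    # 3. Text back to speech.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=answer
    )
    with open("reply.mp3", "wb") as f:
        f.write(speech.content)

    # Three network round trips in sequence; the cumulative delay is why
    # a native 200ms end-to-end voice model is exciting.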
Yeah this is awesome. Keep reducing that latency, that's the path to the killer assistant.
Your comment doesn't make any sense to me. 200ms is already extremely low latency.
Freudian slip?