200ms Voice LLM
I hope this catches on and can be transplanted into any LLM soon. Instant voice integration is poised to unlock a lot of UX possibilities, like an auto-pilot in cars that makes you a genuinely powerful co-pilot, where you can suggest fine-tuned settings on the fly for the current situation, or anything else that changes lots of settings at once with visual feedback.
Nio’s built-in assistant uses ChatGPT with Whisper (I think) and is almost real-time already. It can change settings like that and rarely misses a beat.
I expect once GPT-4o becomes available it will be awesome (even more so if they “unlock” it to be asked generic questions and hold a conversation).
That was amazing and so productive. Looking at the transcript, I was able to cover so much more ground than if I had been stuck typing all my questions on a keyboard. I can imagine a future in offices where people have AI rooms that you go to not for a meeting with other people but to have a conversation with an AI.
Cannot wait for an instant translator: something that can translate synchronously (!) from a language it hears into a language that I understand. Getting closer!
I don't understand how that could possibly work. You will always need some time window, won't you? Also, sometimes the translation of the first word can only be inferred once you process the last word in a sentence (simplified example).
There are people who do this, and that is synchronous enough. So indeed, maybe no Star Trek level mouth-moves-and-translation-comes-out, but with 200ms we are getting close to being able to speak perfectly fine with someone whose language we don’t speak.
I've experienced human live translators, and more often than not they paraphrase a sentence as a whole after it is uttered. I couldn't imagine otherwise, because you can't start translating until the sentence is finished by the original speaker. With human translators there's the additional challenge of translating/speaking while listening.
Yes, exactly, and the EU holds meetings this way, so it seems good enough already. I don't think that in practice, let's say on holiday, you'd need anything faster than 200ms for the first token. Of course, more speed would probably mean a better experience, but it is already well within the realm of fast enough. Accuracy matters a lot more.
You mean the job of a translator is by definition impossible?
Edit: Yes, for sure there will be a delay of a few words or even one or two short sentences. Just like human translators. Not a problem I think.
Edit 2: Very curious why my first comment was downvoted.
You can’t translate word by word. You need entire phrases, sometimes pretty long ones, to understand the meaning.
Sometimes, but in a constrained situation, e.g. at a hotel reception desk, maybe not.
So what would you do if translating from a language that puts the main verb at the end of a sentence into a language that doesn't...
Feels like there will be plenty of cases you can't just get around.
Not to mention that sometimes you need more than one sentence to get the full meaning.
Obviously translation is possible, but not "synchronously" as you wish. From what I understand, 200ms is the time to first token, so I still wonder how that works, because, as I wrote in my comment above, there is typically no one-to-one correspondence of words/tokens from one language to another. (I didn't downvote your comment.)
Cool!
HF Transformers is great for prototyping and research, but shouldn't an interactive tool like this be based on something more speed-focused, like llama.cpp?
Any plans for languages beyond English?
We're running it on vLLM and are working with others in the community to bring it to other optimized inference frameworks.
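For anyone who wants to poke at that, a minimal sketch of vLLM's offline Python API looks like the following; the model name is a placeholder, and a real speech model would also need audio tokenization in front of the text loop, which this leaves out:

    from vllm import LLM, SamplingParams

    # Placeholder checkpoint name -- substitute whatever model you're serving.
    llm = LLM(model="my-org/voice-llm")

    # Low time-to-first-token is what matters for an interactive voice loop;
    # vLLM's continuous batching is aimed at exactly that.
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Hello, how can I help?"], params)
    for out in outputs:
        print(out.outputs[0].text)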
In the same vein, are there any good, low-latency, speech-to-text-to-speech (STTTS?) programs that make use of LLMs or AI?
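Not that I know of as a single package, but the naive chained version is easy to wire up yourself, and it shows why the latency adds up: each stage is a separate round trip, which is exactly what an end-to-end model like this avoids. A sketch using OpenAI's hosted STT/LLM/TTS endpoints (the file paths are placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # 1. Speech to text ("input.wav" is a placeholder path).
    with open("input.wav", "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        )

    # 2. Text to text: feed the transcript to the LLM.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = reply.choices[0].message.content

    # 3. Text back to speech.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=answer
    )
    with open("reply.mp3", "wb") as f:
        f.write(speech.content)

    # Three network round trips in sequence; the cumulative delay is why
    # a native 200ms end-to-end voice model is exciting.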
Yeah this is awesome. Keep reducing that latency, that's the path to the killer assistant.
Your comment doesn't make any sense to me. 200ms is already extremely low latency.
Freudian slip?