Ask HN: Why is there no high quality method for voice control of a PC?

90 points by 4midori 4 years ago · 123 comments


Like many people who have spent decades behind a keyboard, I have RSI (Repetitive Stress Injury), which prevents me from writing code and doing graphic design through the usual keyboard and mouse inputs.

So I have turned to a complex and highly unreliable software stack that provides both voice-to-text and clumsy but limited control of Microsoft Windows, Chrome, etc. This includes Dragon voice-to-text, Voice Computer, and Talon, plus a browser extension and heavy customization.

Users of Dragon will acknowledge that: a) the software is a creaky dumpster fire built on archaic code, and b) there is no viable alternative on the market.

My question is: *how is it that no one has built something better?* The market is huge, and the Natural Language Processing of "OK Google" and Siri is quite refined at this point.

References:

Dragon: https://www.nuance.com/dragon.html

Voice Computer: https://voicecomputer.com/

Talon: https://talonvoice.com/

Klaster_1 4 years ago

Is there high quality voice control for anything? I've only ever tried Google Assistant, and it almost always fails to comprehend queries beyond "timer 10 minutes", like putting a water boiler on a schedule.

  • reaperducer 4 years ago

    I think it depends on how you define "high quality."

    To me, I'll take a restricted set of actions if they work very reliably.

    This is what I had back in the '80s with a Covox Voicemaster plugged into the joystick port of my Commodore 64. It could only understand a few phrases, but I could define those phrases, and it almost always worked.

    If you define "high quality" as being able to respond to a seemingly infinite number of queries, but only understanding and replying correctly occasionally, then Siri is closer to what you want.

    • dublin 4 years ago

      Totally agree. I too had the Covox voice recognizer on an SX-64 doing voice-controlled x10 (and other) home automation in the '80s. The amazing thing is that despite the fact that an RPi 4 has more power than a Cray had back then, modern voice recognition really isn't much better than it was then. (Although it was pretty speaker-dependent...)

      I have a handful of Echo Dots and Shows in places I don't mind the security risk, and they are maddeningly incompetent at doing anything in the real world other than telling the weather and acting as a voice-controlled radio (their main use...)

      It would be interesting to go back to the Covox approach and rebuild it for today's tech from the ground up (shouldn't need the hardware anymore...), as it worked surprisingly well on computers that had fewer resources than many (most?) of today's microcontrollers...
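
      For what it's worth, something in that spirit can be sketched today with an open engine like Vosk by constraining recognition to a small, user-defined phrase list (model path and phrase list below are placeholders; requires a downloaded Vosk model plus the vosk and pyaudio packages):

        import json
        import pyaudio
        from vosk import Model, KaldiRecognizer

        PHRASES = ["lights on", "lights off", "open the door", "play radio", "[unk]"]

        model = Model("model")  # path to an unpacked Vosk model directory
        rec = KaldiRecognizer(model, 16000, json.dumps(PHRASES))  # grammar-constrained

        mic = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1,
                                     rate=16000, input=True, frames_per_buffer=8000)

        while True:
            data = mic.read(4000, exception_on_overflow=False)
            if rec.AcceptWaveform(data):
                phrase = json.loads(rec.Result()).get("text", "")
                if phrase and phrase != "[unk]":
                    print("command:", phrase)  # dispatch to home automation here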

      • passerby1 4 years ago

        It's probably not fully correct, but it feels like the motivation behind Echo and similar devices is not home automation, but feeding ad networks with personal details about the user.

  • ugjka 4 years ago

    I don't remember the last time I had such rage as when I tried to voice search for "French Horn Rebellion - This Moment"[1] on my Android TV because I didn't want to type it in with the remote. I'm also not a native English speaker.

    [1] https://www.youtube.com/watch?v=4khlVbakV_Q

    • herbst 4 years ago

      I feel you. My TV seems to understand whatever it wants. But for whatever reason the talk button is super prominent on the remote.

  • imglorp 4 years ago

    GA is also tightly limited by business constraints.

      "Send a slack to my wife" -> "Sorry, who do you want to text?"
    
    Multi-fail.

ahelwer 4 years ago

I've heard cursorless (https://github.com/cursorless-dev/cursorless-talon) is good but have never tried it. Syntax-aware voice navigation of code, powered by tree-sitter queries!

I also have a friend who is a gifted programmer who lost his ability to type about a decade ago; he has put together an open-source software stack to help: http://www.cs.columbia.edu/~dwk/

Of course this doesn't really answer your question. But it's a hard problem, and you're basically forced to become a power user to reliably interact with your PC.

  • pokeyrule 4 years ago

    creator of Cursorless here. Happy to answer any questions

    • jiehong 4 years ago

      Seriously cool project!

      This reminds me of easy motion for vim or ace-jump for emacs.

      Do you think it would be possible to have an on-demand contextual hat decoration?

      Like, you say "show hats words" and only words get decorated with hats, and you pick one. It would allow you to maybe show hats only on square brackets or only on function arguments, etc. I find the number of hats with colors a little bit hard to distinguish; if they were contextual, they would require fewer colors or none at all.

      Do you map voice commands to keyboard shortcuts available in vs code, or directly via the apis? (Not sure if there is a difference in the end).

      Now I wish for a Cursorless plugin on the IntelliJ platform.

      • pokeyrule 4 years ago

        Thanks jiehong!

        The reason that the hats are always present is that the way to code faster by voice than by keyboard is to speak fluently, minimising pauses, the way we speak regular human languages. If we had to say a command and then wait for the hats to appear, that would break the chain.

        Re mapping, we use something called the "Command server", which allows us to use file-based RPC to run commands in VSCode. That way it is easy to send more complex commands, which are required by Cursorless.
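
        To illustrate the general shape of file-based RPC (this is only a sketch, not the actual command-server protocol; the file names and fields here are made up):

          import json
          import time
          import uuid
          from pathlib import Path

          # Hypothetical directory watched by an editor extension.
          RPC_DIR = Path("/tmp/editor-rpc")

          def run_editor_command(command, *args, timeout=3.0):
              """Write a request file, then poll for the matching response file."""
              RPC_DIR.mkdir(exist_ok=True)
              request_id = str(uuid.uuid4())
              request = {"id": request_id, "command": command, "args": list(args)}
              (RPC_DIR / "request.json").write_text(json.dumps(request))

              response_path = RPC_DIR / "response.json"
              deadline = time.time() + timeout
              while time.time() < deadline:
                  if response_path.exists():
                      response = json.loads(response_path.read_text())
                      if response.get("id") == request_id:  # ignore stale responses
                          return response.get("result")
                  time.sleep(0.01)
              raise TimeoutError(f"no response to {command!r}")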

        IntelliJ support is definitely one of the most requested features; once I'm done rewriting some of the core engine I'll probably take a swing at that. Here's the issue that tracks extracting cursorless into a node.js server so that it can be used by other editors: https://github.com/pokey/cursorless-vscode/issues/435

  • maxore44 4 years ago

    I am a cursorless user myself. Using dictation software for programming is actually relatively fast when you get used to it, but editing code (which is how most of us spend the majority of our time) can be pretty slow. Cursorless was a huge productivity booster for me. It got me to switch from Emacs to VS Code which is saying something.

    • alexhwoods 4 years ago

      Same. Have to disagree though. I've been reintroducing a keyboard here and there, and whenever I have to do something in the VSCode editor, I get frustrated with the speed and end up going back to Cursorless.

      I think it's a lot faster than keyboard / mouse, mostly because of how little moving of the cursor you have to do.

      Could be I was slow to begin with, not super efficient with vim or emacs.

      Also, "editing" is the fastest part for me, due to "bring" and "change". So little movement.

    • jbellis 4 years ago

      What distinction are you trying to make between "programming" (fast) and "editing code" (slow) ?

      • pokeyrule 4 years ago

        I think the distinction is "programming" refers to just dictating some code from scratch. "editing code" refers to changing code that is already written.

        For the former ("programming"), there are many commands that can be used to rapidly output code. For example, a user can say "funky hello world" to get something like

        function helloWorld() {

        }

        But if you're trying to edit code that already exists, it can be a challenge to do so without a mouse and keyboard, in part because you need to do a lot of navigation.

        I personally believe Cursorless solves that problem better than a keyboard and mouse, but I have to imagine I'm a bit biased on that point :)

        • maxore44 4 years ago

          Yes, exactly right. The vast majority of programming is editing code, and cursorless allows you to A) Navigate code quickly and B) do things using less navigation than you could with Keyboard/Mouse.

PaulHoule 4 years ago

Google and Siri are good at what they do. They aren't good at other things, such as dictation.

The big problem I see in voice interaction is that a human being will ask you questions to clarify what you said if they don't understand, and current systems don't even try. (Actually the search paradigm lets you do some refinement; "Ok Google" works amazingly well on Android TV.)

Superhuman accuracy at dictation doesn't translate to a useful ability to understand text. You're doing great if you only garble 1 out of 20 words. Some errors are inconsequential, but if it garbles every other sentence then you are going to feel 0% understood.

  • remus 4 years ago

    > Google and Siri are good at what they do. They aren't good at other things, such as dictation.

    It's interesting you mention that Google isn't good at dictation, as I've found it excellent on the Pixel 6 (maybe the quality varies depending on what hardware you're running on?). If I need to write out anything over a sentence or two on my phone I'll almost always dictate it, and as long as I have a reasonable idea what I want to say beforehand it works well.

    What I personally find a little jarring is that I need to compose what I want to say in my head further in advance than I would if I'm typing, as correcting mistakes is more awkward.

  • Pxtl 4 years ago

    Whenever I use the Google Assistant, I'm shocked by a) How good the speech-to-text is at figuring out my words, and b) How bad the application layer is at using those words

    I tend to over-enunciate, so I don't get many bad bugs in the parsing... but that doesn't stop the Google Assistant from delivering completely the wrong response to the words that it's showing me it has correctly recognized, or simply spinning endlessly and locking up my phone.

    As an industry, we suck at everything. We've solved the hard problem but failed the easy part of "once the command has been parsed, either execute the action or show the user an error and then close the dialog".

    The only thing I find really awful about speech-to-text on Google is that it can't seem to detect punctuation.

    • Noumenon72 4 years ago

      When "OK Google" first came out, I was so wowed and I was constantly going "OK Google, search whatever". Now I use the button to trigger it because it doesn't hear me, and I have to retry a lot of queries -- it just doesn't work as well. Perhaps they made it work great for white males at first but then had to accept a bunch of tradeoffs to get it working for everyone.

  • fouc 4 years ago

    I would imagine GPT-3 or similar would be able to replace the garbled 1 out of 20 words with something that actually makes sense in context.

    • sburud 4 years ago

      Yes, sort of. Thing is, many modern speech models actually learn an internal language model, so we're already kind of doing that. In languages and domains where massive amounts of training data are available (say, grammatically correct English), this internal language understanding is so good you don't need the external model[1].

      On the other hand, throwing an additional language model like GPT or BERT into the mix can help if you don't have a ton of voice data. In my attempt to do this, a large portion of the improvement came from letting the language model read the previous sentences in the conversation[2]. AFAIK most commercial systems are blissfully unaware of your previous sentences, leading to conversations like "set an alarm"/"sure when?"/"eightam"/"your nearest ATM is...".

      A word of caution though: letting BERT/GPT edit the outputs also gives a (potentially) much more dangerous failure mode: if the speech signal is difficult to understand, the resulting transcript will be difficult for humans to identify as transcription failures.

      For example, "yeah, I dunno I haven't..." (read on a noisy phone line in an obscure dialect) was transcribed as "yeah yeah not that is I I am then" by the baseline speech system. After we let BERT edit the outputs, the transcript became "yeah that's not what I was saying...". Which, ironically, was definitely not what the person was saying.

      [1] https://arxiv.org/abs/1911.08460, page 9

      [2] https://arxiv.org/abs/2110.02267

      edit: clarify why previous sentences matter
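
      A bare-bones sketch of that kind of context-aware rescoring, just to make the idea concrete (illustrative only; the model name and interpolation weight are arbitrary, and a real system would rescore the recognizer's full n-best list):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        def lm_score(context, hypothesis):
            # Approximate log-likelihood of the hypothesis given previous turns.
            ids = tokenizer(context + " " + hypothesis, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss  # mean negative log-likelihood
            return -loss.item() * ids.shape[1]

        def rescore(context, nbest, lam=0.3):
            # nbest: list of (hypothesis, acoustic_score) pairs from the ASR decoder.
            best = max(nbest, key=lambda h: (1 - lam) * h[1] + lam * lm_score(context, h[0]))
            return best[0]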

    • bittercynic 4 years ago

      That seems worse to me. If there's going to be a transcription error I'd prefer it to be obvious instead of just changing the meaning of the sentence.

    • willcipriano 4 years ago

      How do you know what word is garbled?

      • robbedpeter 4 years ago

        Grammar and context. It'd be closer to dictation than current speech to text, with gpt serving as a "brain" interpreting what you mean in the current context instead of raw input. You could tie in the "natural language to [sql,bash,log parse, regex]" capabilities of gpt-3 and so on.

        Obviously it wouldn't be as good as a real person, but it'd be a nice leap to the 95%+ level of accuracy over the 80%ish on high performing commercial STT systems.

      • warrenm 4 years ago

        ...and how do you know which word you meant (even if it's not garbled)?

        The number of homonyms (and near-homonyms) in English is huge.

        It's been a major issue for some users of W3W (eg https://cybergibbons.com/security-2/why-what3words-is-not-su...)

dataangel 4 years ago

I have an RSI and I've been coding by voice exclusively for about 7 years. I used a system built on top of Dragon for most of that and in the last year switched to Talon.

I think there are multiple reasons:

* The obvious market is dictation of natural language, but this isn't what you want for voice control. If you try to use long descriptive phrases as your command language everything takes forever. So instead you end up making your own mini command language where all of your common actions are a single syllable, but now it's no longer the English or other natural language that users already know. So now your product has a substantial learning curve, just like learning a new keyboard layout.

* Everything other than talon has terrible latency. Most existing speech recognition engines were not designed with the kind of latency you want for quick one syllable commands.

* In order for it to be really effective you need the cooperation of applications (this is why I've written extensive emacs integration). Some tools like Windows Speech Recognition try to hook in at the UI layer in order to figure out what text is in dialog boxes and such, but in practice they seem to do a pretty terrible job. Windows Speech Recognition has a very hard time consistently understanding what links you are trying to get it to click on, for example. There's also a long tail of applications that just do their own custom UI rendering inside a blank canvas where no hook is possible.

* Good speech recognition, even if not specifically targeting computer voice control, is a genuinely hard research problem, and standard benchmarks for accuracy are misleading. You see "95% accuracy" and you think, wow, that's a high percentage, computers almost have this speech recognition thing solved; then you think about it harder and you go, wait a minute, that's one mistake every 20 words! Maybe you are still impressed, but then you have to take into account that when the computer does the wrong thing you'll need to issue more commands in order to correct it, which are also likely to be misinterpreted. When you make a typo with a keyboard the mistakes rarely cascade; you just hit backspace.

  • daanzu 4 years ago

    "Everything other than talon has terrible latency": False! I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend, which has extremely low latency. You can adjust how aggressive the VAD (voice activity detection) is to suit your preference, but the speech engine latency can be almost negligible, especially for voice commands (vs prose dictation). However, I agree that "most existing speech recognition engines were not designed with the kind of latency you want for quick one syllable commands", and that low latency is pivotal to being productive with voice commands. I also agree with your other points.

    • tdj 4 years ago

      I built a similar app using a Kaldi nnet3 model running embedded; the thing was so responsive that our demo to an SVP went sideways: when he gave a query, the app responded nearly immediately after the sentence ended. The SVP did not realize it had already responded, as the expectation for voice interaction systems was that it takes like 2-5 seconds to get an answer, which gave the impression that the system did not work properly.

      So, moral of the story: if you do too good a job of making a fast speech engine, especially for multi-turn dialogues, add some delays so it resembles human dialogue more.

    • dataangel 4 years ago

      Sorry, should have said everything I have tried :)

      At some point when I have enough free time I will have to take a look at this! Thanks for putting time into this kind of thing!

floatingatoll 4 years ago

Siri can’t understand “set a timer” more often than 3 in 4 tries for me, and any sentence with more than four words will have one error in it no matter what. I envy you the accuracy your voice assistants offer you, but for me, voice control makes me want to snap my phone in half from frustration at how terrible it is. I still can’t remember why I have a reminder set with the name “2910”, which is the transcription of my spoken English sentence at the time. So at the very least, I imagine the holdup is that voice control failure conditions are miserably bad, when it fails; and, “Delete this sentence” -> “Formatting C:\” misunderstandings are too easy in modern OSes still. (Windows still offers “Format” as a primary context menu choice on the boot hard drive!)

  • newsbinator 4 years ago

    I use "wake me up in x minutes". Siri always understands that 100%.

    So to set a kitchen timer: "wake me up in 11 minutes"

kbenson 4 years ago

For anyone interested in this topic, this tech talk[1] from Emily Shea might be worth a watch. In it she demos a tech stack similar to what's mentioned here, to fairly good effect. It does appear that it required a lot of tweaking on her part and is optimized for writing code, and I'm not sure how well it functions in more general contexts.

1: https://www.youtube.com/watch?v=YKuRkGkf5HU

WorldMaker 4 years ago

The last time I tried Dragon it was just a fancier (bloatware) UI built directly on top of Windows Voice Recognition (and IMO not adding much value on top of it): https://support.microsoft.com/en-us/windows/use-voice-recogn...

Windows Voice Recognition has been around forever (out of the box since XP); its UI is "serviceable" but not great. (It was slightly better when Cortana was briefly "out of the box" in Windows 10, but has reverted some since.) But I don't think you need to pay for Dragon (or its high memory consumption) if you don't mind taking the time to learn the quirks of Windows Voice Recognition directly. Most of Dragon's quirks are Windows' quirks anyway, papered over with a UI that makes it seem like they are adding value.

Also yeah, one of the answers to "how is it that no one has built something better?" is: Well, Microsoft tried with Cortana, got a huge blowback that "no one" wanted Cortana on their PCs, and gave up.

  • Stevvo 4 years ago

    Not sure where you got that idea. Dragon predates Windows and has always used its own models.

    It works very well for some people; many have written books with it.

calchris42 4 years ago

Wow, so many replies that boil down to “because typing is better, you should just type”.

This is fairly insulting as RSI’s are very much a real thing.

Does this community also think that wheelchair ramps should never be invested in because stairs are clearly superior?

I’d rather see the brain power in this community focused on solutions. Keyboard + mouse have lasted so long because they work surprisingly well, but I hope there is a day that we dream up something better that does not require slowly giving ourselves carpel tunnel.

daanzu 4 years ago

I have been coding entirely by voice for approximately 10 years now (by hand long before that). Most of that time I have been using the Dragonfly (https://github.com/dictation-toolbox/dragonfly) library to construct my own customized voice coding system. The library is highly flexible and open source, allowing you to easily customize everything to suit what you need to be productive. It is perhaps the power user analogue to Dragon NaturallySpeaking. With it, you can certainly be highly productive coding by voice. However, it does require work to set up and customize to suit you, so it isn't really for the "general population" of computer users to just sit down and use. With regard to accuracy of speech recognition, being open allows you (with sufficient motivation) to train a custom acoustic speech model that recognizes your voice specifically extremely well.
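
To give a flavor, a minimal Dragonfly grammar looks roughly like this (a sketch only: engine setup is simplified, the kaldi backend normally needs a model directory and loader script, and the mappings are just examples):

  from dragonfly import Grammar, MappingRule, Key, Text, Dictation, get_engine

  class EditingRule(MappingRule):
      # Short spoken phrases mapped to keystrokes or text; you define the
      # vocabulary, so common actions can stay one or two syllables.
      mapping = {
          "save": Key("c-s"),
          "undo that": Key("c-z"),
          "new line": Key("end, enter"),
          "say <text>": Text("%(text)s"),  # free-form dictation slot
      }
      extras = [Dictation("text")]

  engine = get_engine("kaldi")  # or another supported backend
  engine.connect()

  grammar = Grammar("editing")
  grammar.add_rule(EditingRule())
  grammar.load()

  engine.do_recognition()  # block and process speech until interrupted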

Regarding the software packages you referenced: Yes, Dragon is trash that I want nothing to do with, because of its inefficient interface, its complete inability to accurately understand my voice, and its generally shoddy software quality. Voice Computer (which I hadn't seen before) is therefore eliminated as well, though it doesn't look terrible as a front end to Dragon to better use the OS GUI-accessibility info. Many people like Talon, but I demand something open, which I can modify to suit my needs.

Background: I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend usable by Dragonfly, itself entirely by voice. There's also a community of voice coders using Dragonfly and other tools that build on top of it, such as Caster (https://github.com/dictation-toolbox/Caster).

  • newusertoday 4 years ago

    What speech recognition engine do you use for your personal setup? Is it Kaldi? (Assuming you helped develop it :-) )

    • daanzu 4 years ago

      Yep, I have been using my Kaldi backend through Dragonfly exclusively ever since I got v0.1.0 working.

      I bootstrapped writing it initially using the Dragonfly WSR (Windows Speech Recognition) backend, because that gave me the best accuracy out of the available options at the time. All of my development of it since the initial working version has been done using each previous version, so now it has basically bootstrapped itself. My productivity skyrocketed once I switched to Kaldi, due to being able to use a custom speech model trained just for my voice, for orders of magnitude better accuracy, plus dramatically lower latency. (And it freed me from being dependent on closed software out of my control.)

      I bootstrapped my personal speech model by retaining recordings of the commands I spoke while using WSR. My voice is quite abnormal, and it took only 10 hours of speech data to train a model dramatically more accurate than any generic model I've ever used. And of course, I retain much of my usage now with Kaldi, so my model improves more and more over time. A virtuous flywheel!

phkahler 4 years ago

I would like to see Linux lead here. Have a standard voice interface where a voice-to-text process feeds a stream of text to the DE, which can then forward it to the active application (as text). I want this to be a separate "voice" stream so it is not confused with the keyboard. This would allow the eventual creation of a voice assistant at the system level, but also allow individual applications to adopt voice commands starting now. IMHO this should be like version 1 of the concept and it should last a while until we figure out what all is possible and which use-cases need a design change.

Simple dictation could be done at the DE level, where the VtoT stream would be diverted to the keyboard input of the active app. It could also be done at the app level, but this is one feature I think belongs a level up so it can be used by non-voice enabled apps.
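
As a rough sketch of the plumbing (purely illustrative; the socket path and the line-oriented protocol are invented for the example):

  import os
  import socket

  SOCKET_PATH = "/run/user/1000/voice-stream.sock"  # assumed location

  def serve_voice_stream(recognized_utterances):
      """Publish recognized text, one utterance per line, as a dedicated voice stream."""
      if os.path.exists(SOCKET_PATH):
          os.remove(SOCKET_PATH)
      server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      server.bind(SOCKET_PATH)
      server.listen(1)
      conn, _ = server.accept()  # e.g. the DE's voice-routing service connects here
      for utterance in recognized_utterances:
          conn.sendall((utterance + "\n").encode("utf-8"))
      conn.close()

  # The DE side would read lines from the socket and either forward them to the
  # focused application as text or interpret them as commands, without ever
  # mixing them into the keyboard stream.
  if __name__ == "__main__":
      serve_voice_stream(["open terminal", "hello world"])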

  • bool3max 4 years ago

    Who do you expect to actually work on this? Billion dollar companies can't get voice controls right. FOSS DEs struggle with keyboard/mouse input, let alone voice.

laserbeam 4 years ago

The problem is, even if you do build amazing speech to text, it will be slower and less expressive than a keyboard + pointing device (mouse, touch, pen).

For keyboards, you lose positional logic (WASD in games). You lose shortcuts. You lose control over capitalization and formatting. You lose punctuation. You lose non-text input (code; dictating code sounds like a horrible pain). You lose function keys. And, of course, you lose speed (think of instant things you do with shortcut keys, like alt-tab). Not to mention that you lose the ability to work in silence.

Make the recognition quality gorgeous, and it will still be a less flexible product than what we use today. It has value for accessibility, but people will likely choose keyboards over dictation based on UX alone.

  • mikob 4 years ago

    We can choose what's best for the task at hand. In the same way most people don't use the mouse to click an on-screen keyboard, most people won't use voice control to type WASD in-game.

    Dictation, for instance, is an easy-win for voice input. Clicking buttons can be more convenient with voice when we're talking to Smart TVs or, perhaps, if our hands have pizza grease all over them and we don't want to touch the keyboard.

GeeJay 4 years ago

Voice control of ordinary computer navigation and of program writing and testing has gone nowhere in 30 years. https://dilbert.com/strip/1994-04-24

MattGaiser 4 years ago

Part of the problem is that context is something AI is bad at and instructions are highly context dependent.

https://www.youtube.com/watch?v=FN2RM-CHkuI

boomka 4 years ago

There are some tools, but I think the reason they will never become widespread or high quality is that voice is just not a great medium for conveying that type of info in the first place. If I type a sentence and then decide to make a correction, it is very difficult to explain in words but very quick to click and retype. If I want to position my window somewhere, I wouldn't even want to start thinking about how to explain it; I would just click and drag. And so on and so forth. This limits any potential markets for such tools greatly, so there is little economic incentive to develop them into anything truly high quality.

nosianu 4 years ago

Voice input is good for high level tasks and goals, requiring a high level comprehension.

For detailed work though the more direct method of translating movements is far more efficient.

When you can describe an abstract end goal, voice is great. When you have to actually do all the individual steps towards some high level goal, then it's like talking a newbie programmer through some high level database optimization. You only use voice there because your main goal is to teach someone. If the PC could be taught that way, then voice would be in demand for such tasks too.

  • sapiol 4 years ago

    Offtopic: Hi, I saw your comments in some older thread about chelation (Cutler Protocol). I too am from Germany and have some questions about your chelation protocol. Unfortunately, I can't reply anymore on that other thread. Can you contact me at 1u3_2d227vh7iadt@byom.de ?

  • sapiol 4 years ago

    Offtopic: Hi, again. That didn't work as there is a 30m time-limit on byom.de. Can you please contact me again, but this time here: D-8ynpb9p087ukef2v@maildrop.cc

    • nosianu 4 years ago

      Just get a regular account instead of those public services. Some random name at gmx.de, for example. With the "d-" prefix the mail never appears; without it, it works, but the mail body was "undefined" when I checked what that service showed. I deleted it again.

      I also have such a throwaway-but-real account in my "About" under my user name here, just added it. Should have done that anyway, you just reminded me that I should.

6gvONxR4sf7o 4 years ago

Once automatic speech recognition (ASR) gets closer to bullet-proof, I expect this to become a huge thing, but right now, it seems like you're getting better error rates than typical.

Any input method where you frequently have to repeat yourself and undo things won't get mainstream. I'd bet people's mainstream tolerance for errors would have to be like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.

  • wmf 4 years ago

    The problem is that bulletproof speech recognition will only be available as a cloud service and maybe only wrapped in a Siri-style "assistant" UI. You probably won't be able to use it to replace things like Dragon.

viro 4 years ago

... because it's not a good way to control a computer?

  • adolph 4 years ago

    > ... because it's not a good way to control a computer?

    This comment speaks to a perception problem for aural methods. The state of the mainstream art doesn't seem much past Forstall's demo of 10 years ago. [0] Are generations of people accustomed to WIMP UI able to wrap their heads around a much smaller interaction set? [1]

    Gentner and Nielsen's work described in "The Anti-Mac Interface" [2] speaks to some of the differences people will have to mentally bridge such as:

      Mac | Anti-Mac
      Direct Manipulation | Delegation
      See and Point | Describe and Command
      WYSIWYG | Represent Meaning
      User Control | Shared Control
      Feedback and Dialog | System Handles Details
      Forgiveness | Model User Actions
    
    0. https://www.youtube.com/watch?v=SpGJNPShzRc

    1. https://en.wikipedia.org/wiki/Post-WIMP

    2. https://web.archive.org/web/20120904231532/http://www.useit....

  • falcolas 4 years ago

    Why not? "Open hacker news", "go back", "Play liked playlist in Spotify"

    Seems fairly reasonable. It need not be the only way, but not having to use my mouse to do stupidly simple tasks wouldn't break my heart.

    • quartesixte 4 years ago

      The problem, I think, comes from the wide gap in efficiency levels across different ways people use computer-human interfaces.

      Take “Open Hacker News” for example. One user might Click Browser > Open bookmarks tab > “Hacker News”.

      Another, having set up a series of hotkeys, will go (on a windows machine, taskbar set for Browser pinned in position 1):

      Win+1 > Ctrl+3

      That is incredibly fast, much faster than saying it.

      My guess is that much of the software engineering world is either users who can do the first very quickly or don’t find it cumbersome, or users who set up hotkeys like the latter and will outrace the speed of human speech on any given day. Thus the problem gets little attention.

    • viro 4 years ago

      My first guess would be that "Open hacker news" requires clear audible speech, while the KB method just requires pressing 'h' and 'enter'. Also, non-cloud speech recognition only recently got decent.

      • falcolas 4 years ago

        I pressed 'h', and it didn't do anything.

        Context matters with such shortcuts. Even if I make sure that I have the location bar selected, focused and clear, 'h enter' takes me to a completely different website - because hacker news is actually 'news.ycombinator.com', so it's not the default when typing 'ctrl-k h enter'.

        And let's be frank. Even if we do get voice control, it's not going to somehow take our keyboards and mice away from us.

        • contextfree 4 years ago

          In principle there could be voice shortcuts; currently there seems to be an expectation that voice interfaces should be entirely limited to natural language words and sentences, but if we're willing to let this constraint go (at least for "power user shortcuts") and just design bespoke syllables in IPA or whatever, we could probably come up with something more efficient.

          It also ought to be possible to specifically design interleaved voice+keyboard, voice+mouse, voice+touch, voice+pen, etc. interactions that could be more expressive and efficient than either input method by itself.

mertd 4 years ago

As far as human computer interfaces go, keyboard and mouse probably win comfortably in both bandwidth and latency against speech to text in almost all tasks. The former also requires less physical effort and creates less noise for others. My guess is that this shrinks the demand for good quality voice HCI significantly, and those who really need it end up being overlooked.

  • mikob 4 years ago

    You're limiting your thinking to the paradigm of visual interfaces paired with a mouse and keyboard. When all you have is a hammer...

    Here are some examples where speech wins on bandwidth and latency:

    1. "Play here comes the sun" vs. opening spotify, waiting, clicking the search box, typing here comes the sun, pressing enter, waiting, scanning the page and clicking the right song.

    2. "Send email to John asking him if he would like to Play golf" vs. opening Gmail, waiting, clicking compose, start typing john, click the right email, tab to subject... etc.

    There are cases where keyboard and mouse input is better... e.g. editing text, graphics production and editing, etc. But certainly not in "almost all tasks" as you say. I think speech is the 3rd big computer interface that complements the mouse and keyboard and will make computers more productive and convenient for everyone, regardless of whether you have a disability.

    • warrenm 4 years ago

      > 2. "Send email to John asking him if he would like to Play golf"

      Which John? Which of that John's contact points you have saved?

      ..and why don't you have the keyboard shortcuts for those actions committed to muscle memory by now?

  • D13Fd 4 years ago

    Agreed. And it’s not just less noise, there is a privacy component to it. I don’t really feel like broadcasting what I am doing to anyone within earshot.

  • 6gvONxR4sf7o 4 years ago

    Keyboard and mouse certainly don't beat voice for bandwidth (assuming error-free ASR, which doesn't exist today).

    • falcolas 4 years ago

      This guy's not wrong. You can speak clearly and comfortably at 250 words per minute. Most folks will type at less than half that.

      Even shortcuts (which peer comments are relying upon) aren't all that fast - they require additional selection movement with the keyboard or mouse before they can be used.

      • mertd 4 years ago

        People do much more than narrating natural language. They navigate menus, highlight text, launch apps, type commands on the terminal etc... I don't see how voice can best keyboard and mouse when considering all interactions.

        • falcolas 4 years ago

          Strange, I can see it with no problems. Probably because I use VIM quite a bit, which makes use of fairly natural language gestures.

          Copy two words

          Select line

          Paste before word

          etc.

          Opening apps is even simpler: "open spotify". Compare the complexity and time required to say those two words against moving your hand to the mouse, moving the mouse to a 100x100 pixel target, and clicking twice within 100ms. Even compare it against using "Cmd-Space Spotify".

          It'd require a learning period, but so does - for example - teaching the concept of the mouse to someone who's only ever used a tablet.

          EDIT: And I'll copy this from another of my posts - getting good voice control won't take our keyboards and mice away from us.

    • warrenm 4 years ago

      Sure they do: `cp file1 file2`

      Vs properly enunciating "Kah-Pee f-i-l-e-1 to f-i-l-e-2"

      • 6gvONxR4sf7o 4 years ago

        When I did hands-free coding, I named my variables things that I could say as words. So you'd be saying 'copy file-num-one file-num-two' or something, rather than spelling it out letter by letter. I actually ended up naming things more verbose names because I didn't have to type it all out. So it might be:

        enunciating: 'copy snake-geary-street-financial-report snake-divisadero-street-financial-report'

        versus typing: 'cp gearyStreetFinancialReport divisaderoStreetFinancialReport'

        If you're trying to exactly replicate something designed (and named) for text input, you're absolutely right, but I thought we were talking about hypothetical designed-for-voice systems.

        • warrenm 4 years ago

          tab completion handles goofy and long file names quite handily ... and a lot faster than speaking

          • 6gvONxR4sf7o 4 years ago

            Tab completion relies on a limited context. If you're trying to type gearyStreetFinancialReport and the two names in context are gearyStreetFinancialReport and unrelated, you're right, but if there's a very large number of choices, it benefits you less. And new names aren't going to be in context, so even in the best case of my example, you're going to end up typing:

            'cp g-[TAB] divisaderoStreetFinancialReport'

            I'd expect that to be an advantage of voice stuff; that you can go fast in new kinds of large scope contexts, maybe even whole-machine context. A system designed from the ground up could exploit that in interesting ways.

            • warrenm 4 years ago

              And now you also have camelcaps and other goofy spellings to worry about

              typing and shell help is always going to be faster than speaking

              `c g-[TAB] g-[TAB]` then replace the couple characters at the front with 'divisadero'

              there's no way you can do that faster speaking

      • falcolas 4 years ago

        I timed myself. 2.18 seconds to say it. Less time than it takes to type it.

        • warrenm 4 years ago

          you must type slow ... because I can type it in under half the time you claim it took you to speak it

liveoneggs 4 years ago

GUIs are unsuitable for anything other than the mouse + keyboard. They are the outputs of their respective inputs.

You need dedicated software built on a hypothetical V(oice)UI to get anything decent.

Otherwise your best bet is to find a mouse/trackball/trackpad/pointerstick/touch-screen/pen that doesn't injure you and use speech-to-text in simple text editors.

SlogMaverick 4 years ago

I've kept this bookmarked for when this eventually happens to me. https://arstechnica.com/gaming/2019/04/coding-without-a-keys...

maxwelljoslyn 4 years ago

I'm in the same boat as you, OP. Talon has proven a lifesaver ... or at least it promises to be one. I'm still getting used to it.

My finding, for text dictation (not code), is that even halfway decent dictation, such as is available on iPhone, still needs much post-dictation editing. I feel that the biggest impact to be made in this area is superior capabilities for this editing phase.

I summarized and wrote up my thoughts as a grant proposal for Scott Alexander's recent "micro grants" project. Get in touch (email in my profile) if you want to read that, or if you'd like to talk about dictation, voice control, voice coding, and editing operations -- or just get some moral support.

  • maxore44 4 years ago

    I have to plug the cursorless vs code extension to speed up code editing. It was a game changer for me.

ipnon 4 years ago

Speech models today can mine the entire corpus of published conversation and return the most likely response to a given statement. That's not how we converse. Every relationship you have is a little model in your brain that we call a person's "personality." Everyone talks differently, has different frames of reference, uses different codes of language, different assumptions. Cutting edge speech models work perfectly for the perfectly average speaker, but that person does not exist! The farther we stray from the mean, the more alienating these speech models become.

  • kbenson 4 years ago

    > Cutting edge speech models work perfectly for the perfectly average speaker, but that person does not exist!

    This is a well known pit in statistics, I would think, given there are extremely famous stories about this exact issue causing deaths. In the 1950s, the Air Force was trying to figure out why its pilots were dying, and determined it was because cockpit designs based on the "average" pilot were a poor fit for almost every real-world pilot.[1]

    1: https://www.thestar.com/news/insight/2016/01/16/when-us-air-...

  • cptaj 4 years ago

    I have massive issues with speech recognition software. It doesn't work for me in either English or Spanish. Statements like "Google and Siri are so advanced now" feel like people are collectively pranking me.

    That said, I too have wondered why we don't have speech control for computers or at least appliances.

    You don't need to parse all language. Just a standard set of primitives like you'd find on a remote should be way easier to recognize and can even be selected for their ease of parsing. Simple things like on, off, next, back, louder, etc.

    • ipnon 4 years ago

      An interesting project: automatically convert a terminal command's `--help` page to a speech model. Run that over $PATH, then you never have to type again!
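
      A toy sketch of the first step (assuming long flags appear as "--flag" in the help text, and treating the "speech model" as just a list of speakable phrases that could feed a grammar-constrained recognizer):

        import re
        import subprocess

        def speakable_commands(command):
            """Scrape `command --help` and turn its long flags into spoken phrases."""
            help_text = subprocess.run(
                [command, "--help"], capture_output=True, text=True
            ).stdout  # some commands print help to stderr; ignored here
            flags = sorted(set(re.findall(r"--([a-z][a-z0-9-]+)", help_text)))
            # "--human-readable" becomes the phrase "ls human readable", etc.
            return [f"{command} {flag.replace('-', ' ')}" for flag in flags]

        if __name__ == "__main__":
            for phrase in speakable_commands("ls"):
                print(phrase)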

smorgusofborg 4 years ago

If I had to program with audio, I would make a steno dictionary with a theory that results in a pronunciation that is sufficiently different from normal language, and then speak it instead of chording it.

The complexity of doing that is IMO a good explanation of why commercial audio recognition is worthless to someone who programs a computer instead of interacting with humans over a computer.

http://plover.stenoknight.com/2013/03/using-plover-for-pytho...

mikob 4 years ago

I too noticed that Dragon is trash (2.2/5 rating on the Chrome Web Store, yikes). I've been working on one that's purpose-built for the web. Most software today is moving towards the web, so that's where we narrowly focus. It works everywhere (including HN, Reddit, YouTube, Gmail... even Duolingo).

You can DL it here: https://chrome.google.com/webstore/detail/lipsurf-voice-cont...

alexhwoods 4 years ago

Talon + Cursorless.

People have built the tools you're talking about. They're Talon and Cursorless.

I think you'd be shocked if you saw how productive some people in the Talon community are. Be sure to join the community Slack.

twright 4 years ago

Have you looked at Talon[1] for programming and system control? I used it for a few months last year and while the first two weeks were difficult I was able to nail down a workflow that really suited me. After another few weeks I felt as comfortable and capable working with it as I did a keyboard and mouse. (Cannot attest to its capabilities on Windows)

[1] https://talonvoice.com/

simonblack 4 years ago

JUST IMAGINE THE SCENARIO:

You have just been fired and as the security boys are escorting you to the door, you call out, loud enough to be heard in all the cubicles -

"Computer! Format all drives!"

OR MAYBE THIS OTHER SCENARIO:

The guy in the next cubicle has a loud voice and while he is commanding his own computer to "Exit the file without saving" you find that the work you have carefully constructed over the last four hours is suddenly thrown away too.

newusertoday 4 years ago

I tried using talonvoice but the recognition engine failed to understand a lot of words. I then searched for the pronunciation of those words on Google and talonvoice detected them correctly. In the end I learned to pronounce the words in American English so that talonvoice can understand them ;-). Not what I was hoping for; I wanted to teach the computer to recognize my voice, not the other way around.

  • daanzu 4 years ago

    With an open system/engine, you can train your own personal speech model. For kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), you can do so without all that much difficulty, although the process/documentation could certainly use improvement.

    I bootstrapped my personal speech model by retaining recordings of the commands I spoke while using WSR. My voice is quite abnormal, and it took only 10 hours of speech data to train a model orders of magnitude more accurate than any generic model I've ever used. And of course, I retain much of my usage now with Kaldi, so my model improves more and more over time. A virtuous flywheel!

miguel-muniz 4 years ago

Has anyone here used Apple's built-in Voice Control[1] feature in macOS? I imagine having something built into the OS is better than third-party software, but I haven't used any, so I don't know.

[1] https://support.apple.com/en-us/HT210539

rileyphone 4 years ago

I’ve been messing around with https://github.com/ideasman42/nerd-dictation which, with the big model, gives surprisingly accurate local detections. Definitely more diy/hacker focused than actually being a solution though.

browningstreet 4 years ago

Alexa can't even hear the 3 things I say to it every single day with any accuracy.

But, it seems like all voice control development keeps getting bought up by the Big 3, so it's not likely to have any significant breakthroughs independent of what Apple, Google and Amazon think voice control is good for.

doesnotexist 4 years ago

Have you read this blog post by Josh W. Comeau outlining his experience with Talon for a developer workflow?

https://www.joshwcomeau.com/blog/hands-free-coding/

1vuio0pswjnm7 4 years ago

Surprised that something like Talon + RPi has not tapped into the "smart speaker" market.

cpach 4 years ago

Just a thought: Have you tried Dasher…?

It’s an alternative input method. Might be worth giving a try.

https://www.inference.org.uk/dasher/DasherSummary2.html

  • kleer001 4 years ago

    oooh, I always wanted that. Sad it doesn't look like it's advanced much.

philonoist 4 years ago

For those who need immediate help for RSI, use ->

Voice Finger by Cozendy [$9.99]

Lenovo Voice Control from msstore [free]

Amazon Alexa from msstore [free]

"Win Key + h" for the inbuilt text box dictation [inbuilt]

serenade.ai [$$]

I don't have an exact answer for you, OP, but I hope someone builds a helpful one for you.

daviddever23box 4 years ago

Arguably, one might wish to create an audible, non-linguistic shorthand for positional control that would allow for higher efficiency when, say, retouching an image within Photoshop, but without the use of hands.

singularity2001 4 years ago

Nuance started to completely monopolize the market 20 years ago and has crippled a lot of innovation in that space. It's still a minefield of ugly patents for any commercial contestants.

mleonhard 4 years ago

Switching to a tenting split keyboard (Goldtouch V2) and vertical mouse (Evoluent) reduced my RSI. Strength training (Les Mills Body Pump) is what finally solved it. Have you tried those?

wizzerking 4 years ago

The major problem when using voice to control a machine is tremors in the voice as the work day proceeds, when the person is stressed, and if the person is experiencing health issues. All these situations/reasons will change the timbre, and in some cases the intonation, like emphasis on a syllable. Now top that off with accents, like those of a Hispanic speaker, or regional slang. Deep learning kits like https://github.com/FreddieAbad/Voice-Recognition-using-Deep-... are making headway but are still far from general voice recognition.

Avatars 4 years ago

"Why is there no high quality method for voice control of a PC?" For the same reason there's no standardized encryption for everyone's comms. Or, 'Why is there no software for gps that works on pc's that is easy and readily available?'. Same reason.

There is a voice assistant app for Android that uses Vosk called Dicio (on F-Droid). Storage is cheap and easy. Processing power is there even in cheap third-world phones. I personally detest typing and would love to talk to my devices without any 3rd party nonsense requirements. Truly there is none because the powers that be do not want everyone thinking they are in control, essentially of anything.

Arubis 4 years ago

Absent regulation or other incentives to nudge the market otherwise, the overwhelming majority of software is and will be written to use existing input methods--i.e. adding a new input method isn't the core competency of a team creating the world's best todo list app.

With that precondition, any voice-to-control layer on the desktop is in the tough situation of translating between voice input and a piece of software that was designed without voice input in mind.

Google and Siri, etc., aren't as beholden to the desktop/browser interface paradigm, so they don't have to perform this interface translation.

walls 4 years ago

A friend of mine uses VoiceAttack in a few VR games and it seems to work decently for triggering actions. Not sure if it's any good at transcription though.

wnolens 4 years ago

That sounds exhausting.

"Open this program"

"Minimize"

"Focus on this text input"

..dictate..

"switch to command mode"

"save and close"

i'd rather just: "click click tab type ctrl-S"

  • falcolas 4 years ago

    The actions are a bit more like

    "move mouse to this 100x100 pixel square, click twice within 100ms"

    "move mouse to this 20x20 pixel square, click once"

    "Move mouse to this 100x1000 pixel rectangle, click once"

    "Type text at a rate 1/5 (1/2 if you're particularly fast) speaking rate"

    "Move your pinkie (your weakest finger) to 'ctrl' and click, move your index to 's' and click, release both, verify it worked with a visual cue then either another 20x20 mouse maneuver, or "move your thumb to 'alt', and your index finger to 'f4' (assuming you have access to the function keys), click and release"

    Moving a mouse to a very specific spot on a screen is a relatively slow - and hard if you have any motor control issues - task.

    • wnolens 4 years ago

      You're right. But OP didn't understand why it doesn't exist because "market is so big."

      I presumed they meant more than the extreme edge of RSI sufferers. So I ran the thought experiment.

      I've had a mild RSI. The solution was to get a fancy ergo mouse/keyboard/desk/chair and retrain myself. I've even seen a guy use a joystick instead of a mouse.

  • sp332 4 years ago

    The post is about people who physically can't do that.

    • warrenm 4 years ago

      OP claims the market is "huge"

      It's not

      It's tiny (at best)

      Tiny markets don't tend to get much attention

      • falcolas 4 years ago

        It's a market (that is, the market for people with motor control issues in their hands) that every single human on earth (that doesn't die first) will be in later in their lives.

      • wnolens 4 years ago

        Bingo

  • cf 4 years ago

    Most of these systems entail developing a shorthand. For operations you expect to do a lot you assign one-syllable commands.

    • marginalia_nu 4 years ago

      Let's not create a false dichotomy between mouse control and voice control. There are other alternatives that are arguably less handicapping than being reduced to voice commands.

amelius 4 years ago

Because Google is keeping all their crowdsourced voice data secret.

danShumway 4 years ago

I'll give an answer in a slightly separate direction: there aren't engines that are both good enough and open enough to hook into that Open Source communities can build around them.

There are two ways that new software gets built: either the market is big enough and accessible enough that commercial software gets built, or the software is easy enough to build that hobbyists enter the space and solve their own problems. For example, the commercial market for keyboard-driven interfaces is also quite small, but we still have stuff like Sway. But a good keyboard-driven interface is easier to build than speech recognition.

I've been curious about this area for a while, but my understanding is voice-to-text Open Source solutions are still kind of primitive for general text transcribing. The libraries aren't very fun to work with, they're often embedded Python/Java "stuff", and the accuracy isn't great if you advance past the level of text transcription. Additionally, controlling computers and hooking into X or Wayland feels a bit hacky.

That being said, I'll push back on people who are saying that no one would want to control an interface this way. The success of systems like Alexa/Siri/Google is pretty definitive proof to me that (all their weaknesses aside) there is a market for voice interfaces. But the ties between that market and the desktop are not strong, and the ecosystem isn't open enough to really build on in that direction.

I suspect that until efforts like Mozilla's open speech datasets pick up more steam and become competitive (if they ever do), it's going to be kind of laggy to find solutions because it's not immediately obvious how to enter the market, either as a commercial company or as an Open Source dev. But maybe I'm wrong and I just haven't researched it enough and the area is totally ripe for disruption. Maybe for people with RSI they'd tolerate something like clipping a bluetooth mic to their lapel or something and that would boost accuracy. Maybe there's another way to approach entering code that isn't just straight text recognition, possibly combining it with some kind of AST or code analysis that made it easier to guess what people were saying.

In any case, I don't think the problem is that people don't want to talk to their computers. Personally I don't like using voice assistants, but they are very popular, in no small part because of the voice part. So maybe there is an evolution of desktop UI controls that could become really popular, or at least competitive with entrenched solutions for people with limited mobility or RSI. But it would require someone to introduce some kind of actual UX innovation into the space, or to find a way of getting over the moat around good recognition and OS integration.

warrenm 4 years ago

>The market is huge

Apparently ... it's not

Or, rather, it's not YET "huge"

Sure - half the planet is online, but they're speaking myriad languages in more combinations of enunciation, dialect, and accent than is probably even calculable

>the Natural Language Processing of "OK Google" and Siri are quite refined at this point

Totally different to ask for today's weather and to tell a computer what to do - just like it's totally different to hit your favorite search engine and type "what is Pluto's orbit" and to write the search engine that goes off and does what you asked (and even when it does go off and do it, it still returns multiple (often conflicting) results - which leads to the whole problem of identifying authority online (something I wrote about 15+ years ago https://antipaucity.com/2006/10/23/authority-issues-online/#...))

It's also worlds different to be able to respond to variations on a theme of maybe a couple hundred search keywords (is it even that many?) and the literally unlimited number of commands people issue to their computing devices every day. Let's even say Siri is That Good™ - you've got a MacBook, iPhone, and iPad on your desk ...which one should respond when you say, "Hey, Siri"? Why that one vs this one? Do you have to start every command with the name of the device? Maybe that's not so hard at home (maybe), but get into corporate environments with naming conventions like H5GG71WLD? ... or dozens/scores/hundreds of people within listening distance of everyone's microphones getting triggered by other conversations in the room, conference calls, your cubemates' inability to attenuate their voices and aim only at their laptop when talking ...

It's a nightmare to think about - practically, let alone computationally

Most people look at the example of, say, Star Trek for voice commands to "the computer". Ever notice the computer only responds when the script demands it? Geordi shouting in Engineering commands to his team or panicked messages to the bridge are never misinterpreted by the computer as commands to it

That's mighty convenient - and not at all representative of anything resembling a reality we can create [yet]

Maybe in another few decades or centuries ... but I'd wager probably not

Another consideration: speaking is very slow compared to a click, tap, or typing a few characters at a prompt. Why would you want to intentionally make your human-to-device interactions more clumsy and error-prone?

  • 4midoriOP 4 years ago

    OP here. Great comments and ideas, all. A few notes:

    * Talon is pretty great

    * I think the market for text to speech and voice control is huge, and maybe Dragon/Nuance rules it because of their patents, but oh, does it suck. Like being stuck on Windows 95 or something.

    * Voice Recognition is in fact currently good enough to get real work done efficiently

    * Serious RSI can't be fixed with ergonomics or better devices

    * If there were a modern alternative to Dragon, it would solve a chunk of the problem

    It's true that computer control currently requires a lot of customization, but I see no practical reason why we can't at least make simple commands fast and accurate, e.g., 'create new html document in VS Code'.

sleepingadmin 4 years ago

This certainly exists, and I have set it up for various blind people who make do. Unfortunately I don't recall what it was exactly, but they bought it and all that.

The thing about voice is how weak it is. Even if you've trained it well and you speak well, which I don't, it won't be as good as a keyboard.

Putting work into voice like this for productivity is pointless. Any effort is best placed in brain-computer interfaces. Hopefully not surgically implanted, like Neuralink is doing; more of a headset, like Valve and OpenBCI are doing.

Let's just wear a headset and work; keyboards can just be there in case you need them.

  • vasco 4 years ago

    Agree completely on your points about brain computer interfaces and voice not being worth investing in outside of supporting people with accessibility issues, unfortunately.

    It's also super weird to speak to a computer. Typing, touching or thinking are all fine, but somehow sitting in a room talking to a machine is a bit weird, even though it's not weird if I'm on a call; I can't explain it. Do others have a similar experience?
