HN: Shoute – Yes, another dictation app. Why the last 5% is the whole product
getshoute.com
Congrats on launch!
I'm a little bit confused by this. You say it supports 100+ languages, but on the landing page some languages are colored in and the rest are greyed out, and the total number doesn't seem to amount to 100+.
Also, presumably the local model doesn't cost you anything per token. So why isn't that one the free tier, with the cloud model being in the paid plan? Wouldn't that help you get a lot more users cost-efficiently?
Lastly, your landing page has a lot of "AI hallmarks". This may or may not be a bad thing, but at least on here I imagine many people are fatigued from this pattern.
I'm all for apps that don't use Electron. What did you use for this?
Interesting that you use WhisperKit for local transcription. We built something comparable in speech-swift (which I maintain), focusing on on-device ASR with Qwen3-ASR, which supports 52 languages and achieves an RTF of 0.06 on Apple Silicon. The tradeoff is full native Swift async integration. https://github.com/soniqo/speech-swift
Cool project!
What we found was that for super fast tap to speak and paste text, WhisperKit is already close to instant (basically realtime for Apple Silicon). Faster than realtime is mostly only useful for batch processing of audio which is not really our product.
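For anyone unfamiliar with the term: real-time factor (RTF) is processing time divided by audio duration, so lower is faster and 1.0 means "as fast as you speak". A rough sketch of the arithmetic, using a hypothetical 10-second clip (the clip length and the helper name are illustrative assumptions, not measurements from either project):

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """RTF = processing time / audio duration, so time = duration * RTF."""
    return audio_seconds * rtf


clip_seconds = 10.0  # hypothetical clip length, chosen only for illustration

for label, rtf in [("RTF 0.06", 0.06), ("RTF 1.0 (realtime)", 1.0)]:
    seconds = processing_seconds(clip_seconds, rtf)
    print(f"{label}: {seconds:.1f}s to process a {clip_seconds:.0f}s clip")
```

The gap between those two numbers matters most when the whole clip is processed after the fact, which is the batch case mentioned above.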
I've been working on Shoute, a speak-to-text app for Mac and Windows that's built around one idea: the full loop has to feel instant.
I do know this isn’t a new category. A lot of people here already have some version of this: whisper.cpp behind a hotkey, macOS dictation, SuperWhisper, Wispr Flow, or some other hand-rolled version.
I built one anyway because I kept bouncing off dictation tools in my actual workday.
My problem was not “can an app transcribe my voice?” Most of them can, and impressively well. The problem was the full loop: press shortcut -> speak -> release -> cleaned up text appears where I was already typing - and that this happens consistently, quickly, day after day.
If that loop has enough delay, I lose the thread. If the output is too raw, I am back to editing. If the app needs screenshots to understand context, I start feeling uneasy about using it everywhere. You want to be confident that it always will work - or else you lose trust in it.
So the version I wanted was pretty narrow:
- it should feel super quick for short everyday dictation
- the output should be cleaned up before insertion
- it should work across ALL the usual apps
- it should never lose data
- it should support both local and cloud modes (local for flying in my case, but also for privacy on specific things)
- it should use only minimal context
Shoute solves all of that really well and is lightweight (native code) and fluid to use day to day. It has a generous free tier (2000 words/week - should be enough for most casual use), a one-time purchase for both local and cloud, and a cloud subscription ($6.99/mo) for folks who need the latest cloud models. I'm not a fan of subscriptions either, but it's hard to offer ongoing support for the latest cloud models without one.
Learned some really cool things building this:
The interesting eng lesson for me has been that voice UX is so much more latency-sensitive than normal app UX - the major part of the work on this was on making it consistently low latency end to end.
On latency - the model is only one part of the delay. Shoute runs three backends for different modes and fallback (ElevenLabs streaming, Groq Whisper, and WhisperKit for on-device), and each has a different latency profile. For short recordings (~15s is my average; Shoute can handle much longer ones, but hour-long recordings are not the primary use case), the annoying delays often come from everything around the model: audio finalization, connection warmup, WebSocket setup, token fetching, fallback paths, local model cold starts, and finally pasting into the active app. Getting all this right consistently took significant time and eng effort even with Claude helping on all of it - taste and architectural direction are still absolutely essential in 2026, especially for desktop and system apps.
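To make the fallback idea concrete, here is a minimal sketch of a fallback chain with a per-backend timeout and per-backend timing. It is not Shoute's actual code: the function name, the stand-in backends, and the timeout values are all assumptions for illustration, and the real backends would be network or on-device calls rather than these stubs.

```python
import asyncio
import time
from typing import Awaitable, Callable

# A backend takes raw audio bytes and returns the transcript (hypothetical shape).
Backend = Callable[[bytes], Awaitable[str]]


async def transcribe_with_fallback(
    audio: bytes,
    backends: list[tuple[str, Backend, float]],
) -> tuple[str, str, dict[str, float]]:
    """Try each (name, backend, timeout_s) in order.

    Returns (winning backend name, text, seconds spent per attempted backend).
    """
    timings: dict[str, float] = {}
    for name, backend, timeout_s in backends:
        start = time.perf_counter()
        try:
            text = await asyncio.wait_for(backend(audio), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            # Too slow or connection failed: record the cost, try the next one.
            timings[name] = time.perf_counter() - start
            continue
        timings[name] = time.perf_counter() - start
        return name, text, timings
    raise RuntimeError("all backends failed")


# Stand-in backends that simulate the three failure/success cases.
async def slow_cloud(audio: bytes) -> str:
    await asyncio.sleep(1.0)  # simulates a stalled streaming connection
    return "cloud text"


async def broken_cloud(audio: bytes) -> str:
    raise ConnectionError("socket closed")


async def local_model(audio: bytes) -> str:
    return "local text"
```

Called with `[("streaming", slow_cloud, 0.05), ("batch", broken_cloud, 0.5), ("local", local_model, 0.5)]`, the chain gives up on the first backend after its timeout, skips the second on the connection error, and returns the local result. The timings dictionary shows how much of the delay budget each failed attempt consumed, which is the "fallback paths" cost mentioned above.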
Native development is still hard - things like WebSockets are fundamentally web technologies, and their native libraries have a lot of hard edges and inconsistencies that only show up when you use something 100 times a day. It took some engineering to get around this. Native does make the UX fast, and although the amount of network management almost made me wish I had chosen Electron, the speed and resource efficiency are worth going native for.
Okay, this already feels long - please try it, let me know how it feels, glad to hear feedback and feature requests. Thank you! Here is the link: https://getshoute.com/deepdive
Someone asked this but got flagged, so I'm still answering it here:
ejoso 2 hours ago: The latency breakdown is honest and the right frame. Most of the delay budget gets eaten before and after the model, and that is harder to fix than it looks. Cold starts on local Whisper variants and WebSocket warmup on cloud paths are both worse than benchmarks suggest. The differentiation question I keep landing on: whisper.cpp behind a hotkey with a paste shim is a solved afternoon project for a certain kind of person. The real gap is consistency across arbitrary apps, and that is genuinely harder than it sounds.
For the fully local path (flying, privacy-sensitive) what does Shoute add beyond well-packaged WhisperKit with better insertion handling? That answer is either the core pitch or an honest scoping of who this is for.
------------------------------------------------------
This makes a good point - latency and consistency are indeed the hardest things to get right. For the fully local path, the major value add for Shoute is still speed and consistency: getting the accessibility settings right and making the flow from model to inserted text reliable is useful for many folks. That, and keeping it updated as models improve and as OS updates land.