Simple Speech-to-Text on the '10 Cents' CH32V003 Microcontroller
It's really cool how people are taking these tiny cheap MCUs and making them do fun things for hobbyists. There's nothing better than a project with zero real-world use case that's done just because it was a challenge.
Eg:
Making the CH32V003 programmable via USB: https://www.youtube.com/watch?v=j-QazXghkLY
CH32V003 "Super-Cluster": https://www.youtube.com/watch?v=lh93FayWHqw
Powering a Nixie Tube from USB with a CH32V003: https://www.youtube.com/watch?v=-4d3PgEXhdY
(A good rule in life in general is to just always watch CNLohr and Bitluni if you're into "useless but amazing hardware projects")
The software portion of this has a common real-world use case: keyword spotting (voice command) in interactive phone calls.
The general concept has a real world use, but this specific implementation probably doesn't perform as well as most of the already available ones on more powerful hardware.
Maybe it could have an IRL use for efficient wake word spotting or the like though.
The point of a wake word is that it runs on low-power hardware so everything else can sleep; that's why it's a wake word.
Though tbf much more powerful arm processors are indeed still very low power.
Minor nitpick/clarification: as it stands this is doing detection of a fixed, small vocabulary of words, not open-ended speech-to-text covering an entire language. This is also called speech command recognition or keyword spotting, which is already impressive and useful. General STT on this grade of hardware would be an amazing feat!
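To make the distinction concrete, here's a toy sketch (not the project's actual code) of the fixed-vocabulary idea: keyword spotting just classifies an utterance as the nearest of N stored templates, rather than decoding arbitrary text. The feature vectors and the distance metric below are made up for illustration.

```python
# Toy keyword spotter: nearest-template classification over a
# fixed vocabulary. Real systems would compare sequences of
# frame-level features (e.g. MFCCs), not single vectors.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def spot_keyword(features, templates):
    """Return the label of the stored template closest to `features`."""
    return min(templates, key=lambda label: euclidean(features, templates[label]))

# Pretend 3-dimensional feature vectors for two vocabulary words:
templates = {"zero": [0.1, 0.9, 0.2], "one": [0.8, 0.1, 0.7]}
print(spot_keyword([0.2, 0.8, 0.3], templates))  # prints "zero"
```

The point is that the output space is closed (here, two labels); there is no language model and no open-ended transcription.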
this is exciting! it's still at prototype stage: 'getting about 90% accuracy [distinguishing between the spoken digits 'zero' to 'nine',] with the code as it stands.'
i wonder if modern continuous optimization algorithms could yield a neural network that would do better than this mfcc approach at, perhaps, even lower computational cost
they seem to have gotten more expensive lately, though (11.83¢ in quantity 500), and lcsc is out of stock on the ch32v003. they only have in stock ch32v203 and up, which costs 37.5¢. https://www.lcsc.com/products/Microcontroller-Units-MCUs-MPU...
digi-key, as usual, doesn't list the part at all
If you search a part number in Google and the only datasheet result is from an unpronounceable Chinese website, there's a very good chance it's not going to be on digikey; LCSC or AliExpress will be your only options. Even when designing boards, you have to consider whether to pick parts from the LCSC library or digikey, because they don't carry all the same parts, and even parts you'd think are jellybean don't come in the exact same package on both sites (especially SOT packages, which are similarly sized but not identical).
if you think chinese websites are 'unpronounceable' you probably shouldn't try to design hardware
I do design hardware for a living, it all uses quality components from trusted Western vendors and suppliers.
Replacing the codebook approach with a statistical/DNN is more likely to give higher accuracy than getting rid of mfccs as spectral representation (at least in general ASR). (Arguably, using Mel spectra was the least controversial design choice made for Whisper.)
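For context, the mel front end mentioned here is just a bank of triangular filters spaced evenly on the mel scale (roughly linear below ~1 kHz, logarithmic above, tracking perceived pitch). A minimal sketch of the standard HTK-style conversion formulas follows; the filter count and frequency range are arbitrary choices for the example.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters triangular filters,
    spaced evenly on the mel scale between f_low and f_high."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

# e.g. 10 filters over 0-4 kHz (suits 8 kHz sampled speech)
centers = mel_filter_centers(10, 0.0, 4000.0)
```

MFCCs are then log filterbank energies followed by a DCT; a codebook or a DNN classifier can sit on top of either representation.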
thank you! those are good points. i was thinking that maybe you could get by with some relatively sparse convolutional layers over the raw sound samples and save yourself the expense of doing a real fourier transform, but maybe that's a dumb idea
It is a good idea that is worth trying out! Like anything there are tradeoffs though, so it is not guaranteed to be better for this particular circumstance. The ability to use low-bitdepth integer operations (which is easy for a neural net) should be beneficial on a CPU without a floating point unit. But weights need to be stored, and it can be difficult to match FFT efficiency, depending on what resolution is actually needed/utilized.
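To illustrate the low-bitdepth point, here's a sketch (in Python for readability, though on the MCU this would be C) of symmetric per-tensor int8 quantization: weights are quantized once, so inference needs only integer multiply-accumulates, which are cheap on a core with no FPU. The numbers are invented for the example.

```python
def quantize_int8(weights):
    """Quantize float weights to int8 with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def int_dot(q_weights, int_samples, scale):
    # Integer MAC loop; a single float multiply at the end rescales.
    acc = 0
    for w, x in zip(q_weights, int_samples):
        acc += w * x  # int8 weight * integer sample, integer accumulate
    return acc * scale

weights = [0.5, -1.0, 0.25]
q, s = quantize_int8(weights)
samples = [100, 200, -50]  # e.g. centered ADC readings
approx = int_dot(q, samples, s)
exact = sum(w * x for w, x in zip(weights, samples))
```

The tradeoff the parent mentions shows up directly: every layer's `q` must live in the 16K flash, whereas an FFT needs only its twiddle factors.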
They also don't list any HBM memory or GDDR7, which is frustrating as I'm trying to use KiCad to design a cheaper PCI card... but finding any decent documentation on those chips is impossible at the moment.
you may need a wechat account to sign up for the necessary chinese-language-only web forums in shenzhen
Really nice project! Great care is taken in optimized audio feature extraction, very cool to see. I am working on a very similar project[1], using the Puya PY32. I opted for that chip over CH32 since it has DMA (simplifies efficient ADC input at audio rates), and 1 kB more RAM. For a couple of cents more. I have written about some of the hardware constraints on low cost audio already, and am getting to the audio DSP/ML in the next months.
I wonder how this performs compared to the "voice recognition" VCP200 chip sold by Radio Shack in the eighties (maybe early nineties?). https://21stdigitalhome.blogspot.com/2013/06/vcp200-voice-re...
It would also be interesting to know if Voice Control Products ever had a real design win.
I gather the VCP200 was a mask-programmed M6804 microcontroller. The M6804 was a strange and obscure beast, apparently a cost-reduced, internally serial ("1-bit"), partial reimplementation of the M6805, which was one of the first Motorola 8-bit microcontrollers based on the 6800. Max bus speed of 2.75MHz, with an instruction cycle time of 44 microseconds. 32 bytes of RAM and 1K mask-programmed ROM. No ADC. http://www.bitsavers.org/components/motorola/6804/M6804_MCU_...
One should be able to do better with just about any modern microcontroller. Then again, for all I know the VCP200 was not fit for even the modest tasks (looks like toy/novelty/hobbyist stuff) it was marketed for back then.
Is there a recorded demo? Reading about speech-to-text is different from hearing it.
Speech to text, not text to speech. There's nothing to hear but your own voice.
Well, consider that everything before Whisper was less than 40% accurate on my voice (I don't know the reason), while Whisper is now close to 100% even with tech terms/abbreviations. Things like Siri, Google, Alexa, Dragon etc. never understand anything I say (I stopped trying, so it might have improved, but I did try not long ago). When I ask for the weather, something like Siri searches Google for what the border is, etc. I am not a native English speaker, but I am fluent (I work in English full time) and humans never have any issues; also, in my own language, none of them work either, except Whisper, even running locally (which, as I said, might have improved recently).
So it would be interesting to hear how clearly you would need to speak, and how it handles different people with accents and such.
I experience exactly the same. For me it’s an “accent” caused by profound hearing loss. No issues in everyday conversation, but almost zero success with any speech to text tool.
Could still have a demo showing how example recordings got transcribed
About 10 years ago, I used a basic flip phone, vendor locked to a $15/month Verizon plan.
The Walmart page for a similar device is still up at
https://www.walmart.com/ip/Verizon-Wireless-Samsung-Gusto-3-...
Among other things, it had limited speech recognition -- you could say "Call" followed by a name, and it would match that against the address book on device.
We live in strange times.
My 2006 Infiniti had voice commands for calling people in your address book. Road noise trashed the microphone quality so it only really worked well when you were at a stop.
Handsfree mics in cars still suck and Bluetooth handsfree audio quality sucks too, not sure why this is still a problem. I get backwards compatibility issues but is good compression that difficult in newer devices?
i had a sprint samsung sch-6100 flip phone with a similar voice recognition feature at the end of last millennium, but it would only match the name you told it to call against names you'd previously made training voice recordings of. that is, it wasn't trying to do speech-to-text or text-to-speech; it was just trying to discriminate among the particular recordings you had made previously
i didn't use the feature very much because to activate it, iirc, you had to either flip the phone open or press a button on a hands-free headset. but obviously this wasn't a bluetooth headset, and the phone couldn't play music, so you wouldn't walk around with it in your ears all the time; unless you'd just gotten off a different call, you'd have to get it out, put it in your ears, plug it in, and then you could use the speech recognition feature
so unless you were a secretary or something, making one phone call after another for hours (to a small number of people), you might as well just use speed dial
Yeah, even the 24-year-old Nokia 3310 had some form of voice dialling [1].
OS/2 Warp 4.0 (1996) came with speech recognition and dictation software [2]. The CPUs it supported back then weren't much better than a 10-year old phone.
[1] https://en.wikipedia.org/wiki/Nokia_3310
[2] https://www.os2world.com/wiki/index.php?title=OS/2_Warp_4:_%...
In addition, way back in 1993, Apple released speech recognition with the original AV Macs (which were outfitted with 55 MHz AT&T 3210 DSPs in addition to their 25+ MHz Motorola 68040s) which was then also supported on the PowerPC Macs released the following year (that started at 60 MHz)
Projects like this really open the doors to coin sized devices which can record months of audio from a tiny battery.
You can imagine employers who might want a record of everything said on their premises for example.
Ah sweet man made horrors that I now am going to be thinking about.
If you uploaded some training data somewhere, perhaps to some links to simulators, you might get a crowd of people code-golfing this to maximize accuracy.
What's the minimum spec chip you will need to run the smallest whisper model (looks like that's 39M parameters)?
That's what I thought seeing this. Whisper does English best, but it's also the best I've seen when it comes to other languages.
ESP32-S3 or ARM Cortex M7, probably.
90% accuracy on 10 digits is pretty disappointing but cool project.
On a $0.10 (10 cents) chip with 16K storage and 2K (2048 bytes) RAM? That runs at a maximum of 48 MHz?
Dunno boss, for a PoC seems pretty impressive to me.
Very cool for a PoC. I suspect a dense neural net approach would give much better results though.