Simple Speech-to-Text on the '10 Cents' CH32V003 Microcontroller
It's really cool how people are taking these tiny cheap MCUs and making them do fun things for hobbyists. There's nothing better than a project with zero real-world use case that's done just because it was a challenge.
Eg:
Making the CH32V003 programmable via USB: https://www.youtube.com/watch?v=j-QazXghkLY
CH32V003 "Super-Cluster": https://www.youtube.com/watch?v=lh93FayWHqw
Powering a Nixie Tube from USB with a CH32V003: https://www.youtube.com/watch?v=-4d3PgEXhdY
(A good rule in life in general is to just always watch CNLohr and Bitluni if you're into "useless but amazing hardware projects")
The software portion of this has a common real-world use case: keyword spotting (voice command) in interactive phone calls.
The general concept has a real world use, but this specific implementation probably doesn't perform as well as most of the already available ones on more powerful hardware.
Maybe it could have an IRL use for efficient wake word spotting or the like though.
The point of a wake word is that it runs on low-power hardware so everything else can sleep; that's why it's a wake word.
Though tbf much more powerful arm processors are indeed still very low power.
Minor nitpick/clarification: as it stands this is doing detection of a fixed, small vocabulary of words, not open-ended speech-to-text covering an entire language. This is also called speech command recognition or keyword spotting, which is already impressive and useful. General STT on this grade of hardware would be an amazing feat!
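To make the distinction concrete, here's a toy sketch (not the project's actual code) of the fixed-vocabulary idea: keyword spotting just classifies an utterance as the nearest of N stored templates, rather than decoding arbitrary text. The feature vectors and the distance metric below are made up for illustration.

```python
# Toy keyword spotter: nearest-template classification over a
# fixed vocabulary. Real systems would compare sequences of
# frame-level features (e.g. MFCCs), not single vectors.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def spot_keyword(features, templates):
    """Return the label of the stored template closest to `features`."""
    return min(templates, key=lambda label: euclidean(features, templates[label]))

# Pretend 3-dimensional feature vectors for two vocabulary words:
templates = {"zero": [0.1, 0.9, 0.2], "one": [0.8, 0.1, 0.7]}
print(spot_keyword([0.2, 0.8, 0.3], templates))  # prints "zero"
```

The point is that the output space is closed (here, two labels); there is no language model and no open-ended transcription.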
this is exciting! it's still at prototype stage: 'getting about 90% accuracy [distinguishing between the spoken digits 'zero' to 'nine',] with the code as it stands.'
i wonder if modern continuous optimization algorithms could yield a neural network that would do better than this mfcc approach at, perhaps, even lower computational cost
they seem to have gotten more expensive lately, though (11.83¢ in quantity 500), and lcsc is out of stock on the ch32v003. they only have in stock ch32v203 and up, which costs 37.5¢. https://www.lcsc.com/products/Microcontroller-Units-MCUs-MPU...
digi-key, as usual, doesn't list the part at all
If you search a part number in Google and the only datasheet result is from an unpronounceable Chinese website, there's a very good chance it's not going to be on digikey; LCSC or AliExpress will be your only options. Even when designing boards, you have to consider whether to pick parts from the LCSC library or digikey, because they don't carry all the same parts, and even parts you'd think are jellybean don't come in the exact same package on both sites (especially SOT packages, which are similarly sized but not identical).
if you think chinese websites are 'unpronounceable' you probably shouldn't try to design hardware
I do design hardware for a living, it all uses quality components from trusted Western vendors and suppliers.
Replacing the codebook approach with a statistical/DNN is more likely to give higher accuracy than getting rid of mfccs as spectral representation (at least in general ASR). (Arguably, using Mel spectra was the least controversial design choice made for Whisper.)
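For context, the mel front end mentioned here is just a bank of triangular filters spaced evenly on the mel scale (roughly linear below ~1 kHz, logarithmic above, tracking perceived pitch). A minimal sketch of the standard HTK-style conversion formulas follows; the filter count and frequency range are arbitrary choices for the example.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters triangular filters,
    spaced evenly on the mel scale between f_low and f_high."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

# e.g. 10 filters over 0-4 kHz (suits 8 kHz sampled speech)
centers = mel_filter_centers(10, 0.0, 4000.0)
```

MFCCs are then log filterbank energies followed by a DCT; a codebook or a DNN classifier can sit on top of either representation.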
thank you! those are good points. i was thinking that maybe you could get by with some relatively sparse convolutional layers over the raw sound samples and save yourself the expense of doing a real fourier transform, but maybe that's a dumb idea
It is a good idea that is worth trying out! Like anything there are tradeoffs though, so it is not guaranteed to be better for this particular circumstance. The ability to use low-bitdepth integer operations (which is easy for a neural net) should be beneficial on a CPU without a floating point unit. But weights need to be stored, and it can be difficult to match FFT efficiency, depending on what resolution is actually needed/utilized.
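To illustrate the low-bitdepth point, here's a sketch (in Python for readability, though on the MCU this would be C) of symmetric per-tensor int8 quantization: weights are quantized once, so inference needs only integer multiply-accumulates, which are cheap on a core with no FPU. The numbers are invented for the example.

```python
def quantize_int8(weights):
    """Quantize float weights to int8 with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def int_dot(q_weights, int_samples, scale):
    # Integer MAC loop; a single float multiply at the end rescales.
    acc = 0
    for w, x in zip(q_weights, int_samples):
        acc += w * x  # int8 weight * integer sample, integer accumulate
    return acc * scale

weights = [0.5, -1.0, 0.25]
q, s = quantize_int8(weights)
samples = [100, 200, -50]  # e.g. centered ADC readings
approx = int_dot(q, samples, s)
exact = sum(w * x for w, x in zip(weights, samples))
```

The tradeoff the parent mentions shows up directly: every layer's `q` must live in the 16K flash, whereas an FFT needs only its twiddle factors.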
They also don't list any HBM memory or GDDR7, which is frustrating as I'm trying to use KiCad to design a cheaper PCI card... but finding any decent documentation on those chips is impossible at the moment.
you may need a wechat account to sign up for the necessary chinese-language-only web forums in shenzhen
Really nice project! Great care is taken in optimized audio feature extraction, very cool to see. I am working on a very similar project[1], using the Puya PY32. I opted for that chip over CH32 since it has DMA (simplifies efficient ADC input at audio rates), and 1 kB more RAM. For a couple of cents more. I have written about some of the hardware constraints on low cost audio already, and am getting to the audio DSP/ML in the next months.
I wonder how this performs compared to the "voice recognition" VCP200 chip sold by Radio Shack in the eighties (maybe early nineties?). https://21stdigitalhome.blogspot.com/2013/06/vcp200-voice-re...
It would also be interesting to know if Voice Control Products ever had a real design win.
I gather the VCP200 was a mask-programmed M6804 microcontroller. The M6804 was a strange and obscure beast, apparently a cost-reduced, internally serial ("1-bit"), partial reimplementation of the M6805, which was one of the first Motorola 8-bit microcontrollers based on the 6800. Max bus speed of 2.75MHz, with an instruction cycle time of 44 microseconds. 32 bytes of RAM and 1K mask-programmed ROM. No ADC. http://www.bitsavers.org/components/motorola/6804/M6804_MCU_...
One should be able to do better with just about any modern microcontroller. Then again, for all I know the VCP200 was not fit for even the modest tasks (looks like toy/novelty/hobbyist stuff) it was marketed for back then.
Is there a recorded demo? Reading about speech-to-text is different from hearing it.
Speech to text, not text to speech. There's nothing to hear but your own voice.
Well, consider that everything before Whisper was less than 40% accurate on my voice (I don't know the reason), while Whisper is now close to 100% even with tech terms/abbreviations. Things like Siri, Google, Alexa, Dragon etc. never understand anything I say (I stopped trying, so it might have improved, but I did try not long ago). When I ask for the weather, something like Siri searches Google for what the border is, etc. I am not a native English speaker, but I am fluent (I work in English full time) and humans never have any issues; also, in my own language, none of them work either, except Whisper, even running locally (which, as I said, might have improved recently).
So it would be interesting to hear how clearly you would need to speak, and how it handles different people with accents and such.
I experience exactly the same. For me it’s an “accent” caused by profound hearing loss. No issues in everyday conversation, but almost zero success with any speech to text tool.
Could still have a demo showing how example recordings got transcribed
About 10 years ago, I used a basic flip phone, vendor locked to a $15/month Verizon plan.
The Walmart page for a similar device is still up at
https://www.walmart.com/ip/Verizon-Wireless-Samsung-Gusto-3-...
Among other things, it had limited speech recognition -- you could say "Call" followed by a name, and it would match that against the address book on device.
We live in strange times.
My 2006 Infiniti had voice commands for calling people in your address book. Road noise trashed the microphone quality so it only really worked well when you were at a stop.
Handsfree mics in cars still suck and Bluetooth handsfree audio quality sucks too, not sure why this is still a problem. I get backwards compatibility issues but is good compression that difficult in newer devices?
i had a sprint samsung sch-6100 flip phone with a similar voice recognition feature at the end of last millennium, but it would only match the name you told it to call against names you'd previously made training voice recordings of. that is, it wasn't trying to do speech-to-text or text-to-speech; it was just trying to discriminate among the particular recordings you had made previously
i didn't use the feature very much because to activate it, iirc, you had to either flip the phone open or press a button on a hands-free headset. but obviously this wasn't a bluetooth headset, and the phone couldn't play music, so you wouldn't walk around with it in your ears all the time; unless you'd just gotten off a different call, you'd have to get it out, put it in your ears, plug it in, and then you could use the speech recognition feature
so unless you were a secretary or something, making one phone call after another for hours (to a small number of people), you might as well just use speed dial
Yeah, even the 24-year-old Nokia 3310 had some form of voice dialling [1].
OS/2 Warp 4.0 (1996) came with speech recognition and dictation software [2]. The CPUs it supported back then weren't much better than a 10-year old phone.
[1] https://en.wikipedia.org/wiki/Nokia_3310
[2] https://www.os2world.com/wiki/index.php?title=OS/2_Warp_4:_%...
In addition, way back in 1993, Apple released speech recognition with the original AV Macs (which were outfitted with 55 MHz AT&T 3210 DSPs in addition to their 25+ MHz Motorola 68040s) which was then also supported on the PowerPC Macs released the following year (that started at 60 MHz)
Projects like this really open the doors to coin sized devices which can record months of audio from a tiny battery.
You can imagine employers who might want a record of everything said on their premises for example.
Ah sweet man made horrors that I now am going to be thinking about.
If you uploaded some training data somewhere, perhaps to some links to simulators, you might get a crowd of people code-golfing this to maximize accuracy.
What's the minimum spec chip you will need to run the smallest whisper model (looks like that's 39M parameters)?
That's what I thought seeing this. Whisper does English best, but it's also the best I've seen when it comes to other languages.
ESP32-S3 or ARM Cortex M7, probably.
90% accuracy on 10 digits is pretty disappointing but cool project.
On a $0.10 (10 cents) chip with 16K storage and 2K (2048 bytes) RAM? That runs at a maximum of 48 MHz?
Dunno boss, for a PoC seems pretty impressive to me.
Very cool for a PoC. I suspect a dense neural net approach would give much better results though.