According to the OpenASR Leaderboard [1], looks like Parakeet V2/V3 and Canary-Qwen (a Qwen finetune) handily beat Moonshine. All 3 models are open, but Parakeet is the smallest of the 3. I use Parakeet V3 with Handy and it works great locally for me.
It is about the parameter numbers if what you care about is edge devices with limited RAM. Beyond a certain size your model just doesn't fit, it doesn't matter how good it is - you still can't run it.
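To put rough numbers on the "does it fit" point, here is a back-of-the-envelope sketch. It only counts weight storage (parameters times bytes per parameter) and multiplies by an overhead factor for activations and runtime buffers. The 1.5x overhead is my own placeholder assumption, not a measured figure for any of these models, and the function names are made up for illustration.

```python
def model_weight_bytes(n_params: int, bytes_per_param: float) -> float:
    """Bytes needed just to hold the weights (no activations, no runtime)."""
    return n_params * bytes_per_param


def fits_in_ram(n_params: int, bytes_per_param: float, ram_bytes: int,
                overhead_factor: float = 1.5) -> bool:
    """Rough check: weights times an ASSUMED overhead factor must fit in RAM.

    overhead_factor is a guess standing in for activations, caches and
    runtime buffers; measure on the real device before trusting it.
    """
    needed = model_weight_bytes(n_params, bytes_per_param) * overhead_factor
    return needed <= ram_bytes


if __name__ == "__main__":
    # 58M is the Base-size parameter count quoted elsewhere in this thread.
    for label, width in (("fp32", 4), ("fp16", 2), ("int8", 1)):
        mb = model_weight_bytes(58_000_000, width) / 1e6
        print(f"{label}: {mb:.0f} MB of weights")
```

The same arithmetic shows why a model with a billion or more parameters is a different conversation on a device with a few hundred MB of free RAM, regardless of its leaderboard score.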
So I'm kinda new to this whole parakeet and moonshine stuff, and I'm able to run parakeet on a low end CPU without issues, so I'm curious as to how much that extra savings on parameters is actually gonna translate.
Oh and I type this in handy with just my voice and parakeet version three, which is absolutely crazy.
I'm building a local-first transcription iOS app and have been on Whisper Medium, switching to Parakeet V3 based on this.
One note for anyone using Handy with codex-cli on macOS: the default "Option + Space" shortcut inserts spaces mid-speech. "Left Ctrl + Fn" works cleanly instead. I'm curious to know which shortcuts you're using.
By the way, I've been using a Whisper model, specifically WhisperX, to do all my work, and for whatever reason I just simply was not familiar with the Handy app. I've now downloaded and used it, and what a great suggestion. Thank you for putting it here, along with the direct link to the leaderboard.
I can tell that this is now definitely going to be my go-to model and app on all my clients.
To this comment and all the other comments talking about handy below this comment. I tried handy right now and it's super amazing. I'm speaking this from Handy. This is so cool, man.
And handy even takes care of all the punctuation, which is really nice.
Thanks a lot for suggesting it to me. I actually wanted something like this, and I was using something like Google Docs, and it required me to use Chrome to get the speech to text version, and I actually ended up using Orion for that because Orion can actually work as a Chrome for some reason while still having both Firefox and Chrome extension support. So and I had it installed, but yeah.
This is really amazing and actually a sort of lifesaver actually, so thanks a lot, man.
Now I can actually just speak and this can convert this to text without having to go through any non-local model or Google Docs or whatever anything else.
Why is this so good man? It's so good
man, I actually now am thinking that I had like fully maxed out my typing speed to like 100-120. But like this can actually write it faster. You know it's pretty amazing actually.
Have a nice day, or as I abbreviate it, HAND, smiley face. :D
I'm looking to switch from feeding the default android "recorder" app's .WAV into Gemini 3 Pro (via the app) with (usually just) a `Transcribe this please:` prompt; content is usually German voice instructions/explanation for how to do/approach some sysadmin stuff; there does tend to be some amount of interjecting (primarily for clarifications(-posing/-requesting)) by me to resolve ambiguity as early as possible/practical.
If e.g. parakeet can be run on my phone in real time showing the transcript live:
- with latency low enough to be "comfortable enough" for the instructor to keep an eye on and approve the transcribed instructions
[not necessarily every word of the transcript, i.e., a commanded "edit" doesn't need to be applied in the outcome as long as its nature is otherwise clear enough to not add meaningful amounts of ambiguity to the final "written" instructions]
by glancing at the screen while dictating the explanation (and blurting out any transcription complaints as soon as that's possible without breaking one's own string-of-thought or spoken grammar too much)
, I'd very happily switch to that approach instead of what I was doing.
Bonus if there's a no-bulky-or-expensive-hardware way to accommodate us both speaking over each other so I won't have to _interrupt_ his speaking just to put a clarifying comment (on what he just said) in the transcript for him to see and sign off, where that at least "only" briefly interrupts his thoughts right while he actually reads my transcribed words (he doesn't have to hear them, and it's better if he won't; I can probably get him to put on earmuffs to not hear me louder than he hears his thoughts, and a sufficiently-smoothed SNR meter for specifically his voice should take care of him regulating his volume while the earmuffs mute it and I occasionally talk over him)...
Congrats on the results. The streaming aspect is what I find most exciting here.
I built a macOS dictation app (https://github.com/T0mSIlver/localvoxtral) on top of Voxtral Realtime, and the UX difference between streaming and offline STT is night and day. Words appearing while you're still talking completely changes the feedback loop. You catch errors in real time, you can adjust what you're saying mid-sentence, and the whole thing feels more natural. Going back to "record then wait" feels broken after that.
Curious how Moonshine's streaming latency compares in practice. Do you have numbers on time-to-first-token for the streaming mode? And on the serving side, do any of the integration options expose an OpenAI Realtime-compatible WebSocket endpoint?
I've helped many Twitch streamers set up https://github.com/royshil/obs-localvocal to plug transcription & translation into their streams, mainly for German audio to English subtitles.
I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin which can just grab the audio from their OBS audio sources.
I see a couple obvious problems: this doesn't seem to support translation which is unfortunate, that's pretty key for this usecase. Also it only supports one language at a time, which is problematic with how streamers will frequently code-switch while talking to their chat in different languages or on Discord with their gameplay partners. Maybe such a plugin would be able to detect which language is spoken and route to one or the other model as needed?
Claiming higher accuracy than Whisper Large v3 is a bold opening move. Does your evaluation account for Whisper's notorious hallucination loops during silences (the classic 'Thank you for watching!'), or is this purely based on WER on clean datasets? Also, what's the VRAM footprint for edge deployments? If it fits on a standard 8GB Mac without quantization tricks, this is huge.
For those wondering about the language support, currently English, Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, Vietnamese are available (most in Base size = 58M params)
Accuracy is often presumed to be English, which is fine, but it's a vague thing to say "higher" because does it mean higher in English only? Higher in some subset of languages? Which ones?
The minimum useful data for this stuff is a small table of language | WER for dataset
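For anyone who wants to build that table themselves, here is a minimal sketch of word error rate aggregated per language. WER here is the usual word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, summed over all samples in a language. The sample tuples and function names are made up for illustration; real evaluations also normalize casing and punctuation first, which this sketch skips.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or substitution
        prev = cur
    return prev[-1]


def wer_by_language(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """samples: (language, reference, hypothesis). Returns language -> WER."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for lang, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        errors[lang] += word_edit_distance(ref, hyp)
        words[lang] += len(ref)
    # Languages with zero reference words are skipped to avoid dividing by 0.
    return {lang: errors[lang] / words[lang] for lang in words if words[lang]}


if __name__ == "__main__":
    toy = [  # toy strings, NOT real model output
        ("en", "the cat sat", "the cat"),
        ("en", "hello world", "hello world"),
        ("de", "guten tag", "guten tag"),
    ]
    for lang, score in sorted(wer_by_language(toy).items()):
        print(f"{lang} | {score:.1%}")
```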
No idea why 'sudo pip install --break-system-packages moonshine-voice' is the recommended way to install on raspi?
The authors do acknowledge this though and give a slightly too complex way to do this with uv in an example project (FYI, you don't need to source anything if you use uv run)
Any plans regarding JavaScript support in the browser?
There was an issue with a demo but it's missing now. I can't recall for sure but I think I got it working locally myself too but then found it broke unexpectedly and I didn't manage to find out why.
I find it an even weirder practice for anyone working with speech or text models not to name, in the first paragraph, the language it is meant for (and I do not mean the programming language bindings). How many native English speakers are there, 5% of the world population?
My startup is making software for firefighters to use during missions on tablets, excited to see (when I get the time) if we can use this as a keyboard alternative on the device. It's a use case where avoiding "clunky" is important and a perfect usecase for speech-to-text.
Due to the sector being increasingly worried about "hybrid threats" we try to rely on the cloud as little as possible and run things either on device or with the possibility of being self-hosted/on-premise. I really like the direction your company is going in in this respect.
We'd probably need custom training -- we need Norwegian, and there's some lingo, e.g., "bravo one two" should become "B-1.2". While that can perhaps also be done with simple post-processing rules, we would also probably want such examples in training for improved recognition? Have no VC funding, but looking forward to getting some income so that we can send some of it in your direction :)
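For what it's worth, the post-processing route could look roughly like the sketch below. It assumes the pattern is "phonetic letter word, then two digit words" mapping to "LETTER-d.d", which I am inferring from the single "bravo one two" example. The word lists are English and deliberately incomplete; Norwegian digit words and the full phonetic alphabet would need to be filled in, and the function name is made up.

```python
import re

# Partial, illustrative tables; extend for the full alphabet and for Norwegian.
PHONETIC = {"alfa": "A", "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

_DIGIT_ALT = "|".join(DIGITS)
_CALLSIGN = re.compile(
    r"\b(" + "|".join(PHONETIC) + r")\s+(" + _DIGIT_ALT + r")\s+(" + _DIGIT_ALT + r")\b",
    re.IGNORECASE,
)


def normalize_callsigns(text: str) -> str:
    """Rewrite e.g. 'bravo one two' as 'B-1.2'; leave everything else alone."""
    def _replace(match: re.Match) -> str:
        letter = PHONETIC[match.group(1).lower()]
        first = DIGITS[match.group(2).lower()]
        second = DIGITS[match.group(3).lower()]
        return f"{letter}-{first}.{second}"
    return _CALLSIGN.sub(_replace, text)


if __name__ == "__main__":
    print(normalize_callsigns("send bravo one two to the north side"))
```

Rules like this only help when the recognizer already gets the words right, so they complement rather than replace having such phrases in the training data.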
Interesting. Can we get in touch? I just sold my webapp/saas where I used NB-Whisper to transcribe Norwegian media (podcast, radio, TV) and offer alerts and search by indexing it using elasticsearch.
Edit: It was https://muninai.eu (I shut down the backend server yesterday so the functionality is disabled).
Nice work. One metric I’d really like to see for streaming use cases is partial stability, not just final WER.
For voice agents, the painful failure mode is partials getting rewritten every few hundred ms. If you can share it, metrics like median first-token latency, real-time factor, and "% partial tokens revised after 1s / 3s" on noisy far-field audio would make comparisons much more actionable.
If those numbers look good, this seems very promising for local assistant pipelines.
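To make the "% partial tokens revised after 1s / 3s" idea concrete, here is one possible way to compute it from a log of partial hypotheses. The definition is my own assumption: a token counts as emitted each time it newly appears at a position, and as a late revision if it is later changed or dropped after having been on screen for at least the given delay. Other definitions (for example aligning by word instead of by position) are equally defensible. Real-time factor is included as the usual processing time divided by audio duration.

```python
from typing import Dict, List, Sequence, Tuple

Partial = Tuple[float, Sequence[str]]  # (timestamp in seconds, tokens so far)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Values below 1.0 mean faster than real time."""
    return processing_s / audio_s


def late_revision_rate(partials: List[Partial], delay_s: float) -> float:
    """Share of emitted tokens that were revised after >= delay_s on screen.

    partials must be in chronological order. Position-based comparison is an
    ASSUMPTION of this sketch, not an established standard metric.
    """
    current: Dict[int, Tuple[str, float]] = {}  # position -> (token, shown since)
    emitted = 0
    late_revisions = 0
    for t, tokens in partials:
        # Retire tokens that changed or disappeared in this partial.
        for pos in list(current):
            token, since = current[pos]
            new_token = tokens[pos] if pos < len(tokens) else None
            if new_token != token:
                if t - since >= delay_s:
                    late_revisions += 1
                del current[pos]
        # Register tokens that are new at their position.
        for pos, token in enumerate(tokens):
            if pos not in current:
                current[pos] = (token, t)
                emitted += 1
    return late_revisions / emitted if emitted else 0.0


if __name__ == "__main__":
    log = [(0.0, ["I"]), (0.5, ["I", "scream"]), (2.0, ["ice", "cream"])]
    print(late_revision_rate(log, delay_s=1.0))
    print(late_revision_rate(log, delay_s=3.0))
```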
Tangentially, have you got any idea what the equivalent "partial tokens revised" rate for humans is? I know I've consciously experienced backtracking and re-interpreting words before, and presumably it happens subconsciously all the time. But that means there's a bound on how low it's reasonable to expect that rate to be, and I don't have an intuition for what it is.
Which program supports it in a way that allows streaming? Currently using Spokenly and Parakeet, but I would like to transition to a model that streams instead of transcribing chunk-wise.
This is awesome, well done guys, I’m gonna try it as my ASR component on the local voice assistant I’ve been building https://github.com/acatovic/ova. The tiny streaming latencies you show look insane
The streaming architecture looks really promising for edge deployments. One thing I'm curious about: how does the caching mechanism handle multiple concurrent audio streams? For example, in a meeting transcription scenario with 4-5 speakers, would each stream maintain its own cache, or is there shared state that could create bottlenecks?
I vibe-trained moonshine-tiny on amateur radio Morse code last weekend and was surprised at the ~2% CER I was seeing in evals. Over-the-air performance was pretty acceptable for a couple-hour run on a 4090.
Haven't tested it yet, but I'm wondering how it will behave with a lot of IT jargon and tech acronyms. For that reason I mostly had to run an LLM after STT, but that was slowing down Parakeet inference. Otherwise it sometimes had problems properly detecting terms such as CoreML, int8, fp16, half float, ARKit, AVFoundation, ONNX etc.
Implemented this to transcribe voice chat in a project and the streaming accuracy in English on this was unusable, even with the medium streaming model.
> This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.
> The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.
> The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder.
Presuming (I haven't checked myself) the git author information supports this, it should be fine to treat this as licensing the code it specifies under MIT. That is based on the license name being (to my understanding) unambiguous, on license application being a matter of contract law, which has at its very core the principle of "meeting of the minds", and on wilful infringement being really, really hard to even argue for when the only thing separating this from being 100% clearly licensed in all proper ways is not copying in an MIT `LICENSE` template with date and author name pasted into it.
reading through readme.md
"License
This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.
The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.
The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder."
[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
I'm actually a little surprised they haven't added model size to that chart.
https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
https://github.com/kitlangton/Hex
That's unfortunate. I think I can update my version but I have heard some bad things about performance from the newer update from my elder brother.
The one built in is much faster, and you only have to toggle it on.
Are these so much more accurate? I definitely have to correct stuff, but pretty good experience.
Also use speech to text on my iphone which seems to be the same accuracy.
edit: holy shit Parakeet is good.... Moonshine is impressive too, and it has half the params
Now if only there was something just as quick as Parakeet v3 for TTS ! Then I can talk to codex all day long!!!
Very lightweight and good quality
I was using AssemblyAI but this is fast and accurate and offline wtf!
It's incredible for a live transcription stream - the latency is WOW.
https://www.onresonant.com/
For the open source folks, that's also set up in handy, I think.
Weird to only release English as open weights.
Timestamp 1: 2026-02-25T00:31:28 1771979488 https://news.ycombinator.com/item?id=47145661
Timestamp 2: 2026-02-25T00:32:03 1771979523 https://news.ycombinator.com/item?id=47145666
Two detailed large comments in two different threads in a 35 second span from a new account.