Ask HN: What's the current best local/open speech-to-speech setup?

d4rkp4ttern · 2026-01-24T12:52:09 1769259129

This is not strictly speech-to-speech, but I quite like it when working with Claude Code or other CLI Agents:

STT: Handy [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate back to me what it understood, and it gives back a nicely structured version -- this helps confirm understanding and likely helps the CLI agent stay on track.

TTS: Pocket-TTS [2], just 100M params, and amazing speech quality (English only). I made a voice plugin [3] based on this for Claude Code, so it can speak out short updates whenever CC stops. It uses a non-blocking stop hook that calls a headless agent to create a 1-2 sentence summary. Turns out to be surprisingly useful. It's also fun, as you can customize the speaking style, mirror your vibe, etc.

The voice plugin gives commands to control it:

    /voice:speak stop
    /voice:speak azelma (change the voice)
    /voice:speak <your arbitrary prompt to control the style or other aspects>

[1] Handy https://github.com/cjpais/Handy

[2] Pocket-TTS https://github.com/kyutai-labs/pocket-tts

[3] Voice plugin for Claude Code: https://github.com/pchalasani/claude-code-tools?tab=readme-o...

skrebbel · 2026-01-24T16:41:53 1769272913

Wow Handy works impressively well! Excellent UX too (on Windows at least).

indigodaddy · 2026-01-24T16:20:10 1769271610

Hi, so I'm looking for an stt that can happen on a server/cron, that will use a small local model (I have 4 vCPU threadripper CPU only and 20G ram on the server) and be able to transcribe from remote audio URLs (preferably, but I know that local models probably don't have this feature so will have to do something like curl the audio down to memory or /tmp and then transcribe and then remove the file etc).

Have any thoughts?
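
The download, transcribe, delete flow described above could be sketched roughly as follows. This is a minimal sketch, not a recommendation of any particular model: `transcribe_url` and the `transcribe` callable are hypothetical names, and the callable stands in for whichever local CPU model you end up wrapping (it takes a file path and returns text).

```python
import os
import shutil
import tempfile
import urllib.request
from typing import Callable


def transcribe_url(url: str, transcribe: Callable[[str], str], suffix: str = ".wav") -> str:
    """Download `url` to a temp file, transcribe it, and always delete the file.

    `transcribe` is a placeholder for your local STT model wrapper.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url, timeout=60) as resp:
            # Stream to disk so a long recording never has to fit in RAM.
            shutil.copyfileobj(resp, out)
        return transcribe(path)
    finally:
        # Runs on success and on failure, so cron runs do not fill up /tmp.
        os.remove(path)
```

From a cron job you would call this once per URL and write the returned text wherever you need it; the `finally` block is what keeps failed downloads from leaving files behind.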

d4rkp4ttern · 2026-01-24T22:59:50 1769295590

I’ve no thoughts on that unfortunately.

3dsnano · 2026-01-24T14:38:15 1769265495

posts like this are why i visit HN daily!!!

thanks for sharing your knowledge; can’t wait to try out your voice plugin

d4rkp4ttern · 2026-01-24T23:18:48 1769296728

Same!

Feel free to file a gh issue if you have problems with the voice plugin

mpaepper · 2026-01-23T21:50:23 1769205023

You should look into the new Nvidia model: https://research.nvidia.com/labs/adlr/personaplex/

It has dual channel input / output and a very permissive license

zaken · 2026-01-24T05:12:37 1769231557

Oh man that space emergency example had me rolling

albert_e · 2026-01-24T08:57:12 1769245032

Ha --

and the "Customer Service - Banking" scenario claims that it demos "accent control" and the prompt gives the agent a definitely non-Indian name, yet the agent sounds 100% Indian - I found that hilarious, but isn't it also a bad example given they are claiming accent control as a feature?

mikkupikku · 2026-01-24T12:21:48 1769257308

"Sanni Virtanen", I guess it was meant to be Finnish? Maybe the "bank customer support" part threw the AI off, lmao.

adabyron · 2026-01-24T14:04:46 1769263486

Changing my title to "Astronaut" right now... I'll be using that line as well anytime someone asks me to do something.

hnlmorg · 2026-01-24T10:58:51 1769252331

Oh wow. That's definitely something…

cbrews · 2026-01-23T23:49:06 1769212146

Thanks for sharing this! I'm going to put this on my list to play around with. I'm not really an expert in this tech, I come from an audio background, but recently I was playing around with streaming Speech-to-Text (using Whisper) / Text-to-Speech (using Kokoro at the time) on a local machine.

The most challenging part of my build was tuning the inference batch size. I was able to get it working well for Speech-to-Text down to batch sizes of 200ms. I even implemented a basic local agreement algorithm and it was still very fast (inference time, I think, was around 10-20ms?). You're basically limited by the minimum batch size, NOT inference time. Maybe that's a missing "secret sauce" suggested in the original post?
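
For anyone unfamiliar with local agreement, here is a minimal sketch of one common form of the idea, not necessarily what was built here. It assumes the recognizer re-decodes the growing audio buffer on every batch and hands back a full hypothesis as a list of words; a word is only committed once two consecutive hypotheses agree on it. The class and method names are made up for illustration.

```python
from typing import List


class LocalAgreement:
    """Commit a word only once two consecutive hypotheses agree on it."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted, never revised
        self.previous: List[str] = []   # full hypothesis from the last batch

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest full hypothesis; return the newly committed words."""
        start = len(self.committed)
        agreed = start
        # Extend the stable prefix while the last two hypotheses match.
        while (
            agreed < len(hypothesis)
            and agreed < len(self.previous)
            and hypothesis[agreed] == self.previous[agreed]
        ):
            agreed += 1
        new_words = hypothesis[start:agreed]
        self.committed.extend(new_words)
        self.previous = hypothesis
        return new_words
```

With 200ms batches, a word needs to survive two decodes before it is shown, so the visible delay is on the order of two batch lengths plus inference time, which matches the point that batch size dominates.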

In the use case listed above, the TTS probably isn't a bottleneck as long as OP can generate tokens quickly.

All this being said, a wrapped model like this that can handle the hand-offs between these parts of the process sounds really useful, and I'll definitely be interested in seeing how it performs.

Let me know if you guys play with this and find success.

dsrtslnd23 · 2026-01-23T21:58:56 1769205536

oh - very interesting indeed! thanks

vulkoingim · 2026-01-24T08:04:55 1769241895

I'm using https://spokenly.app/ in local mode, which is free. Very happy with it. It supports a bunch of models, including whisper and parakeet. Right now I'm mostly using parakeet v3 on my desktop; it is very fast, but it tends to make a few more errors. I cycle between it and Distil-Whisper Large V3.5, which is a bit slower.

On iOS I'm also using the same app, with the Apple Speech model, which I found performs better for me than parakeet/whisper. One drawback of the Apple model is that you need iOS/macOS 26+ - and I haven't bothered to update to Tahoe on my Mac.

Both of the models work instantly for me (Mac M1, iPhone 17 Pro).

Edit: Aaaand I just saw that you're looking for speech-to-speech. Oops, still sleeping.

jauntywundrkind · 2026-01-23T21:39:05 1769204345

It was a little annoying getting old qt5 tools installed but I really enjoyed using dsnote / Speech Note. Huge model selection for my amd gpu. Good tool. I haven't done enough specific studying yet to give you suggestions for which model to go with. WhisperFlow is very popular.

Kyutai always does some very interesting work. Their delayed streams work is bleeding edge & sounds very promising, especially for low latency. Not sure why I have not yet tried it tbh. https://github.com/kyutai-labs/delayed-streams-modeling

There's also a really nice, elegant, simple app: Handy. It only supports Whisper and Parakeet V3, but it's a nice app & those are amazing models. https://github.com/cjpais/Handy

supermatt · 2026-01-24T15:55:49 1769270149

There was a great post the other day showing low latency end to end using Nvidia models on a single GPU with pipecat

Discussion: https://news.ycombinator.com/item?id=46528045

Article: https://www.daily.co/blog/building-voice-agents-with-nvidia-...

timwis · 2026-01-24T08:40:42 1769244042

Home Assistant have a fully local voice assistant experience that's very pluggable and customisable. I believe it uses a fast whisper model for STT and piper for TTS.

You can run it on a raspberry pi (or ideally an N100+), and for the microphone/speaker part, you can make your own or buy their off the shelf voice hardware, which works really well.

https://www.home-assistant.io/voice-pe/

stavros · 2026-01-24T10:46:42 1769251602

Unfortunately I didn't manage to figure out how to make their hardware work without a HA installation. I'd really love to do that, so if anyone has any info on how their protocol works, please do tell.

I looked at their Wyoming docs online but couldn't really see how to even let it find the server, and the ESPhome firmware it runs offered similarly few hints.

dfajgljsldkjag · 2026-01-24T00:53:23 1769216003

It requires a bit of tinkering, but I think pipecat is the way to go. You can plug in pretty much any STT/LLM/TTS you want and go. It definitely supports local models, but it's up to you to get your hands on those models.

Not sure if there are any turnkey setups that are preconfigured for local install where you can just press play and go though.

Last I heard E2E speech to speech models are still pretty weak. I've had pretty bad results from gpt-realtime and that's a proprietary model, I'm assuming open source is a bit behind.

storystarling · 2026-01-24T15:35:48 1769268948

I suspect the glued pipeline is going to remain dominant for a while, mostly because the intermediate text layer is structural, not just a byproduct. If you drop the text for a pure E2E model, you suddenly lose the ability to easily inject RAG context or handle complex tool use. I've been building some agent workflows recently and having that text state to pass into something like LangGraph is the only way to reliably control the logic. Without it, you are basically flying blind on the backend.
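
To make the "text layer is structural" point concrete, here is a minimal sketch of a single turn through a glued pipeline. Everything in it is hypothetical: `run_turn` and the four callables are placeholders for whatever STT, retrieval, LLM, and TTS components you actually use, and the prompt layout is just an illustration.

```python
from typing import Callable, List


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One conversational turn through a glued ASR -> LLM -> TTS pipeline."""
    user_text = stt(audio)
    # The transcript is plain text, so retrieval (or tool routing, logging,
    # guardrails) can inspect it before the LLM ever sees it.
    context = retrieve(user_text)
    prompt = "\n".join(["Context:"] + context + ["User: " + user_text])
    reply_text = llm(prompt)
    return tts(reply_text)
```

The `user_text` and `prompt` variables are exactly the hook points a pure audio-in, audio-out model does not expose.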

gunalx · 2026-01-24T21:33:52 1769290432

Yep, this is something end-to-end models need to solve to be ideal, I think. I have seen a split-brain architecture with one speaking brain and one thinking brain. Ideally the thinking one would have text tokens as input and output, so it can refine its reasoning and use RAG + tools, while the audio brain does the audio decoding in parallel.

dsrtslnd23 · 2026-01-24T07:45:18 1769240718

yes, I am currently playing with pipecat - both with ASR + LLM + TTS pipeline and also speech to text (ultravox) + TTS but haven't been successful with local speech to speech setups yet.

amelius · 2026-01-23T23:35:14 1769211314

For the TTS part: https://github.com/supertone-inc/supertonic

nsbk · 2026-01-24T11:20:54 1769253654

I'm putting together a streaming ASR + LLM + streaming TTS setup based on Nvidia speech models: nemotron ASR and magpie TTS, pipecat to glue everything together, plus an LLM of your choice. I added Spanish support using canary models, as magpie models are English-only and it still works really well.

The work is based on a repo by pipecat that I forked and modified to be more comfortable to run (docker compose for the server and client), added Spanish support via canary models, and added Nvidia Ampere support so it can run on my 3090.

The use case is a conversation partner for my gf who is learning Spanish, and it works incredibly well. For the LLM I settled on Mistral-Small-3.2-24B-Instruct-2506-Q4_K_S.gguf

https://github.com/nsbk/nemotron-january-2026

nsbk · 2026-01-25T09:46:33 1769334393

I got the models the wrong way around. Nemotron-speech ASR is the one that is English-only. Magpie TTS is multilingual and can do both English and Spanish

soulofmischief · 2026-01-24T04:30:01 1769229001

I have a great local assistant that works end-to-end with voice. It's built on local, web-first technologies, it fits small LLMs in memory and manages inference and TTS/STT without stuttering. I've been shaping it up over a couple years and constantly switching out new models.

If you want something simple that runs in browser, look at vosk-browser[0] and vits-web[1].

I'd also recommend checking out KittenTTS[2], I use it and it's great for the size/performance. However, you'd need to implement a custom JavaScript harness for the model since it's a python project. If you need help with that, shoot me an email and I can share some code.

There are other great approaches too if you don't mind python, personally I chose the web as a platform in order to make my agent fully portable and remote once I release it.

And of course, NVIDIA's new model just came out last week[3] but I haven't gotten to test it out just yet, and also there was the recent Sparrow-1[4] announcement which shows people are finally putting money into the problems plaguing voice agents that are rigged up from several models and glue infrastructure, vs a single end-to-end model or at least a conversational turn-taking model to keep things on rails.

[0] https://www.npmjs.com/package/vosk-browser

[1] https://github.com/diffusionstudio/vits-web

[2] https://github.com/KittenML/KittenTTS

[3] https://research.nvidia.com/labs/adlr/personaplex/

[4] https://www.tavus.io/post/sparrow-1-human-level-conversation...

andhuman · 2026-01-24T07:52:20 1769241140

I built this recently. I used nvidia parakeet as STT, open wake word for wake word detection, mistral ministral 14b as LLM and pocket tts for tts. Fits snugly in my 16 gb VRAM. Pocket is small and fast and has good enough voice cloning. I first used the chatterbox turbo model, which performs better and even supports some simple paralinguistic words like (chuckle) that made it more fun, but it was just a bit too big for my rig.

PhilippGille · 2026-01-24T08:12:20 1769242340

OP asked:

> Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?

Your setup is the latter, not the former.

schobi · 2026-01-24T08:32:23 1769243543

Oh... Having a local-only voice assistant would be great. Maybe someone can share the practical side of this.

Do you have the GPU running all day at 200W to scan for wake words? Or is that running on the machine you are working on anyway?

Is this running from a headset microphone (while sitting at the desk?) or more like a USB speakerphone? Is there an Alexa jailbreak / alternative firmware as a frontend and run this on a GPU hidden away?

butvacuum · 2026-01-24T11:57:58 1769255878

Wake words are generally processed extremely early in the pipeline. So if you capture audio with, say, an ESP32, the uC does the wake word watching.

There are even microphone ADCs and DSPs (if you use a mic that outputs PCM/I2S instead of analog) that do the processing internally.

dwa3592 · 2026-01-24T18:02:41 1769277761

I recently trained an stt model which detects about 40 words - the model is less than <hold your breath> 50 kilobytes. It can run on a <$1 chip.

marsbars241 · 2026-01-24T01:49:11 1769219351

Tangential: What hardware are you using for the interface on these? Is there a good array microphone that performs on par with echos/ghomes/homepods?

doonielk · 2026-01-24T06:33:04 1769236384

I did a MLX "streaming ASR + LLM + streaming TTS" pipeline in early 2024. I haven't worked on it since then so it's dated. There are now better versions of all the models I used.

I was able to get conversational latency with the ability to interrupt the pipeline on a Mac, using a variety of tricks. It's MLX, so only relevant if you have a Mac.
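
As a generic illustration of what "interrupting the pipeline" can mean (this is not taken from the repo below, and all names are made up), here is a small asyncio sketch: playback runs as a task and is cancelled as soon as a barge-in event fires. The `speak` coroutine only simulates streaming TTS playback with short sleeps.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for streaming TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play


async def speak_until_interrupted(
    chunks: List[str], played: List[str], interrupted: asyncio.Event
) -> bool:
    """Play chunks, but cancel playback as soon as `interrupted` is set.

    Returns True if playback finished, False if the user barged in.
    """
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(interrupted.wait())
    done, pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let the cancelled task unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return playback in done
```

In a real setup the event would be set by voice activity detection on the microphone stream, and the same cancellation would also need to stop any in-flight LLM generation.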

https://github.com/andrewgph/local_voice

For MLX speech to speech, I've seen:

The mlx-audio package has some MLX implementations of speech to speech models: https://github.com/Blaizzy/mlx-audio/tree/main

kyutai Moshi, maybe old now but has a MLX implementation of their speech to speech model: https://github.com/kyutai-labs/moshi

zahlman · 2026-01-24T06:55:35 1769237735

What exactly do you want the pipeline to do that cares about the input being "speech", or indeed that's different from just sending mic -> speaker directly? (I can imagine a few different things, but I want to figure out if your use case sounds like mine, or what suggestions are appropriate for what tasks.)

beauregardener · 2026-01-24T18:28:09 1769279289

I had this need recently and didn't want to use resource hungry models... so I created this:

https://github.com/caseys/hear-say

varik77 · 2026-01-24T01:52:24 1769219544

I have used https://github.com/SaynaAI/sayna . What I like the most is that you can switch between the providers easily and see what works for you the best. It also supports local models.

sails · 2026-01-24T07:32:25 1769239945

Looking for an iOS app to test this, as I’m generally curious about the capabilities of on-device TTS (yet to find an app, but there are loads for text gen)

It can’t be too far off considering Siri and TTS have been on devices for ages

sgt · 2026-01-24T09:30:29 1769247029

While on this subject, what's the go-to speech-to-text transcription model (open source or proprietary, doesn't matter) if you have to support a lot of languages really well?

nemima · 2026-01-24T09:43:14 1769247794

If proprietary/SaaS fits your use case I can recommend Speechmatics. Has a wider range of languages than a lot of the competition: https://speechmatics.com

(Full disclosure I'm an engineer there)

sgt · 2026-01-24T12:51:49 1769259109

Will it work with, say, someone speaking English with some Hindi mixed in? I'm not from there so I'm not sure how prevalent that is, but I've been told it's quite common to "mix it up" in India, and I probably need to cater for that use case.

PS if you can share your email I'll pop you an email about Speechmatics. I tried the English version and it's impressive.

nemima · 2026-01-24T13:27:08 1769261228

This is definitely the sort of use case we aim to support! I would need to check about Hindi specifically, but we have several bilingual models already with more to come:

https://docs.speechmatics.com/speech-to-text/languages#trans...

Drop me an email at mattn@speechmatics.com and we can chat about further details :)

dvfjsdhgfv · 2026-01-24T10:38:25 1769251105

I spent a few days on a similar scenario without much success (a scenario where one person speaks and then their speech is translated, and I want just the original or both).

An API call to GPT4o works quite well (it basically handles both transcription and diarization), but I wanted a local model.

Whisper is really good for 1 person speaking. With more people you get repetitions. Qwen and other open multimodal models give subpar results.

I tried a multipass approach, with the first pass identifying the language and chunking and the next one doing the actual transcription, but this tended to miss a lot of content.

I'm going to give canary-1b-v2 a try next weekend. But it looks like, in spite of enormous development in other areas, speech recognition has stalled since Whisper's release (more than 3 years already?).

ripped_britches · 2026-01-24T04:29:26 1769228966

speech to speech is not nearly as good as livekit IMO ("old school" sequence of transcribe, LLM, synthesize). depends on what you're doing of course, but this is because the LLMs are just way smarter than the speech to speech models, which are pretty much the worst (again IMO) at anything beyond basic banter. and livekit is just a framework so you can hook it up with any models in the stack. im not an expert on the local parts but i would assume this is pretty easy to glue together.

vidarh · 2026-01-24T09:43:13 1769247793

They work for two entirely different things. The problem with these pipelines is that unless the latency is very low they simply aren't suitable replacements for Alexa etc. For that use case, low latency beats smarts.

ripped_britches · 2026-01-24T14:56:33 1769266593

The latency is very very low in my experience, it would definitely work well as an Alexa style assistant

DANmode · 2026-01-24T01:07:42 1769216862

https://handy.computer got good marks from a very nontechnical user in my life this week!

Local, FOSS

benatkin · 2026-01-24T03:50:57 1769226657

To save a click, it's just a fancy front end for Whisper plus a weaker CPU-only model. It has a demo video that seems impressive, but the speech is careful to sound casual while having no meaningful flaws that would cause it to mess up. If you want to make a speech to speech tool, which is what this post asks about, it would make more sense to go straight to Whisper.

joshribakoff · 2026-01-24T05:26:36 1769232396

I use it, sponsor it, and did a small pr. One of its goals is to be the most “forkable” starting point if i recall. But yes it's just voice input. It’s meaningfully better than the mac dictation for me.

tuananh · 2026-01-24T05:06:45 1769231205

you can use gpu too. i have to admit the app is very easy to use and super convenient. kudos to creator

benatkin · 2026-01-24T05:47:05 1769233625

Yes, and with GPU, it's Whisper, which has been mentioned elsewhere in this article's comments. I mean that handy.computer provides the other option as a fallback for those who can't or don't want to use the GPU.

hedgehog · 2026-01-24T02:04:28 1769220268

I haven't tried them myself but Kyutai has a couple of projects that could fit.

https://kyutai.org

Johnny_Bonk · 2026-01-23T23:29:25 1769210965

Anyone using any reasonably good small speech to text os models?

woudsma · 2026-01-24T08:59:51 1769245191

I’m using whisper with superwhisper on my mac. I’ve assigned a key on my keyboard, when I press the key it starts listening and when I release it, the text gets copied to the current cursor location. It works pretty well.

d4rkp4ttern · 2026-01-24T12:56:59 1769259419

Parakeet V3 is near-instant transcription, and the slight accuracy drop relative to the slower/bigger Whisper models is immaterial when talking to AIs that can “read between the lines”.

garblegarble · 2026-01-23T23:32:57 1769211177

For my inputs, whisper distil-large-v3.5 is the best. I tried Parakeet 0.6 v3 last night but it has higher error rates than I'd like (but it is fast...)

Johnny_Bonk · 2026-01-23T23:35:16 1769211316

Nice I'll try it, as of now for my personal stt workflow I use eleven labs api which is pretty generous but curious to play around with other options

garblegarble · 2026-01-23T23:46:55 1769212015

I assume that will be better than whisper - I haven't benchmarked it against cloud models, the project I'm working on cannot send data out to cloud models

BiraIgnacio · 2026-01-23T23:52:38 1769212358

oh I've been looking into whisper and vosk in the last few days. I'll probably go with whisper (with whisper.cpp) but has anyone compared it to vosk models?

masardigital · 2026-01-24T13:50:26 1769262626