I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).
Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.
What are people actually using in 2026 if they want open + local voice?
Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?
If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?
What’s the most “works today” combo on a single GPU?
Bonus: rough numbers people see for mic → first audio back
Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
STT: Handy [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate back to me what it understood, and it gives back a nicely structured version -- this helps confirm understanding as well as likely helps the CLI agent stay on track.
TTS: Pocket-TTS [2], just 100M params, and amazing speech quality (English only). I made a voice plugin [3] based on this, for Claude Code so it can speak out short updates whenever CC stops. It uses a non-blocking stop hook that calls a headless agent to create the 1/2-sentence summary. Turns out to be surprisingly useful. It's also fun as you can customize the speaking style and mirror your vibe etc.
The voice plugin gives commands to control it:
[1] Handy https://github.com/cjpais/Handy[2] Pocket-TTS https://github.com/kyutai-labs/pocket-tts
[3] Voice plugin for Claude Code: https://github.com/pchalasani/claude-code-tools?tab=readme-o...
reply