Local Voice Assistant: Whisper + Piper + Home Assistant

By SumGuy 10 min read
Your Smart Home Is Phoning Home. Let’s Fix That.

Every time you say “Hey Alexa, turn off the living room lights,” that audio clip takes a round trip to Amazon’s servers, gets processed, gets logged, gets used to train models, and then — eventually — your lights turn off. Cool feature. Bad deal.

Google Home does the same thing. Apple HomeKit is better about privacy but still leans on iCloud for a lot of the heavy lifting. The dirty secret of “smart home” assistants is that the smart part lives somewhere else, and you’re just renting access to it.

Here’s the thing: the tools to run this entirely yourself have been production-ready for a couple of years now, and in 2026 they’re genuinely good. We’re talking sub-second response times on modest hardware, natural-sounding TTS voices, and tight Home Assistant integration. No cloud. No subscription. No data leaking out of your LAN.

The stack: Wyoming protocol to wire everything together, Whisper for speech-to-text, Piper for text-to-speech, openWakeWord for wake word detection, and Home Assistant Assist (with an optional small LLM) for the brains. Let’s build it.

Full example: Clone the working Compose files at github.com/KingPin/sumguy-examples/self-hosting/local-voice-assistant-whisper-piper-ha/


The Architecture: Wyoming Is the Glue

If you’ve seen the older Whisper STT and Piper TTS articles here, you already have the individual pieces. This is the integration story — how they talk to Home Assistant and to your microphone hardware.

Wyoming is a simple TCP-based protocol that Home Assistant uses to talk to satellite services. Each service exposes a socket, Home Assistant connects to it, and they exchange audio frames and transcription events. That’s it. No message queue, no REST API, no ceremony. It’s the “just pipe it over TCP” solution that the community actually rallied around after Rhasspy got long in the tooth.
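
To make "exchange audio frames and transcription events" a bit more concrete, here is a minimal sketch of the framing as I understand it from the Wyoming spec: each event is one line of JSON (the header), optionally followed by extra JSON data and a binary payload whose sizes are announced in the header. The helper names `encode_event` and `read_event` are made up for this example; they are not part of the official `wyoming` Python package, and real services may lay out the fields slightly differently.

```python
import io
import json


def encode_event(event_type, data=None, payload=b""):
    """Serialize one event: JSON header line, then optional data and payload bytes."""
    data_bytes = json.dumps(data).encode("utf-8") if data else b""
    header = {"type": event_type}
    if data_bytes:
        header["data_length"] = len(data_bytes)
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + data_bytes + payload


def read_event(stream):
    """Read one event from a binary file-like object.

    Returns (type, data, payload), or None when the stream is exhausted.
    """
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    # Data may sit inline in the header and/or in a separate sized block.
    data = dict(header.get("data") or {})
    data_length = header.get("data_length") or 0
    if data_length:
        data.update(json.loads(stream.read(data_length)))
    payload = stream.read(header.get("payload_length") or 0)
    return header["type"], data, payload


# Simulate a tiny exchange in memory: one audio chunk in, one transcript back.
wire = encode_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01" * 4)
wire += encode_event("transcript", {"text": "turn off the living room lights"})
stream = io.BytesIO(wire)
print(read_event(stream))
print(read_event(stream))
```

Against a live service you would wrap the TCP socket with `sock.makefile("rb")` and pass that to `read_event`. The point of the sketch is only that the framing is simple enough to fit in about thirty lines.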

The full data flow looks like this:

ESP32 / Atom Echo / Respeaker mic
        ↓ (audio stream)
Wyoming Satellite (on the device or on a Pi)
        ↓ (Wyoming protocol over TCP)
Whisper STT server → transcribed text
        ↓
Home Assistant Assist → intent matching / LLM
        ↓
Piper TTS server → audio response
        ↓
Wyoming Satellite → speaker output

The Compose stack runs Whisper, Piper, openWakeWord, and optionally Music Assistant on your main server. Home Assistant connects to those services via Wyoming. Your microphone hardware just needs to be on the same LAN.


Hardware: Pick Your Mic

You’ve got a few good options depending on your budget and how DIY you want to go:

Home Assistant Voice PE / Atom Echo: Nabu Casa’s Home Assistant Voice Preview Edition (Voice PE) is probably the easiest entry point. It’s an ESP32-S3 with a dual-mic array and a speaker, and it runs ESPHome, so it connects to Home Assistant through the ESPHome integration rather than as a Wyoming satellite. Set it up, add it to HA, done. M5Stack’s Atom Echo is smaller and cheaper but has a single mic: fine for quiet rooms, annoying in a kitchen.

Respeaker USB Mic Array — Seeed’s Respeaker 4-Mic Array is USB, works on any Pi or x86 box running wyoming-satellite. Good directional capture. If you already have a Pi sitting somewhere, this is the lazy path.

Wyoming Satellite on a Pi — For a custom enclosure or existing Pi 3B+/4/5, rhasspy/wyoming-satellite runs as a Python service, takes a USB or I2S mic as input, and connects back to your HA instance. A Pi Zero 2W + Respeaker 2-Mic HAT fits in an Altoids tin if that’s your thing.

The Pi 5 + Google Coral USB accelerator combination is worth calling out specifically: if you can get Whisper inference offloaded to the Coral’s TPU, latency drops to under 200ms on the tiny model. The stock wyoming-whisper image does not use the Coral out of the box, so treat this as a tinkering project. Overkill for most people, genuinely satisfying if you care about response times.


The Compose Stack

Here’s the full stack. Adjust WHISPER_MODEL based on your hardware — tiny.en runs on anything, base.en is the sweet spot for accuracy vs. speed on modern CPUs, small.en if you have a GPU or a Pi 5 + Coral.

docker-compose.yml
services:
  whisper:
    image: rhasspy/wyoming-whisper
    container_name: wyoming-whisper
    restart: unless-stopped
    ports:
      - "10300:10300"
    volumes:
      - whisper_data:/data
    environment:
      - WHISPER_MODEL=base.en
      - WHISPER_LANGUAGE=en
    command: --uri tcp://0.0.0.0:10300 --model base.en --language en

  piper:
    image: rhasspy/wyoming-piper
    container_name: wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - piper_data:/data
    command: --uri tcp://0.0.0.0:10200 --voice en_US-lessac-medium

  openwakeword:
    image: rhasspy/wyoming-openwakeword
    container_name: wyoming-openwakeword
    restart: unless-stopped
    ports:
      - "10400:10400"
    volumes:
      - oww_data:/data
    command: --uri tcp://0.0.0.0:10400 --preload-model ok_nabu

  music-assistant:
    image: ghcr.io/music-assistant/server:latest
    container_name: music-assistant
    restart: unless-stopped
    network_mode: host
    volumes:
      - music_data:/data
    privileged: true

volumes:
  whisper_data:
  piper_data:
  oww_data:
  music_data:

A few notes on that config:

Spin it up:

Terminal window
docker compose up -d
docker compose logs -f whisper # watch for model download

First boot downloads the Whisper model — base.en is about 140MB. Piper voices download lazily on first use.
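
If you want a quick sanity check that all three Wyoming services are actually listening before you touch Home Assistant, something like the following works. It is a rough sketch: the port numbers come from the Compose file above, `127.0.0.1` assumes you run it on the Docker host itself, and the function names are my own. It only confirms that the TCP port accepts connections, not that the model has finished loading.

```python
import socket

# Ports as published in the Compose file; adjust if you remapped them.
SERVICES = {"whisper": 10300, "piper": 10200, "openwakeword": 10400}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_stack(host, services=SERVICES):
    """Map each service name to whether its port is accepting connections."""
    return {name: port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_stack("127.0.0.1").items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```

Run it from another machine with your server's LAN address instead of `127.0.0.1` to confirm the ports are reachable from where Home Assistant lives.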


Wiring It Into Home Assistant

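Once your services are running, register each one in HA under Settings → Devices & Services → Add Integration → Wyoming Protocol, entering your server’s address and the service’s port (10300 for Whisper, 10200 for Piper, 10400 for openWakeWord).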

configuration.yaml
# Wyoming services are added through the UI (or discovered automatically),
# so they usually need no YAML at all.
# Optional: save the audio from each pipeline run to disk for debugging
assist_pipeline:
  debug_recording_dir: /tmp/assist_debug

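In the UI: Settings → Voice Assistants → Add Assistant. Select:

Conversation agent: Home Assistant
Speech-to-text: your Whisper service
Text-to-speech: your Piper service
Wake word: your openWakeWord service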

The “Conversation agent” is where the intent matching happens. Home Assistant’s built-in Assist handles a solid set of commands — lights, switches, covers, climate, media — without needing any LLM at all. It’s fast, deterministic, and works offline. If you want free-form queries (“what’s the energy usage this week?” or “is anyone home?”), wire in a local LLM via the Ollama integration instead.
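
To show why template-style intent matching is fast and deterministic, here is a toy version. This is emphatically not how Home Assistant implements Assist; it is a hypothetical two-pattern matcher that borrows the `HassTurnOn` / `HassTurnOff` intent names for flavor. A sentence either fits a known template or it doesn't, and anything that doesn't is what you would hand to an LLM.

```python
import re

# Hypothetical sentence templates; the real Assist templates are far richer.
PATTERNS = [
    (re.compile(r"^turn on (?:the )?(?P<name>.+)$"), "HassTurnOn"),
    (re.compile(r"^turn off (?:the )?(?P<name>.+)$"), "HassTurnOff"),
]


def match_intent(text):
    """Return {'intent', 'name'} for a recognized command, else None."""
    cleaned = text.lower().strip().rstrip(".!?")
    for pattern, intent in PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return {"intent": intent, "name": match.group("name")}
    return None  # no template fits: this is where an LLM fallback would kick in


print(match_intent("Turn off the living room lights."))
print(match_intent("What's the energy usage this week?"))
```

The first call resolves to a turn-off intent for "living room lights"; the second returns `None`, which is exactly the kind of free-form query the Ollama route is for.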


Custom Wake Words with openWakeWord

ok_nabu is recognizable but not great. openWakeWord lets you train a custom wake word in a Google Colab notebook (ironic, but free): you type the phrase and the notebook generates synthetic speech samples to train on, so no recordings of your own voice are required.

The trained model drops in as a .tflite file:

docker-compose.yml
openwakeword:
  image: rhasspy/wyoming-openwakeword
  volumes:
    - oww_data:/data
    - ./custom_wakewords:/custom  # mount your .tflite files here
  command: >
    --uri tcp://0.0.0.0:10400
    --custom-model-dir /custom
    --preload-model hey_jarvis

Pre-trained models available for openWakeWord include hey_jarvis, alexa (yes, you can repurpose Alexa’s wake word for your local stack; petty but satisfying), and hey_mycroft. Check the openWakeWord GitHub for the model zoo.

Accuracy tuning: if you’re getting false positives, increase --threshold (default 0.5, try 0.7). If it misses triggers, lower it. The sweet spot depends on your room acoustics and how much HVAC noise you’re dealing with.
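
The threshold trade-off is easier to see with numbers. The sketch below uses entirely made-up detection scores (not measurements from any real model) and counts false positives and missed triggers at a few thresholds. `evaluate_threshold` is a hypothetical helper, not part of openWakeWord.

```python
def evaluate_threshold(scored_clips, threshold):
    """Count errors at a given threshold.

    scored_clips: list of (score, is_wake_word) pairs.
    Returns (false_positives, misses).
    """
    false_positives = sum(
        1 for score, is_wake in scored_clips if score >= threshold and not is_wake
    )
    misses = sum(
        1 for score, is_wake in scored_clips if score < threshold and is_wake
    )
    return false_positives, misses


# Invented scores: True = someone actually said the wake word.
clips = [
    (0.92, True), (0.81, True), (0.74, True), (0.66, True),
    (0.62, False), (0.58, False), (0.31, False),
]

for threshold in (0.5, 0.7, 0.9):
    fp, missed = evaluate_threshold(clips, threshold)
    print(f"threshold={threshold}: false positives={fp}, misses={missed}")
```

With this toy data, 0.5 lets two noise clips through, 0.7 removes them at the cost of one missed trigger, and 0.9 misses three of the four real triggers. Your own sweet spot will look different, but the shape of the trade-off is the same.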


Latency: What to Actually Expect

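On a modern CPU box (Ryzen 5 or better, 2020+): roughly 500–800ms with base.en.

On a Pi 4 (4GB): 1–2s with tiny.en, and slower with anything larger.

On a Pi 5 + Google Coral USB: around 500ms with base.en.

GPU box (RTX 3060 or better): under 300ms with small.en.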

If you’re on a Pi 4 and the latency is killing you, drop to tiny.en. Accuracy takes a hit on accented speech and mumbling, but for home automation commands (“turn off the bedroom light”) it’s totally fine. The vocabulary of home control is small and well-constrained.


Multilingual Support

Whisper is genuinely multilingual — it handles 99 languages including decent support for Spanish, French, German, Portuguese, Japanese, and Mandarin. Switch the model from base.en to base (the multilingual variant) and set the language explicitly:

docker-compose.yml
environment:
  - WHISPER_MODEL=base
  - WHISPER_LANGUAGE=es  # ISO 639-1 language code
command: --uri tcp://0.0.0.0:10300 --model base --language es

For Piper, swap the voice model to the appropriate language. The Piper voice list covers 40+ languages — check rhasspy.github.io/piper-samples for samples before committing to a voice. Quality varies significantly by language; English and German have the best models right now.

Home Assistant Assist’s intent matching is language-aware too — set your assistant’s language in the pipeline config and it’ll route to the right intent parser.
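
The model-naming rule is easy to get wrong by hand (leaving `.en` on a Spanish setup, for instance), so here is a small sketch that builds the Whisper `command:` arguments from a model size and a language code. The flags are the ones used in the Compose snippets above. The function name and the list of sizes with English-only variants are my own assumptions for illustration.

```python
# Sizes assumed here to have an English-only ".en" variant.
SIZES_WITH_EN_VARIANT = {"tiny", "base", "small", "medium"}


def whisper_command(model_size, language, port=10300):
    """Build the argument list for the Whisper service's command line.

    English gets the ".en" model where one exists; every other language
    gets the multilingual model of the same size.
    """
    if language == "en" and model_size in SIZES_WITH_EN_VARIANT:
        model = f"{model_size}.en"
    else:
        model = model_size
    return [
        "--uri", f"tcp://0.0.0.0:{port}",
        "--model", model,
        "--language", language,
    ]


print(" ".join(whisper_command("base", "en")))
print(" ".join(whisper_command("base", "es")))
```

The second line of output matches the Spanish `command:` shown above, which is a decent sign the rule is encoded correctly.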


Rhasspy vs. Wyoming: What Happened?

If you’ve been in the home automation space for a while, you remember Rhasspy — the original fully-offline voice assistant platform. It’s still alive (v2.5.x), still works, and still has a dedicated community. But Wyoming is where active development is happening.

The difference in practice: Rhasspy is a monolithic application that packages STT, TTS, NLU, and wake word detection together. Wyoming is a protocol — a set of socket interfaces that let you mix and match services however you want. Whisper is a Wyoming service. Piper is a Wyoming service. Your wake word detector is a Wyoming service. Home Assistant speaks Wyoming natively.

If you have a working Rhasspy setup, there’s no urgency to migrate. But for new installs in 2026, Wyoming is the modern path. It’s more modular, better maintained, and HA integration is first-class.


The Part Where Amazon Doesn’t Get Your Grocery List

Honestly, this is the thing that sold me. Every voice command you issue stays on your hardware. Your shopping list, your medication reminders, your morning briefings — none of it leaves your LAN. There’s no “skill” to enable, no premium tier, no sudden “we’re changing the API” deprecation notice at 2 AM.

The trade-off is real: setup takes an afternoon instead of five minutes, and you’ll spend some time tuning wake word sensitivity and Piper voice settings. But once it’s running, it just works. And when something breaks, it breaks locally where you can actually debug it.

Your Alexa has been listening. Time to evict it.


Quick Reference

| Hardware | Recommended Model | Expected Latency |
|---|---|---|
| Pi 4 (4GB) | tiny.en | 1–2s |
| Pi 5 + Coral USB | base.en | ~500ms |
| x86 CPU (modern) | base.en | 500–800ms |
| x86 + NVIDIA GPU | small.en | <300ms |

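Key ports:

10300: Whisper (speech-to-text)
10200: Piper (text-to-speech)
10400: openWakeWord (wake word detection)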

For the full Compose stack, configuration examples, and custom wake word training notes, grab the files from the examples repo linked at the top of this article. Your 2 AM self will appreciate having everything in one place.

