Your Smart Home Is Phoning Home. Let’s Fix That.
Every time you say “Hey Alexa, turn off the living room lights,” that audio clip takes a round trip to Amazon’s servers, gets processed, gets logged, gets used to train models, and then — eventually — your lights turn off. Cool feature. Bad deal.
Google Home does the same thing. Apple HomeKit is better about privacy but still leans on iCloud for a lot of the heavy lifting. The dirty secret of “smart home” assistants is that the smart part lives somewhere else, and you’re just renting access to it.
Here’s the thing: the tools to run this entirely yourself have been production-ready for a couple of years now, and in 2026 they’re genuinely good. We’re talking sub-second response times on modest hardware, natural-sounding TTS voices, and tight Home Assistant integration. No cloud. No subscription. No data leaking out of your LAN.
The stack: Wyoming protocol to wire everything together, Whisper for speech-to-text, Piper for text-to-speech, and Home Assistant Assist (with optional small LLM) for the brains. Let’s build it.
Full example: Clone the working Compose files at github.com/KingPin/sumguy-examples/self-hosting/local-voice-assistant-whisper-piper-ha/
The Architecture: Wyoming Is the Glue
If you’ve seen the older Whisper STT and Piper TTS articles here, you already have the individual pieces. This is the integration story — how they talk to Home Assistant and to your microphone hardware.
Wyoming is a simple TCP-based protocol that Home Assistant uses to talk to satellite services. Each service exposes a socket, Home Assistant connects to it, and they exchange audio frames and transcription events. That’s it. No message queue, no REST API, no ceremony. It’s the “just pipe it over TCP” solution that the community actually rallied around after Rhasspy got long in the tooth.
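To make "exchange audio frames and transcription events" a bit more concrete, here is a minimal sketch of the framing as I understand it from the Wyoming README: each event is one line of JSON, optionally followed by extra JSON data and a binary payload whose lengths are announced in the header. The helper names `write_event` and `read_event` are made up for this illustration, and the sketch runs against an in-memory buffer rather than a real socket. For real work, use the `wyoming` Python package instead.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple


def write_event(stream: BinaryIO, event_type: str,
                data: Optional[dict] = None, payload: bytes = b"") -> None:
    """Write one event: a JSON header line, then the raw payload bytes (if any)."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    stream.write(payload)


def read_event(stream: BinaryIO) -> Optional[Tuple[str, dict, bytes]]:
    """Read one event back; returns None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    data = header.get("data") or {}
    # Assumption: some peers send the data dict as a separate JSON blob
    # after the header line, announced via "data_length".
    data_length = header.get("data_length") or 0
    if data_length:
        data = {**data, **json.loads(stream.read(data_length))}
    payload = stream.read(header.get("payload_length") or 0)
    return header["type"], data, payload


# Round trip through an in-memory buffer standing in for the TCP socket.
buf = io.BytesIO()
pcm = b"\x00\x01" * 160  # fake 16-bit audio samples
write_event(buf, "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, pcm)
write_event(buf, "transcript", {"text": "turn off the living room lights"})
buf.seek(0)
first = read_event(buf)
second = read_event(buf)
```

After the round trip, `first` holds the audio chunk with its payload intact and `second` holds the transcript event, which is roughly what crosses the wire between a satellite, Whisper, and Home Assistant.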
The full data flow looks like this:
```
ESP32 / Atom Echo / Respeaker mic
        ↓ (audio stream)
Wyoming Satellite (on the device or on a Pi)
        ↓ (Wyoming protocol over TCP)
Whisper STT server → transcribed text
        ↓
Home Assistant Assist → intent matching / LLM
        ↓
Piper TTS server → audio response
        ↓
Wyoming Satellite → speaker output
```

The Compose stack runs Whisper, Piper, and optionally Music Assistant on your main server. Home Assistant connects to those services via Wyoming. Your microphone hardware just needs to be on the same LAN.
Hardware: Pick Your Mic
You’ve got a few good options depending on your budget and how DIY you want to go:
ESP32 Voice PE / Atom Echo: Nabu Casa’s Home Assistant Voice Preview Edition (Voice PE) is probably the easiest entry point. It’s an ESP32-S3 with a dual-mic array and a speaker, and it runs ESPHome, so it reaches Assist through HA’s ESPHome integration rather than through a separate Wyoming satellite. Flash it, point it at your HA instance, done. M5Stack’s Atom Echo is smaller and cheaper but has a single mono mic: fine for quiet rooms, annoying in a kitchen.
Respeaker USB Mic Array — Seeed’s Respeaker 4-Mic Array is USB, works on any Pi or x86 box running wyoming-satellite. Good directional capture. If you already have a Pi sitting somewhere, this is the lazy path.
Wyoming Satellite on a Pi — For a custom enclosure or existing Pi 3B+/4/5, rhasspy/wyoming-satellite runs as a Python service, takes a USB or I2S mic as input, and connects back to your HA instance. A Pi Zero 2W + Respeaker 2-Mic HAT fits in an Altoids tin if that’s your thing.
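The Pi 5 + Google Coral USB accelerator combination is worth calling out specifically: if Whisper inference can be offloaded to the Coral’s TPU, latency on the tiny model drops to under 200ms. One caveat: the stock rhasspy/wyoming-whisper image runs on CPU (or on CUDA with --device cuda) and does not use the Coral out of the box, so this path needs a Whisper build that targets the Edge TPU. Overkill for most people, genuinely satisfying if you care about response times.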
The Compose Stack
Here’s the full stack. Adjust the Whisper model to your hardware, keeping WHISPER_MODEL and the --model flag in sync: tiny.en runs on anything, base.en is the sweet spot for accuracy vs. speed on modern CPUs, small.en if you have a GPU or a Pi 5 + Coral.
```yaml
services:
  whisper:
    image: rhasspy/wyoming-whisper
    container_name: wyoming-whisper
    restart: unless-stopped
    ports:
      - "10300:10300"
    volumes:
      - whisper_data:/data
    environment:
      - WHISPER_MODEL=base.en
      - WHISPER_LANGUAGE=en
    command: --uri tcp://0.0.0.0:10300 --model base.en --language en

  piper:
    image: rhasspy/wyoming-piper
    container_name: wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - piper_data:/data
    command: --uri tcp://0.0.0.0:10200 --voice en_US-lessac-medium

  openwakeword:
    image: rhasspy/wyoming-openwakeword
    container_name: wyoming-openwakeword
    restart: unless-stopped
    ports:
      - "10400:10400"
    volumes:
      - oww_data:/data
    command: --uri tcp://0.0.0.0:10400 --preload-model ok_nabu

  music-assistant:
    image: ghcr.io/music-assistant/server:latest
    container_name: music-assistant
    restart: unless-stopped
    network_mode: host
    volumes:
      - music_data:/data
    privileged: true

volumes:
  whisper_data:
  piper_data:
  oww_data:
  music_data:
```

A few notes on that config:
- en_US-lessac-medium is the best-sounding Piper voice for English without going huge. The high quality variant sounds better but takes longer to synthesize; on a Pi 4 you’ll feel it. Swap to en_US-ryan-high if you’re on a real CPU box.
- ok_nabu is Home Assistant’s built-in wake word. It’s… fine. You’ll train a custom one later (covered below).
- Music Assistant gets network_mode: host because it needs mDNS and Chromecast discovery. Yes, it’s ugly. Yes, it’s necessary.
Spin it up:
```shell
docker compose up -d
docker compose logs -f whisper  # watch for model download
```

First boot downloads the Whisper model; base.en is about 140MB. Piper voices download lazily on first use.
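Once the containers are up, it helps to confirm that each Wyoming port actually accepts connections before you touch Home Assistant. The sketch below only tests TCP reachability; it does not speak the protocol. The port numbers come from the Compose file, while the function names and the host you pass in are placeholders of mine.

```python
import socket

# Ports from the Compose file; service names are just labels for the report.
SERVICES = {"whisper": 10300, "piper": 10200, "openwakeword": 10400}


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_stack(host: str) -> dict:
    """Probe every service port on one host and report which ones answer."""
    return {name: is_listening(host, port) for name, port in SERVICES.items()}
```

Calling `check_stack("192.168.1.50")` (substitute your server’s address) returns a dict of service name to True/False. A False for Whisper right after first boot usually just means the model is still downloading.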
Wiring It Into Home Assistant
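Once your services are running, register each one in HA under Settings → Devices & Services → Add Integration → Wyoming Protocol, entering the host and port. Then build the pipeline under Settings → Voice Assistants → Add Assistant.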
```yaml
# configuration.yaml
# The Wyoming services themselves need no YAML; add them through the UI
# if autodiscovery doesn't catch them.

# Optional: save pipeline audio to disk while debugging
assist_pipeline:
  debug_recording_dir: /tmp/assist_debug
```

In the UI: Settings → Voice Assistants → Create pipeline. Select:
- Speech-to-text: Whisper (your local instance at <host>:10300)
- Text-to-speech: Piper (<host>:10200)
- Wake word: openWakeWord (<host>:10400)
- Conversation agent: Home Assistant (built-in) or your LLM agent
The “Conversation agent” is where the intent matching happens. Home Assistant’s built-in Assist handles a solid set of commands — lights, switches, covers, climate, media — without needing any LLM at all. It’s fast, deterministic, and works offline. If you want free-form queries (“what’s the energy usage this week?” or “is anyone home?”), wire in a local LLM via the Ollama integration instead.
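To see why template-based matching is fast and deterministic, here is a toy version. This is not how Home Assistant implements it (HA uses a much richer sentence-template system); the two regex patterns and the `match_intent` helper are invented for illustration, and only the intent names HassTurnOn and HassTurnOff are borrowed from HA.

```python
import re
from typing import Optional

# Hypothetical patterns: a real template set covers far more phrasings.
PATTERNS = {
    "HassTurnOn": re.compile(r"^turn on (?:the )?(?P<name>.+)$"),
    "HassTurnOff": re.compile(r"^turn off (?:the )?(?P<name>.+)$"),
}


def match_intent(text: str) -> Optional[dict]:
    """Return the first matching intent and target name, or None if nothing fits."""
    normalized = text.lower().strip().rstrip(".!?")
    for intent, pattern in PATTERNS.items():
        match = pattern.match(normalized)
        if match:
            return {"intent": intent, "name": match.group("name")}
    return None


command = match_intent("Turn off the bedroom light.")
question = match_intent("What's the energy usage this week?")
```

The command resolves to an intent with no model in the loop, while the free-form question falls through to `None`. That fall-through case is the one an LLM conversation agent is meant to pick up.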
Custom Wake Words with openWakeWord
ok_nabu is recognizable but not great. openWakeWord lets you train a custom wake word in a Google Colab notebook (ironic, but free). The notebook generates synthetic training clips with a TTS model, so you don’t need to record hours of your own voice.
The trained model drops in as a .tflite file:
```yaml
openwakeword:
  image: rhasspy/wyoming-openwakeword
  volumes:
    - oww_data:/data
    - ./custom_wakewords:/custom  # mount your .tflite files here
  command: >
    --uri tcp://0.0.0.0:10400
    --custom-model-dir /custom
    --preload-model hey_jarvis
```

Common community-trained models: hey_jarvis, alexa (yes, you can repurpose Alexa’s wake word for your local stack, petty but satisfying), hey_mycroft. Check the openWakeWord GitHub for the model zoo.
Accuracy tuning: if you’re getting false positives, increase --threshold (default 0.5, try 0.7). If it misses triggers, lower it. The sweet spot depends on your room acoustics and how much HVAC noise you’re dealing with.
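The threshold trade-off is easier to reason about with numbers. The sketch below counts false positives and missed triggers at a given threshold over a list of detection scores. The scores are made up for illustration, and I’m assuming a detection fires when the score is at or above the threshold; check openWakeWord itself for the exact comparison it uses.

```python
from typing import List, Tuple


def evaluate_threshold(samples: List[Tuple[float, bool]],
                       threshold: float) -> Tuple[int, int]:
    """Return (false_positives, misses) for (score, was_wake_word) samples."""
    false_positives = sum(
        1 for score, is_wake in samples if score >= threshold and not is_wake
    )
    misses = sum(
        1 for score, is_wake in samples if score < threshold and is_wake
    )
    return false_positives, misses


# Hypothetical scores: True means someone really said the wake word.
SAMPLES = [
    (0.92, True), (0.81, True), (0.66, True),     # real triggers
    (0.61, False), (0.55, False), (0.30, False),  # TV, HVAC, conversation
]

at_default = evaluate_threshold(SAMPLES, 0.5)
at_strict = evaluate_threshold(SAMPLES, 0.7)
```

With this invented data, the default of 0.5 lets two noise events through, while 0.7 removes them at the cost of one missed trigger. Your own room will produce different numbers, which is why the tuning is empirical.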
Latency: What to Actually Expect
On a modern CPU box (Ryzen 5 or better, 2020+):
- Wake word detection: ~100ms
- Whisper base.en transcription: ~200-400ms for a 2-3 second utterance
- HA intent matching: ~50ms
- Piper TTS synthesis: ~150-300ms
- Total round-trip: ~500-800ms — comparable to Alexa on a good day
On a Pi 4 (4GB):
- Whisper tiny.en: ~400-600ms
- Piper medium: ~400-500ms
- Total: 1-2 seconds — usable, occasionally annoying
On a Pi 5 + Google Coral USB:
- Whisper runs on TPU: ~150ms even on base.en
- Piper still on CPU: ~300ms
- Total: ~500ms — fast enough that it feels snappy
GPU box (RTX 3060 or better):
- Enable --device cuda on the Whisper container
- Transcription drops to ~50-100ms
- Total: under 300ms — you’ll forget it’s local
If you’re on a Pi 4 and the latency is killing you, drop to tiny.en. Accuracy takes a hit on accented speech and mumbling, but for home automation commands (“turn off the bedroom light”) it’s totally fine. The vocabulary of home control is small and well-constrained.
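If you want to see where the time goes on your own box, it helps to treat the pipeline as a budget and add up the stages. A small sketch using the modern-CPU figures quoted above; the stage names and the `total_latency` helper are mine, and the numbers are the article’s rough estimates, not measurements.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (best case ms, worst case ms)

# Per-stage estimates for a modern CPU box, taken from the list above.
CPU_BOX: Dict[str, Range] = {
    "wake_word": (100, 100),
    "whisper_base_en": (200, 400),
    "intent_matching": (50, 50),
    "piper_tts": (150, 300),
}


def total_latency(stages: Dict[str, Range]) -> Range:
    """Sum best-case and worst-case latency across sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


cpu_total = total_latency(CPU_BOX)
```

The stages run one after another, so the totals are plain sums: 500ms best case and 850ms worst case, in line with the ~500-800ms round trip quoted for this hardware. Swap in your own measurements to find which stage is worth optimizing first.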
Multilingual Support
Whisper is genuinely multilingual — it handles 99 languages including decent support for Spanish, French, German, Portuguese, Japanese, and Mandarin. Switch the model from base.en to base (the multilingual variant) and set the language explicitly:
```yaml
environment:
  - WHISPER_MODEL=base
  - WHISPER_LANGUAGE=es  # ISO 639-1 language code
command: --uri tcp://0.0.0.0:10300 --model base --language es
```

For Piper, swap the voice model to the appropriate language. The Piper voice list covers 40+ languages; check rhasspy.github.io/piper-samples for samples before committing to a voice. Quality varies significantly by language; English and German have the best models right now.
Home Assistant Assist’s intent matching is language-aware too — set your assistant’s language in the pipeline config and it’ll route to the right intent parser.
Rhasspy vs. Wyoming: What Happened?
If you’ve been in the home automation space for a while, you remember Rhasspy — the original fully-offline voice assistant platform. It’s still alive (v2.5.x), still works, and still has a dedicated community. But Wyoming is where active development is happening.
The difference in practice: Rhasspy is a monolithic application that packages STT, TTS, NLU, and wake word detection together. Wyoming is a protocol — a set of socket interfaces that let you mix and match services however you want. Whisper is a Wyoming service. Piper is a Wyoming service. Your wake word detector is a Wyoming service. Home Assistant speaks Wyoming natively.
If you have a working Rhasspy setup, there’s no urgency to migrate. But for new installs in 2026, Wyoming is the modern path. It’s more modular, better maintained, and HA integration is first-class.
The Part Where Amazon Doesn’t Get Your Grocery List
Honestly, this is the thing that sold me. Every voice command you issue stays on your hardware. Your shopping list, your medication reminders, your morning briefings — none of it leaves your LAN. There’s no “skill” to enable, no premium tier, no sudden “we’re changing the API” deprecation notice at 2 AM.
The trade-off is real: setup takes an afternoon instead of five minutes, and you’ll spend some time tuning wake word sensitivity and Piper voice settings. But once it’s running, it just works. And when something breaks, it breaks locally where you can actually debug it.
Your Alexa has been listening. Time to evict it.
Quick Reference
| Hardware | Recommended Model | Expected Latency |
|---|---|---|
| Pi 4 (4GB) | tiny.en | 1–2s |
| Pi 5 + Coral USB | base.en | ~500ms |
| x86 CPU (modern) | base.en | 500–800ms |
| x86 + NVIDIA GPU | small.en | <300ms |
Key ports:
- Whisper STT: 10300
- Piper TTS: 10200
- openWakeWord: 10400
- Music Assistant: 8095 (web UI)
For the full Compose stack, configuration examples, and custom wake word training notes, grab the files from the examples repo linked at the top of this article. Your 2 AM self will appreciate having everything in one place.