Local Voice Assistant: Whisper + Piper + Home Assistant

Your Smart Home Is Phoning Home. Let’s Fix That.

Every time you say “Hey Alexa, turn off the living room lights,” that audio clip takes a round trip to Amazon’s servers, gets processed, gets logged, gets used to train models, and then, eventually, your lights turn off. Cool feature. Bad deal.

Google Home does the same thing. Apple HomeKit is better about privacy but still leans on iCloud for a lot of the heavy lifting. The dirty secret of “smart home” assistants is that the smart part lives somewhere else, and you’re just renting access to it.

The tools to run this entirely yourself have been production-ready for a couple of years now, and in 2026 they’re genuinely good. We’re talking sub-second response times on modest hardware, natural-sounding TTS voices, and tight Home Assistant integration. No cloud. No subscription. No data leaking out of your LAN.

The stack: Wyoming protocol to wire everything together, Whisper for speech-to-text, Piper for text-to-speech, and Home Assistant Assist (with optional small LLM) for the brains. Let’s build it.

Full example: Clone the working Compose files at github.com/KingPin/sumguy-examples/self-hosting/local-voice-assistant-whisper-piper-ha/

The Architecture: Wyoming Is the Glue

If you’ve seen the older Whisper STT and Piper TTS articles here, you already have the individual pieces. This is the integration story: how they talk to Home Assistant and to your microphone hardware.

Wyoming is a simple TCP-based protocol that Home Assistant uses to talk to satellite services. Each service exposes a socket, Home Assistant connects to it, and they exchange audio frames and transcription events. That’s it. No message queue, no REST API, no ceremony. It’s the “just pipe it over TCP” solution that the community actually rallied around after Rhasspy got long in the tooth.
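To make "audio frames and events over a socket" a bit more concrete, here is a minimal sketch of how a Wyoming event could be framed, based on my reading of the protocol README: one JSON header line, optionally followed by extra JSON data and a binary payload. The function names encode_event and decode_event are made up for this example, and the real wyoming Python package may serialize details differently, so treat this as an illustration rather than a reference implementation.

```python
import io
import json
from typing import Optional, Tuple


def encode_event(event_type: str, data: Optional[dict] = None, payload: bytes = b"") -> bytes:
    """Serialize one event: a JSON header line, then the raw payload (if any)."""
    header = {"type": event_type}
    if data:
        header["data"] = data  # small metadata kept inline in the header
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload


def decode_event(stream) -> Tuple[str, dict, bytes]:
    """Read one event from a binary file-like object."""
    line = stream.readline()
    if not line:
        raise EOFError("stream closed before an event header arrived")
    header = json.loads(line)
    data = dict(header.get("data") or {})
    # Assumption: extra JSON data may follow the header when data_length is set.
    data_length = header.get("data_length") or 0
    if data_length:
        data.update(json.loads(stream.read(data_length)))
    payload_length = header.get("payload_length") or 0
    payload = stream.read(payload_length) if payload_length else b""
    return header["type"], data, payload


# Round trip: one chunk of 16 kHz, 16-bit mono audio (four silent samples).
raw = encode_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x00" * 4)
event_type, data, payload = decode_event(io.BytesIO(raw))
print(event_type, data, len(payload))
```

Against a live service you would wrap the TCP socket with sock.makefile("rb") and pass that to decode_event; the io.BytesIO here only stands in for the network.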

The full data flow looks like this:

ESP32 / Atom Echo / Respeaker mic
         ↓  (audio stream)
Wyoming Satellite (on the device or on a Pi)
         ↓  (Wyoming protocol over TCP)
Whisper STT server  →  transcribed text
         ↓
Home Assistant Assist  →  intent matching / LLM
         ↓
Piper TTS server  →  audio response
         ↓
Wyoming Satellite  →  speaker output

The Compose stack runs Whisper, Piper, openWakeWord, and optionally Music Assistant on your main server. Home Assistant connects to those services via Wyoming. Your microphone hardware just needs to be on the same LAN.

Hardware: Pick Your Mic

You’ve got a few good options depending on your budget and how DIY you want to go:

Voice PE / Atom Echo: Nabu Casa's Home Assistant Voice Preview Edition (Voice PE) is probably the easiest entry point. It's an ESP32-S3 with a dual-mic array and a speaker, runs ESPHome, and plugs into Assist through Home Assistant's ESPHome integration. Set it up, point it at your HA server, done. M5Stack's Atom Echo is smaller and cheaper but has a single mic: fine for quiet rooms, annoying in a kitchen.

Respeaker USB Mic Array: Seeed’s Respeaker 4-Mic Array is USB, works on any Pi or x86 box running wyoming-satellite. Good directional capture. If you already have a Pi sitting somewhere, this is the lazy path.

Wyoming Satellite on a Pi: For a custom enclosure or existing Pi 3B+/4/5, rhasspy/wyoming-satellite runs as a Python service, takes a USB or I2S mic as input, and connects back to your HA instance. A Pi Zero 2W + Respeaker 2-Mic HAT fits in an Altoids tin if that’s your thing.

One myth worth killing: a Google Coral USB accelerator does not speed up Whisper. The Coral’s Edge TPU only runs 8-bit quantized TFLite models (think object detection), and Whisper’s transformer doesn’t compile to it in any usable way. If you want fast local STT, the lever is a real CPU or an NVIDIA GPU, not a Coral. A Pi 5 on its own is plenty quick for base.en; save the Coral for Frigate.

The Compose Stack

Here’s the full stack. Adjust the model (WHISPER_MODEL and the matching --model flag) based on your hardware: tiny.en runs on anything, base.en is the sweet spot for accuracy vs. speed on modern CPUs, and small.en is worth it if you have a GPU.

services:
  whisper:
    image: rhasspy/wyoming-whisper
    container_name: wyoming-whisper
    restart: unless-stopped
    ports:
      - "10300:10300"
    volumes:
      - whisper_data:/data
    environment:
      - WHISPER_MODEL=base.en
      - WHISPER_LANGUAGE=en
    command: --uri tcp://0.0.0.0:10300 --model base.en --language en

  piper:
    image: rhasspy/wyoming-piper
    container_name: wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - piper_data:/data
    command: --uri tcp://0.0.0.0:10200 --voice en_US-lessac-medium

  openwakeword:
    image: rhasspy/wyoming-openwakeword
    container_name: wyoming-openwakeword
    restart: unless-stopped
    ports:
      - "10400:10400"
    volumes:
      - oww_data:/data
    command: --uri tcp://0.0.0.0:10400 --preload-model ok_nabu

  music-assistant:
    image: ghcr.io/music-assistant/server:latest
    container_name: music-assistant
    restart: unless-stopped
    network_mode: host
    volumes:
      - music_data:/data
    privileged: true

volumes:
  whisper_data:
  piper_data:
  oww_data:
  music_data:

A few notes on that config:

en_US-lessac-medium is the best-sounding Piper voice for English without going huge. The high quality variant sounds better but takes longer to synthesize; on a Pi 4 you’ll feel it. Swap to en_US-ryan-high if you’re on a real CPU box.
ok_nabu is Home Assistant’s built-in wake word. It’s… fine. You’ll train a custom one later (covered below).
Music Assistant gets network_mode: host because it needs mDNS and Chromecast discovery. Yes, it’s ugly. Yes, it’s necessary.

Spin it up:

docker compose up -d
docker compose logs -f whisper  # watch for model download

First boot downloads the Whisper model; base.en is about 140MB. Piper voices download lazily on first use.
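Before touching Home Assistant, it helps to confirm the three Wyoming ports are actually listening. The sketch below is a plain TCP connect check using the ports from the Compose file above; the helper names port_open and check_stack are mine, and a successful connect only proves something is listening, not that the model has finished loading.

```python
import socket

# Ports taken from the Compose file above.
SERVICES = {"whisper": 10300, "piper": 10200, "openwakeword": 10400}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host: str, services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepted a connection."""
    return {name: port_open(host, port) for name, port in services.items()}


# Usage (replace the address with your Docker host):
#   for name, ok in check_stack("192.168.1.50").items():
#       print(f"{name:14s} {'up' if ok else 'DOWN'}")
```

If Whisper shows as down right after the first boot, give it a minute and check the logs: the model download has to finish before the server starts accepting connections.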

Wiring It Into Home Assistant

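Once your services are running, add any that HA didn't discover on its own under Settings → Devices & Services → Add Integration → Wyoming Protocol (host plus port), then set up the assistant under Settings → Voice Assistants.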

# Explicit Wyoming service config if autodiscovery doesn't catch it
# Usually not needed — HA finds them automatically on the local network

# Optional: record pipeline audio to disk for debugging
assist_pipeline:
  debug_recording_dir: /tmp/assist_debug

In the UI: Settings → Voice Assistants → Create pipeline. Select:

Speech-to-text: Whisper (your local instance at <host>:10300)
Text-to-speech: Piper (<host>:10200)
Wake word: openWakeWord (<host>:10400)
Conversation agent: Home Assistant (built-in) or your LLM agent

The “Conversation agent” is where the intent matching happens. Home Assistant’s built-in Assist handles a solid set of commands (lights, switches, covers, climate, media) without needing any LLM at all. It’s fast, deterministic, and works offline. If you want free-form queries (“what’s the energy usage this week?” or “is anyone home?”), wire in a local LLM via the Ollama integration instead.
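You can poke the conversation agent without any microphone in the loop, which is handy for separating "Whisper misheard me" from "Assist didn't understand the sentence." This sketch posts text to Home Assistant's REST conversation endpoint (/api/conversation/process). The base URL and the long-lived access token are placeholders you supply, the helper names are hypothetical, and the exact response layout may vary between HA versions.

```python
import json
import urllib.request


def build_conversation_request(base_url: str, token: str, text: str,
                               language: str = "en") -> urllib.request.Request:
    """Build the POST request that hands one sentence to the conversation agent."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/conversation/process",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # long-lived token from your HA profile
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask_assist(base_url: str, token: str, text: str,
               language: str = "en", timeout: float = 10.0) -> dict:
    """Send the sentence and return the parsed JSON response."""
    request = build_conversation_request(base_url, token, text, language)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (placeholders, not real values):
#   result = ask_assist("http://homeassistant.local:8123", "YOUR_TOKEN",
#                       "turn off the bedroom light")
#   print(json.dumps(result, indent=2))
```

If the typed sentence works here but the spoken one doesn't, the problem is upstream in the microphone or STT, not in intent matching.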

Custom Wake Words with openWakeWord

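ok_nabu is recognizable but not great. openWakeWord lets you train a custom wake word in a Google Colab notebook (ironic, but free). The notebook generates synthetic speech samples of your phrase, so you type the wake word instead of recording hours of your own voice.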

The trained model drops in as a .tflite file:

  openwakeword:
    image: rhasspy/wyoming-openwakeword
    volumes:
      - oww_data:/data
      - ./custom_wakewords:/custom  # mount your .tflite files here
    command: >
      --uri tcp://0.0.0.0:10400
      --custom-model-dir /custom
      --preload-model hey_jarvis

Common pre-trained models: hey_jarvis, alexa (yes, you can repurpose Alexa’s wake word for your local stack, petty but satisfying), hey_mycroft. Check the openWakeWord GitHub for the model zoo.

Accuracy tuning: if you’re getting false positives, increase --threshold (default 0.5, try 0.7). If it misses triggers, lower it. The sweet spot depends on your room acoustics and how much HVAC noise you’re dealing with.
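The threshold trade-off is easier to see with numbers. The sketch below uses invented scores (they are not measurements from openWakeWord) and a simplified rule where any score at or above the threshold counts as a detection; the real detector applies additional logic, so read this as an illustration of the trade-off only.

```python
def count_detections(scores, threshold):
    """Count how many scores would trigger at the given threshold."""
    return sum(1 for score in scores if score >= threshold)


# Hypothetical scores, made up for illustration.
noise_scores = [0.12, 0.55, 0.31, 0.62, 0.08]  # TV chatter, HVAC, clattering dishes
wake_scores = [0.91, 0.74, 0.66, 0.88]         # someone actually saying the wake word


def evaluate(threshold):
    """Return (false_positives, misses) for the sample scores above."""
    false_positives = count_detections(noise_scores, threshold)
    misses = len(wake_scores) - count_detections(wake_scores, threshold)
    return false_positives, misses


for threshold in (0.5, 0.7):
    false_positives, misses = evaluate(threshold)
    print(f"threshold={threshold}: {false_positives} false positives, {misses} missed")
```

With these sample numbers, 0.5 lets two noise events through and misses nothing, while 0.7 silences the noise but drops one quiet wake phrase. That is the dial you are turning.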

Latency: What to Actually Expect

On a modern CPU box (Ryzen 5 or better, 2020+):

Wake word detection: ~100ms
Whisper base.en transcription: ~200-400ms for a 2-3 second utterance
HA intent matching: ~50ms
Piper TTS synthesis: ~150-300ms
Total round-trip: ~500-800ms, comparable to Alexa on a good day
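As a sanity check on that total, here is the arithmetic for the modern-CPU numbers above. It assumes the stages run strictly one after another with nothing overlapping, and it ignores network hops and audio buffering, so real totals will wander a little.

```python
# (low, high) milliseconds per stage, copied from the modern-CPU list above.
STAGES_MS = {
    "wake word detection": (100, 100),
    "whisper base.en": (200, 400),
    "intent matching": (50, 50),
    "piper synthesis": (150, 300),
}


def total_budget(stages):
    """Sum the best-case and worst-case latency across sequential stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total_budget(STAGES_MS)
print(f"round trip: {low}-{high} ms")  # round trip: 500-850 ms
```

The worst case lands slightly above the rounded figure quoted in the list, and Whisper plus Piper account for most of the spread. Those are the two stages worth tuning first.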

On a Pi 4 (4GB):

Whisper tiny.en: ~400-600ms
Piper medium: ~400-500ms
Total: 1-2 seconds, usable, occasionally annoying

On a Pi 5 (CPU only):

Whisper base.en: ~300-500ms
Piper medium: ~300-400ms
Total: ~700ms-1s, noticeably snappier than a Pi 4, no accelerator needed

GPU box (RTX 3060 or better):

Pass --device cuda to the Whisper container and give it GPU access (NVIDIA Container Toolkit on the host)
Transcription drops to ~50-100ms
Total: under 300ms, you’ll forget it’s local

If you’re on a Pi 4 and the latency is killing you, drop to tiny.en. Accuracy takes a hit on accented speech and mumbling, but for home automation commands (“turn off the bedroom light”) it’s totally fine. The vocabulary of home control is small and well-constrained.

Multilingual Support

Whisper is genuinely multilingual, handling 99 languages including decent support for Spanish, French, German, Portuguese, Japanese, and Mandarin. Switch the model from base.en to base (the multilingual variant) and set the language explicitly:

    environment:
      - WHISPER_MODEL=base
      - WHISPER_LANGUAGE=es  # ISO 639-1 language code
    command: --uri tcp://0.0.0.0:10300 --model base --language es

For Piper, swap the voice model to the appropriate language. The Piper voice list covers 40+ languages. Check rhasspy.github.io/piper-samples for samples before committing to a voice. Quality varies significantly by language; English and German have the best models right now.

Home Assistant Assist’s intent matching is language-aware too; set your assistant’s language in the pipeline config and it’ll route to the right intent parser.

Rhasspy vs. Wyoming: What Happened?

If you’ve been in the home automation space for a while, you remember Rhasspy, the original fully-offline voice assistant platform. It’s still alive (v2.5.x), still works, and still has a dedicated community. But Wyoming is where active development is happening.

The difference in practice: Rhasspy is a monolithic application that packages STT, TTS, NLU, and wake word detection together. Wyoming is a protocol: a set of socket interfaces that let you mix and match services however you want. Whisper is a Wyoming service. Piper is a Wyoming service. Your wake word detector is a Wyoming service. Home Assistant speaks Wyoming natively.

If you have a working Rhasspy setup, there’s no urgency to migrate. But for new installs in 2026, Wyoming is the modern path. It’s more modular, better maintained, and HA integration is first-class.

The Part Where Amazon Doesn’t Get Your Grocery List

Honestly, this is the thing that sold me. Every voice command you issue stays on your hardware. Your shopping list, your medication reminders, your morning briefings. None of it leaves your LAN. There’s no “skill” to enable, no premium tier, no sudden “we’re changing the API” deprecation notice at 2 AM.

The trade-off is real: setup takes an afternoon instead of five minutes, and you’ll spend some time tuning wake word sensitivity and Piper voice settings. But once it’s running, it just works. And when something breaks, it breaks locally where you can actually debug it.

Your Alexa has been listening. Time to evict it.

Quick Reference

Hardware	Recommended Model	Expected Latency
Pi 4 (4GB)	tiny.en	1 to 2s
Pi 5 (CPU)	base.en	700ms to 1s
x86 CPU (modern)	base.en	500 to 800ms
x86 + NVIDIA GPU	small.en	<300ms

Key ports:

Whisper STT: 10300
Piper TTS: 10200
openWakeWord: 10400
Music Assistant: 8095 (web UI)

For the full Compose stack, configuration examples, and custom wake word training notes, grab the files from the examples repo linked at the top of this article. Your 2 AM self will appreciate having everything in one place.
