My self-hosted TextToSpeech setup
For a while, my text-to-speech setup was a simple keyboard shortcut bound to `xclip -o | espeak-ng`. I mostly use it in-browser: highlight some text, hit the shortcut, and hear it spoken.
I had been meaning to update it, get more modern voices, and make it easier to set up in heterogeneous environments (a fancy way of saying I struggle to get reliable hotkey actions on my work MacBook).
After setting up a voice assistant on Home Assistant using piper, and being impressed with the quality of the voices, I decided to give piper a go.
New Setup
I am now using a simple Firefox addon I built to grab the highlighted text and send it to an API wrapper I built around the piper binary.
Firefox Addon
The addon simply adds a keyboard shortcut and a context menu entry. When triggered, the addon's background script sends the highlighted text (along with the saved voice and speech-speed preferences) to the TTS API through a `POST` request to the `/api/stream` endpoint.
The API instantly returns a unique URL for the stream. The addon's content script then injects an audio player pointing to the returned URL into the current page and starts playback.
I initially tried relying only on the background script to play the speech, but audio playback would get cut off after a few seconds (it seems background scripts are not meant to run that long). Combining the background script with a content script turned out to be a better option, with the injected UI giving the user some control over playback.
Piper API Wrapper
The `/api/stream` endpoint accepts `POST` requests with playback options (voice, speed, …) and the text. It assigns a `UUID` to the request, saves it in memory, and returns a URL referencing this stream `UUID`.
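The gopipertts internals aren't shown here, but the `POST` side could be sketched roughly as below in Go. The JSON field names, the in-memory map, and the `newStreamID` helper are all assumptions for illustration, not the actual gopipertts API:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"net/http"
	"sync"
)

// streamRequest mirrors the JSON body the addon POSTs.
// Field names are assumptions; the real API may differ.
type streamRequest struct {
	Text  string  `json:"text"`
	Voice string  `json:"voice"`
	Speed float64 `json:"speed"`
}

var (
	mu      sync.Mutex
	streams = map[string]streamRequest{} // in-memory cache, keyed by stream ID
)

// newStreamID returns a random 16-byte hex ID, standing in for the UUID.
func newStreamID() string {
	b := make([]byte, 16)
	rand.Read(b) // crypto/rand; error ignored in this sketch
	return hex.EncodeToString(b)
}

// handleStream decodes the request, caches the options in memory,
// and immediately returns a URL referencing the new stream ID.
func handleStream(w http.ResponseWriter, r *http.Request) {
	var req streamRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	id := newStreamID()
	mu.Lock()
	streams[id] = req
	mu.Unlock()
	json.NewEncoder(w).Encode(map[string]string{"url": "/api/stream/" + id})
}
```

No audio work happens here at all; the handler only stores the options and hands back a URL, so the response is effectively instant.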
When a stream `UUID` URL is accessed, the options are fetched from the cache and passed to the `piper` binary to start generating the raw audio.
Streaming-compatible WAV headers are sent back to the client first, followed by raw audio chunks as soon as `piper` starts generating them.
Having the `/api/stream` endpoint return a unique stream URL lets us easily play the audio in the browser using an `<audio>` tag (which performs a `GET` request).
I initially opted for an HTTP-to-Wyoming bridge. This worked fine, but Wyoming was an unnecessary overhead, and it waited for the full text to be processed before returning any audio, resulting in long delays (especially running this on a Raspberry Pi) before the TTS started speaking.
Calling the piper binary directly from my wrapper allowed me to pass the `--output-raw` flag and start getting audio data much earlier, streaming it back to the client without waiting for a potentially large text to be fully processed.
Final thoughts
Even with streaming, speech still starts slightly later with piper than with the local `xclip`-to-`espeak` pipeline. YMMV if you use a beefier box than mine.
Overall, I like the in-browser integration, and the HTML interface on gopipertts when I need to quickly TTS something from a non-browser source.
The piper voices are a real improvement over what `espeak` provides… in English.
French is a different story… or perhaps this is what we sound like when saying “tartine”?