My self-hosted TextToSpeech setup
For a while, my text-to-speech setup was a simple keyboard shortcut bound to `xclip -o | espeak-ng`. I mostly use it in-browser: highlight some text, hit the shortcut, and hear it spoken.
I had been meaning to update it, get more modern voices, and make it easier to set up in heterogeneous environments (a fancy way of saying I struggle to get reliable hotkey actions on my work MacBook).
After setting up a voice assistant on Home Assistant using piper, and being impressed with the quality of the voices, I decided to give piper a go.
New Setup
I am now using a simple Firefox addon I built to grab the highlighted text and send it to an API wrapper I built around the piper binary.
Firefox Addon
The addon simply adds a keyboard shortcut and a context menu entry. When triggered, the addon's background script sends the highlighted text (along with the saved voice and speech-speed preferences) to the TTS API through a `POST` request to the `/api/stream` endpoint.
The API instantly returns a unique URL for the stream. The addon's content script then injects an audio player pointing to the returned URL into the current page and starts playback.
I initially tried relying only on the background script to play the speech, but audio playback would get cut off after a few seconds (it seems background scripts are not meant to run that long). Combining the background script with a content script turned out to be a better option, with the injected UI giving the user some control over playback.
Piper API Wrapper
The `/api/stream` endpoint accepts `POST` requests with playback options (voice, speed, …) and the text. It assigns a `UUID` to the request, saves it in memory, and returns a URL referencing this stream `UUID`.
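The gopipertts internals aren't shown here, but the `POST` side could be sketched roughly as below in Go. The JSON field names, the in-memory map, and the `newStreamID` helper are all assumptions for illustration, not the actual gopipertts API:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"net/http"
	"sync"
)

// streamRequest mirrors the JSON body the addon POSTs.
// Field names are assumptions; the real API may differ.
type streamRequest struct {
	Text  string  `json:"text"`
	Voice string  `json:"voice"`
	Speed float64 `json:"speed"`
}

var (
	mu      sync.Mutex
	streams = map[string]streamRequest{} // in-memory cache, keyed by stream ID
)

// newStreamID returns a random 16-byte hex ID, standing in for the UUID.
func newStreamID() string {
	b := make([]byte, 16)
	rand.Read(b) // crypto/rand; error ignored in this sketch
	return hex.EncodeToString(b)
}

// handleStream decodes the request, caches the options in memory,
// and immediately returns a URL referencing the new stream ID.
func handleStream(w http.ResponseWriter, r *http.Request) {
	var req streamRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	id := newStreamID()
	mu.Lock()
	streams[id] = req
	mu.Unlock()
	json.NewEncoder(w).Encode(map[string]string{"url": "/api/stream/" + id})
}
```

No audio work happens here at all; the handler only stores the options and hands back a URL, so the response is effectively instant.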
When a stream `UUID` URL is accessed, the options are fetched from the cache and passed to the `piper` binary to start generating the raw audio.
Streaming-compatible WAV headers are sent back to the client first, followed by raw audio chunks as soon as `piper` starts generating them.
Having the `/api/stream` endpoint return a unique stream URL lets us easily play the audio in the browser using an `<audio>` tag (which performs a `GET` request).
I initially opted for an HTTP-to-Wyoming bridge. This worked fine, but Wyoming was an unnecessary overhead, and it waited for the full text to be processed before returning any audio, resulting in long delays (especially running this on a Raspberry Pi) before the TTS started speaking.
Calling the piper binary directly from my wrapper allowed me to pass the `--output-raw` flag and start getting audio data much earlier, streaming it back to the client without waiting for a potentially large text to be fully processed.
Final thoughts
Even with streaming, speech still starts slightly later with piper than with the local `xclip`-to-`espeak` pipeline. YMMV if you use a beefier box than mine.
Overall, I like the in-browser integration, and the HTML interface on gopipertts when I need to quickly TTS something from a non-browser source.
The piper voices are a real improvement over what `espeak` provides… in English.
French is a different story… or perhaps this is what we sound like when saying “tartine”?