Piper TTS

Text to Speech on Linux that doesn't suck

tl;dr

Speech synthesis on Linux has sucked for a long time, but pretrained models are making pleasant-sounding text-to-speech (TTS) available on Linux (with a bit of manual configuration). I'm posting my config so you don't need to waste hours googling for it like I did.

Who this is for: You run Linux and want speech synthesis that doesn't suck

Why I'm writing this: To save you the trouble of hours of fruitless google searching

TTS on Linux

Text to Speech (TTS) is an important part of accessibility, and most OSes have some kind of TTS built in. For a long time, Linux has been lagging behind Windows and MacOS in this respect. For as long as I've been using Linux, the default has been speech-dispatcher with the espeak engine, which sounds... very bad. Like, this bad.

Although I can use a computer without a screen reader, I like having my computer read longform articles to me --- particularly at the end of the day when my eyes are tired from a full day of staring at screens. Life-tip: Firefox's built-in reader-mode can read the content of any webpage using whatever TTS is built into your OS.

For a long time, the default TTS for Linux (espeak) has sounded like it was built in the 90s --- because it was. Espeak has the advantage of being lightweight and fast on very old machines, but honestly hurts to listen to. Alternative Linux TTS solutions (Festival, Flite, Mary TTS, Mimic, etc.) sound marginally better, but still lag far behind the default speech synthesis engines on Windows or MacOS. Moreover, documentation is hard to find (at best) for configuring alternative speech synthesis engines and getting them to play nice with Linux's speech-dispatcher.

TTS on Linux is so bad that, for many years, I used the Microsoft Edge browser specifically because it includes human-sounding TTS voices. However, if you're reading this, you, like me, likely object to the unblockable telemetry and tracking built into Microsoft Edge (and recent builds of Edge disabled Azure TTS on Linux anyway).

Fortunately, TTS on Linux is moving forward, and I'm here to tell you that you can now have privacy-respecting, natural-sounding speech synthesis on Linux, without too many tears.

Piper TTS

Piper (https://github.com/rhasspy/piper) is a community project, primarily driven by Mike Hansen (synesthesiam), the developer behind Mimic3 from the now-defunct MycroftAI project. It sounds really good! You can listen to samples here.

Running Piper by itself (e.g. in Python or a Docker container) is well documented and super easy. What's less well documented is getting Linux's speech-dispatcher to use Piper.
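For example, once you have Piper and a voice model installed, a quick smoke test looks something like this (the binary is named piper-tts by the Arch package; other packages may call it piper, and the model path assumes the /opt/piper-tts/voices/ layout used later in this post):

$ echo 'Piper is working.' | piper-tts \
    --model /opt/piper-tts/voices/en_US-ryan-medium.onnx \
    --output_file test.wav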

Sam Thursfield managed to get it running, and documented his setup in a comment to his post, The State of screen reading on desktop Linux. I tried his config on my machine (running Arch Linux) but gave up after wasting an evening debugging it.

A month or two later, I found this comment on the Piper TTS package in the Arch User Repository, and that basically fixed it for me.

Below are the config steps that worked for me.

  1. Install Piper TTS. If you're on Arch, you can install a binary from the Arch User Repository with yay -S piper-tts-bin.
  2. Add the Piper module to speech-dispatcher: You need to tell speech-dispatcher to use Piper instead of espeak by adding an AddModule line for the module config you'll create in step 4. Your speech-dispatcher config is probably located at .config/speech-dispatcher/speechd.conf in your user's home directory; edit the path if yours lives somewhere else. You can also open the file in a GUI text editor and add the AddModule and AudioOutputMethod lines if cat isn't your thing. (If the file doesn't exist yet, see the note below the snippet.)
$ cat ~/.config/speech-dispatcher/speechd.conf
AddModule "piper-generic" "sd_generic" "piper-generic.conf"

AudioOutputMethod "pulse"
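If ~/.config/speech-dispatcher/speechd.conf doesn't exist yet, most distros ship a system-wide config you can copy into place and edit (the /etc path below is the usual location, but check your distro):

$ mkdir -p ~/.config/speech-dispatcher
$ cp /etc/speech-dispatcher/speechd.conf ~/.config/speech-dispatcher/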
  3. Download pretrained voice models for your language from the Piper voices page. Each voice model has two files: a .onnx model file and a .onnx.json config file. I'm storing these in /opt/piper-tts/voices/ (example download commands below).
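For en_US-ryan-medium, fetching the two files looks roughly like this (the Hugging Face URL layout below is an assumption based on the current repository structure and may change; copy the actual links from the voices page):

$ sudo mkdir -p /opt/piper-tts/voices
$ cd /opt/piper-tts/voices
$ sudo wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/medium/en_US-ryan-medium.onnx
$ sudo wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/medium/en_US-ryan-medium.onnx.json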
  4. Create the Piper TTS module config, which pipes text data through a script that calls Piper (we'll create that script in the next step). In the AddVoice lines, use the file name (without the file extension) of each voice model you downloaded. For example, to add en_US-ryan-medium.onnx, add a line that says AddVoice "US Ryan" "MALE1" "en_US-ryan-medium" in the config.
$ cat ~/.config/speech-dispatcher/modules/piper-generic.conf
Debug "1"

GenericExecuteSynth "env DATA=\'$DATA\' VOICE=\'$VOICE\' RATE=\'$RATE\' /opt/piper-tts/piper-pipe"

GenericCmdDependency "piper-tts"

# "MALE1" and "FEMALE1" are standard names defined in
# `src/modules/module_utils_addvoice.c`.
AddVoice "en-gb" "MALE1" "en-gb-alan-low"
AddVoice "en-gb" "FEMALE1" "en-gb-southern_english_female-low"
AddVoice "US Ryan" "MALE1" "en_US-ryan-medium"
AddVoice "US Joe" "MALE1" "en_US-joe-medium"
AddVoice "US Librtts" "FEMALE1" "en_US-libritts_r-medium"

DefaultVoice "en_US-libritts_r-medium"
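For reference, when a client speaks, the sd_generic module expands the GenericExecuteSynth line into a command along these lines (the values here are purely illustrative; in particular, the exact format of $RATE depends on your speech-dispatcher version):

env DATA='Hello world' VOICE='en_US-ryan-medium' RATE='0.00' /opt/piper-tts/piper-pipe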
  5. Create an executable script (referenced in the config above) at /opt/piper-tts/piper-pipe with the contents below (you'll probably need sudo to write to /opt). To make the file executable after you create it, run chmod +x /opt/piper-tts/piper-pipe.
#!/bin/bash
# Called by speech-dispatcher's generic module. The text to speak arrives
# in $DATA, the voice name in $VOICE, and the rate setting in $RATE.

VOICE_PATH="/opt/piper-tts/voices"

# "low" quality Piper voices are sampled at 16 kHz; "medium" and "high"
# voices are sampled at 22.05 kHz.
if [[ ${VOICE: -3} = low ]]; then
  ADJ_RATE=16000
else
  ADJ_RATE=22050
fi
# Approximate the requested speech rate by nudging aplay's sample rate:
# drop the last three characters of $RATE, then scale what's left.
ADJ_RATE=$((${RATE::-3} * 30 + ADJ_RATE))

# Stream raw 16-bit mono audio from Piper straight into aplay.
echo "$DATA" | piper-tts --model "$VOICE_PATH/$VOICE.onnx" --output-raw | \
  aplay -r "$ADJ_RATE" -f S16_LE -t raw -

wait
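Before restarting anything, it's worth sanity-checking the core pipeline the script wraps (again assuming the en_US-ryan-medium voice is in /opt/piper-tts/voices/):

$ echo "Testing the pipe script." | \
    piper-tts --model /opt/piper-tts/voices/en_US-ryan-medium.onnx --output-raw | \
    aplay -r 22050 -f S16_LE -t raw -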
  6. After you've finished the config, restart the speech-dispatcher service so it picks up the Piper module. You can restart by rebooting, or by running systemctl restart speech-dispatcher.service. Depending on your distro, you may need sudo, or to pass systemctl the --user flag.
  7. Test speech-dispatcher with spd-say "some text", or try the narration in Firefox's reader-mode.
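If nothing changed, check that the module registered and target it explicitly (flag names taken from spd-say --help):

$ spd-say -O                  # list output modules; piper-generic should appear
$ spd-say -o piper-generic -L # list the voices you added
$ spd-say -o piper-generic -y 'US Ryan' "Hello from Piper"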