Speech synthesis (TTS)#
Overview of the technology#
Your robot can generate speech using a text-to-speech (TTS) engine. As of PAL OS 25.01, two backends are available: Acapela and a non-verbal backend. Additionally, the multi-modal expression markup language can be used to synchronize speech with other communication modalities, such as gestures or lights, and with other advanced features.
Acapela backend#
The default TTS backend is the proprietary speech synthesis engine from Acapela Group.
This engine is based on unit selection, the market-leading technology for synthetic voices, and produces highly natural speech in a formal style. Given an input text utterance, the system performs the phonetic transcription of the text, predicts the appropriate prosody for the utterance, and finally generates the signal waveform [1].
Every time a text utterance is sent to the TTS engine, it generates the corresponding waveform and plays it through the robot's speakers.
Non-verbal backend#
The non-verbal backend is a TTS engine that generates an ‘R2D2’-like non-verbal utterance. This utterance is deterministically generated from the input text: the same input text will always produce the same output.
Using non-verbal TTS is useful when you choose to design a robot persona that is less anthropomorphic. In particular, it typically reduces the expectation that the robot can understand and respond to arbitrary spoken language.
To enable the non-verbal TTS, the TTS node parameter non_verbal_mode must be set to true.
To set it temporarily, the following command can be executed from the command line:
ros2 param set /tts_engine non_verbal_mode true
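You can check the current value of the parameter with the standard ROS 2 command:
ros2 param get /tts_engine non_verbal_mode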
To persist the parameter setting through robot reboots, see the section Configuration files.
Multi-modal expression markup language#
The multi-modal expression markup language is a feature added on top of TTS synthesis. Using markups inserted in the text to be synthesized, it integrates the speech synthesis with other robot functionalities.
The full markup action format is <verb name(arguments) timeout>. The arguments and timeout are optional, and the minimal markup action format is <verb name>.
The verb must be one of:
set: ‘start and forget’ the action; useful when you do not need to know if/when the action is completed
start: start an action
wait: wait for a previously started action to finish (the first one found backwards with the same name)
stop: stop an on-going action (the first one found backwards with the same name)
do: equivalent to start immediately followed by wait (i.e., blocks until the action is completed)
Markup actions which are started (not set) and are neither waited for nor stopped explicitly are implicitly waited for at the end of the multi-modal expression.
The currently supported actions are:
motion(name): perform the name predefined motion
expression(name): set the name predefined facial expression
led_fixed(r,g,b): set the LEDs to a fixed color, with RGB values in the range [0, 255]
led_blink(r,g,b): set the LEDs to a blinking color, with RGB values in the range [0, 255]
The timeout specifies the maximum number of seconds to wait for the execution of a markup action.
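For illustration, here are a few more expressions (the motion, expression and color values are examples and must correspond to motions and expressions actually installed on your robot): the first one nods before speaking (do blocks until the motion completes), the second one blinks the LEDs red while speaking, and the third one waves only while the sentence is being spoken.
<do motion(nod)> I agree!
<set led_blink(255,0,0)> My battery is getting low.
<start motion(wave)> Goodbye! <stop motion(wave)>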
Using markup actions, one can synchronize the speech with a facial expression or a gesture.
For instance, the expression:
<set expression(happy)> <start motion(wave)> Hello! <wait motion(wave) timeout=1> <set expression(neutral)>
will make the robot say “Hello!” while waving and with a happy expression, wait until the waving motion is finished (or 1 second has passed since the motion start), and then return to a neutral expression.
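A multi-modal expression is embedded directly in the text sent to the TTS engine. Assuming the /say action interface described below, the expression above could for instance be sent from the command line as:
ros2 action send_goal /say tts_msgs/TTS "input: '<set expression(happy)> <start motion(wave)> Hello! <wait motion(wave) timeout=1> <set expression(neutral)>'
locale: ''
voice: ''"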
Text-to-Speech node#
Launching the node#
The system diagnostics described in section Web user interface allow you to check the status of the TTS service running on the robot. The /say action is provided by the communication_hub node. These services are started by default on start-up, so normally there is no need to start them manually. To start or stop them, the following commands can be executed in a terminal opened on the multimedia computer of the robot:
pal module start tts_engine
pal module start communication_hub
pal module stop tts_engine
pal module stop communication_hub
Action interface#
See /say
Examples of usage#
Web user interface#
Command line#
Goals to the action server can be sent from the command line by typing:
ros2 action send_goal /say tts_msgs/TTS "i
Then, by pressing Tab, the required message type will be auto-completed. The fields of the goal message can be edited to synthesize the desired sentence, as in the following example:
ros2 action send_goal /say tts_msgs/TTS "input: 'Hello world!'
locale: ''
voice: ''"
Note
The locale
field can be used to select a specific language.
If left empty, the current system language will be used.
The voice
field can be used to select a specific voice.
The list of available locales and voices is printed by the tts_engine node on startup.
You can check it by running the following terminal command:
pal module log tts_engine | head -n 50
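For instance, to explicitly request a given language (assuming the corresponding locale, e.g. en_US, appears in that list):
ros2 action send_goal /say tts_msgs/TTS "input: 'Good morning!'
locale: 'en_US'
voice: ''"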