Applies to: TIAGo Pro, TIAGo Head, TIAGo, ARI

Speech synthesis (TTS)#

Overview of the technology#

Your robot is able to generate speech using a text-to-speech (TTS) engine. As of PAL OS 25.01, two backends are available: Acapela and a non-verbal backend. Additionally, the multi-modal expression markup language can be used to synchronize speech with other communication modalities, such as gestures or lights.

Acapela backend#

The default TTS engine used by the robot uses the proprietary speech synthesis engine from Acapela Group.

This engine is based on unit selection, a market-leading technology for synthetic voices, and produces highly natural speech in a formal style. Given an input text utterance [1], it performs the phonetic transcription of the text, predicts the appropriate prosody for the utterance, and finally generates the signal waveform.

Every time a text utterance is sent to the text-to-speech (TTS) engine, it generates the corresponding waveform and plays it through the robot's speakers.

Non-verbal backend#

The non-verbal backend is a TTS engine that generates an ‘R2D2’-like non-verbal utterance. This utterance is deterministically generated from the input text: the same input text will always produce the same output.

Using non-verbal TTS is useful when you choose to design a robot persona that is less anthropomorphic. In particular, it will typically reduce the expectation that the robot is able to understand and respond to arbitrary spoken language.
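The key property stated above is determinism: the same input text always yields the same non-verbal utterance. The actual algorithm used by the backend is not documented here, but the property itself can be sketched with a hypothetical mapping from text to beep frequencies (all names and the hashing scheme are illustrative assumptions, not the engine's real implementation):

```python
import hashlib

def nonverbal_tones(text, n_tones=8, f_min=300, f_max=3000):
    """Illustrative sketch: map text deterministically to beep frequencies (Hz).

    This is NOT the robot's actual algorithm; it only demonstrates the
    'same input text, same output' behavior described above.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    span = f_max - f_min
    # Each tone is derived from one digest byte, so the sequence is a pure
    # function of the input text.
    return [f_min + digest[i % len(digest)] * span // 255 for i in range(n_tones)]

# Deterministic: repeated calls with the same text yield the same tones.
assert nonverbal_tones("Hello world!") == nonverbal_tones("Hello world!")
```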

To enable the non-verbal TTS, the TTS node parameter non_verbal_mode must be set to true. To set it temporarily, run the following command from the command line:

ros2 param set /tts_engine non_verbal_mode true

To persist the parameter setting through robot reboots, see the section Configuration files.

Multi-modal expression markup language#

The multi-modal expression markup language is a feature added on top of TTS synthesis. Using markups inserted in the text to be synthesized, it integrates the speech synthesis with other robot functionalities.

The full markup action format is <verb name(arguments) timeout>. The arguments and timeout are optional, and the minimal markup action format is <verb name>.

The verbs must be one of:

  • set : ‘start and forget’ the action; useful when you do not need to know if/when the action is completed

  • start : start an action

  • wait : wait for a previously started action to finish (the first one found backwards with the same name)

  • stop : stop an on-going action (the first one found backwards with the same name)

  • do : equivalent to start immediately followed by wait (i.e., blocks until the action is completed)

Markup actions that are started (not set) and are neither explicitly waited for nor stopped are implicitly waited for at the end of the multi-modal expression.

The currently supported actions are:

  • motion(name) : perform the predefined motion called name

  • expression(name) : set the predefined facial expression called name

  • led_fixed(r,g,b) : set the leds to a fixed color with RGB values in the range [0, 255]

  • led_blink(r,g,b) : set the leds to a blinking color with RGB values in the range [0, 255]

The timeout specifies the maximum number of seconds to wait for the execution of a markup action.

Using markup actions, one can synchronize speech with a facial expression or a gesture. For instance, the expression <set expression(happy)> <start motion(wave)> Hello! <wait motion(wave) timeout=1> <set expression(neutral)> will make the robot say “Hello!” while waving and with a happy expression, wait until the waving motion is finished (or 1 second has passed since the motion started), and then return to a neutral expression.
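Expressions like the one above are plain strings, so they can be composed programmatically before being sent to the /say action. The helper below is a hypothetical convenience function (not part of any PAL API); it only builds the markup string, and sending it to the robot is done separately, e.g. via the ROS 2 CLI shown later in this page:

```python
def say_with(expression_name, motion_name, text, timeout=1):
    """Hypothetical helper: compose a multi-modal expression that shows a
    facial expression, performs a motion while speaking, waits for the
    motion (up to `timeout` seconds), then returns to neutral."""
    return (
        f"<set expression({expression_name})> "
        f"<start motion({motion_name})> "
        f"{text} "
        f"<wait motion({motion_name}) timeout={timeout}> "
        f"<set expression(neutral)>"
    )

goal_text = say_with("happy", "wave", "Hello!")
```

The resulting string reproduces the “Hello!” example above and could be passed as the input field of a TTS goal.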

Text-to-Speech node#

Launching the node#

The system diagnostics described in section Web user interface allow you to check the status of the TTS service running on the robot. The /say action is provided by the communication_hub node. These services are started by default on start-up, so normally there is no need to start them manually. To start or stop them, run the following commands in a terminal opened on the multimedia computer of the robot:

pal module start tts_engine
pal module start communication_hub

pal module stop tts_engine
pal module stop communication_hub

Action interface#

See /say

Examples of usage#

Web user interface#

Command line#

Goals to the action server can be sent through command line by typing:

ros2 action send_goal /say tts_msgs/TTS "i

Then, pressing Tab will auto-complete the required goal message fields. The fields can be edited to synthesize the desired sentence, as in the following example:

ros2 action send_goal /say tts_msgs/TTS "input: 'Hello world!'
locale: ''
voice: ''"

Note

The locale field can be used to select a specific language. If left empty, the current system language will be used. The voice field can be used to select a specific voice. The list of available locales and voices is printed by the tts_engine node on startup. You can check it by running the following terminal command: pal module log tts_engine | head -n 50