How-to: Speech synthesis (TTS)¶
Overview¶
The robot is able to generate speech using a text-to-speech (TTS) engine. As of PAL OS edge, two backends are available, namely Acapela and a non-verbal backend. Additionally, the multi-modal expression markup language can be used to synchronize the speech with other communication modalities, like gestures or lights, and to access other advanced features.
ROS interface¶
There are two main nodes involved in the TTS pipeline:
* tts_engine node, which is responsible for generating the speech and publishing it on the /audio_out/raw topic, from which it is played on the robot’s loudspeakers. Two plugins are currently available for this node: the Acapela backend and the Non-verbal backend.
* communication_hub node, which among other things hosts the skill /say, the main interface to make PAL’s robots express themselves. It executes multi-modal expressions, which are essentially utterances (spoken through the above tts_engine) with synchronized actions (like gestures or lights) and pauses. A Multi-modal expression markup language is used to specify the utterance and the actions to be performed.
tts_engine is a localized node, meaning its current language selection is controlled by the i18n_manager. Refer to [‼️ROS 1] Internationalisation and language support for more information on language availability and selection.
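You can quickly verify that both nodes are up and inspect the audio output topic with the standard ROS 2 introspection commands (a minimal sanity check, assuming you have a terminal open on the robot):
ros2 node list | grep -e tts_engine -e communication_hub
ros2 topic info /audio_out/raw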
Web interface¶
The Web User Interface provides the status of the tts_engine node under Diagnostics > Communication > TTS > tts_engine.
There you can check, among other things:
* Current engine: the engine currently loaded;
* Current locale: the language used by default;
* Current voice: the synthetic voice used by default.
The Diagnostics > Communication > Manager > communication_hub section provides the status of the communication_hub node instead.
Many of the values here are related to the How-to: Dialogue management, but you can specifically check the Expression N values, which monitor the status of the multi-modal expressions. Note that the multi-modal expressions may originate both from calls to /say and from the How-to: Dialogue management system.
TTS backends¶
The backend is selected by the tts_engine parameter non_verbal_mode: if true, the Non-verbal backend is used; otherwise, the Acapela backend is loaded.
To set it temporarily, the following command can be executed from the command line:
ros2 param set /tts_engine non_verbal_mode true
To persist the parameter setting through robot reboots, see the section Configuration files.
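To check which backend is currently active, you can read the parameter back:
ros2 param get /tts_engine non_verbal_mode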
Acapela backend¶
The default TTS backend uses the proprietary speech synthesis engine from Acapela Group.
The technology used in this engine is among the market-leading synthetic voice solutions. It is based on unit selection and produces highly natural speech in a formal style. The system generates speech output from an input text utterance [1]: it performs the phonetic transcription of the text, predicts the appropriate prosody for the utterance, and finally generates the signal waveform.
Every time a text utterance is sent to the text-to-speech (TTS) engine, it generates the corresponding waveform and plays it through the robot’s speakers.
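While the robot is speaking, you can verify that audio is actually being produced by monitoring the publication rate of the output topic (a quick check using the standard ROS 2 tools):
ros2 topic hz /audio_out/raw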
Non-verbal backend¶
The non-verbal backend is a TTS engine that generates an ‘R2D2’-like non-verbal utterance. This utterance is deterministically generated from the input text: the same input text will always generate the same output.
Using non-verbal TTS is useful when you choose to design a robot persona that is less anthropomorphic. In particular, it will typically reduce the expectation that the robot is able to understand and respond to arbitrary spoken language.
Multi-modal expression markup language¶
The multi-modal expression markup language is a feature added on top of TTS synthesis. Using markups inserted in the text to be synthesized, it integrates the speech synthesis with other robot functionalities.
The full markup action format is <verb name(arguments) timeout>. The arguments and timeout are optional, and the minimal markup action format is <verb name>.
The verb must be one of:
* set: ‘start and forget’ the action; useful when you do not need to know if/when the action is completed
* start: start an action
* wait: wait for a previously started action to finish (the first one found backwards with the same name)
* stop: stop an on-going action (the first one found backwards with the same name)
* do: equivalent to start immediately followed by wait (i.e., blocks until the action is completed)
Markup actions that are started (not set) and are neither waited for nor stopped explicitly are implicitly waited for at the end of the multi-modal expression.
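For example, the two expressions below are equivalent: in the first, the motion is waited for explicitly; in the second, it is implicitly waited for at the end of the expression (the wave motion is the one used in the examples further down):
<start motion(wave)> Hello! <wait motion>
<start motion(wave)> Hello!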
The currently supported actions are:
* motion(name): perform the name predefined motion
* expression(name): set the name predefined facial expression
* led_fixed(r,g,b): set the LEDs to a fixed color with RGB values in the range [0, 255]
* led_blink(r,g,b): set the LEDs to a blinking color with RGB values in the range [0, 255]
The timeout specifies the maximum number of seconds to wait for the execution of a markup action.
Using markup actions, one can synchronize the speech with a facial expression or a gesture.
For instance, the expression:
<set expression(happy)> <start motion(wave)> Hello! <wait motion timeout=1> <set expression(neutral)>
will make the robot say “Hello!” while waving and with a happy expression,
wait until the waving motion is finished (or 1 second has passed since the motion start),
and then return to a neutral expression.
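The LED actions can be combined with speech in the same way; for instance (the color values below are arbitrary illustrations, not predefined robot colors):
<set led_fixed(0,255,0)> Everything is ready. <set led_blink(255,0,0)> Please check the battery.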
Note
The communication_hub node provides the parameter disabled_markup_actions to disable the execution of specific markup actions. By default, for safety reasons, the motion(name) action is included in this list and is therefore disabled.
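You can inspect the current list of disabled markup actions with the standard parameter tools (and change it at runtime, provided you understand the safety implications):
ros2 param get /communication_hub disabled_markup_actions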
Built-in actions¶
These actions resemble the standard markup actions semantically and syntactically, but are exceptions to the rules above.
Currently, the only built-in action is:
* <pause(time)>: the utterance is paused for the specified time in seconds
For instance, the expression:
<start motion(wave)> Hello! <pause(2)> Anything new going on? <wait motion>
will make the robot stay silent for 2 seconds after saying “Hello!”.
Check the TTS from the terminal¶
Goals can be sent to the /say action server from the command line by typing:
ros2 action send_goal /say communication_skills/action/Say "m
Then, by pressing Tab, the goal message will be auto-completed. Its fields, in particular input, can then be edited to synthesize the desired sentence, as in the following example:
ros2 action send_goal /say communication_skills/action/Say "meta:
caller: ''
priority: 0
person_id: ''
group_id: ''
input: 'Hello world!'"
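To monitor the progress of the expression while it is executed, the standard --feedback flag of the ros2 action CLI can be added. As a shorter sketch (not taken verbatim from the PAL documentation), fields left out of the goal simply keep their default values:
ros2 action send_goal --feedback /say communication_skills/action/Say "input: 'Hello world!'"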
Note
The locale field can be used to select a specific language. If left empty, the current system language will be used.
The voice field can be used to select a specific voice.
The list of available locales and voices is printed by the tts_engine node on startup.
You can check it by running the following terminal command:
pal module log tts_engine | head -n 50
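Since the startup log can be long, you may also filter it for the relevant lines (this assumes the log actually mentions the words ‘locale’ and ‘voice’):
pal module log tts_engine | grep -i -e locale -e voice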