Social perception#

The robot can detect and identify faces, detect 2D and 3D skeletons, perform speech and intent recognition, and fuse these various social signals to track multi-modal persons.

The robot’s social perception pipeline complies with the ROS4HRI standard (ROS REP-155).

Note that the entire pipeline runs on-board; no cloud-based services are used (and consequently, no Internet connection is required).

The following figure provides an overview of the pipeline:

*Figure: overview of the social perception pipeline.* Face detection (`hri_face_detect`) publishes `/humans/faces/tracked` and per-face topics (`/humans/faces/*/roi`, …). Skeleton tracking (`hri_fullbody`) publishes `/humans/bodies/tracked` and per-body topics (`/humans/bodies/*/roi`, …). Face identification (`hri_face_identification`, backed by the face database in `~/.pal/face_db`) publishes `/candidate_matches`, and voice processing publishes `/humans/voices/tracked` and per-voice topics (`/humans/voices/*/speech`, …). Multi-modal fusion (`hri_person_manager`) combines these streams into `/humans/persons/tracked` and per-person topics (`/humans/persons/*/face_id`, `/humans/persons/*/body_id`, `/humans/persons/*/voice_id`, …), which also feed the `people_facts` knowledge base.
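The REP-155 topic layout follows a regular pattern: each id published on a `tracked` list (for instance `/humans/faces/tracked`) has its own subtopics under `/humans/<kind>/<id>/`. The helper below is a hypothetical illustration of that naming convention only; it is not part of the robot's API.

```python
# Illustrative sketch of the REP-155 per-feature topic naming convention:
# every id listed on /humans/<kind>/tracked exposes subtopics such as
# /humans/<kind>/<id>/roi. The helper name is hypothetical.

def feature_topics(kind: str, ids: list[str], subtopics: list[str]) -> dict[str, list[str]]:
    """Map each tracked id to its per-feature subtopics under /humans/<kind>/."""
    return {i: [f"/humans/{kind}/{i}/{s}" for s in subtopics] for i in ids}

# Example: two tracked faces, each exposing a 'roi' subtopic.
topics = feature_topics("faces", ["f001", "f002"], ["roi"])
```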



These are the main limitations of the pal-sdk-23.1 social perception capabilities:

  • Person detection and face identification rely on external tools (Google MediaPipe and dlib). Like all vision-based algorithms, these tools do not always provide accurate estimates and may mis-detect or mis-classify people.

  • Body detection is currently single-body only;

  • Faces need to be within a ~2 m range of the robot to be detected;

  • No voice separation, identification or localisation is currently available: from the robot’s point of view, all speech comes from one and the same voice;

  • ARI does not yet implement the ‘group interactions’ part of the specification (e.g. no automatic group detection).
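To make the fusion stage more concrete, here is a minimal, hypothetical sketch of what `hri_person_manager` does conceptually: candidate matches (as published on `/candidate_matches`) link feature ids, e.g. a face to a body, with a confidence score, and features connected by sufficiently confident matches are grouped into a single person. The function name, ids and threshold below are illustrative assumptions, not the node's actual implementation.

```python
# Hypothetical sketch of multi-modal fusion: group feature ids (faces,
# bodies, voices) into persons using union-find over confident candidate
# matches. Illustrative only; not the hri_person_manager implementation.

def fuse_persons(matches: list[tuple[str, str, float]],
                 threshold: float = 0.5) -> list[set[str]]:
    """Group feature ids into persons; each match is (id_a, id_b, confidence)."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, confidence in matches:
        ra, rb = find(a), find(b)          # register both ids
        if confidence >= threshold:
            parent[ra] = rb                # merge the two groups

    groups: dict[str, set[str]] = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Example: a confident face/body match forms one person, while a
# low-confidence voice match is kept as a separate (anonymous) person.
persons = fuse_persons([("face_f001", "body_b001", 0.9),
                        ("voice_v001", "body_b001", 0.2)])
```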

General documentation#

Tutorials and how-tos#

API reference#