Dataset description


Motivation

The UDIVA dataset aims to move beyond automatic detection of individual behavior and foster the development of automatic approaches to study and understand the mechanisms of influence, perception, and adaptation to verbal and nonverbal social signals in dyadic interactions, taking into account individual and dyad characteristics as well as other contextual factors. To the best of our knowledge, no comparable publicly available, non-acted, face-to-face dyadic dataset exists in the field in terms of number of views, participants, tasks, recorded sessions, and context labels.

UDIVA Statistics

The UDIVA dataset is composed of 90.5 hours of recordings of dyadic interactions between 147 voluntary participants (55.1% male), aged 4 to 84 years (mean = 31.29) and coming from 22 countries (68% from Spain). The majority of participants were students (38.8%) and identified themselves as white (84.4%). Participants were distributed into 188 dyadic sessions, with an average of 2.5 sessions per participant (max. 5 sessions). The most common interaction group is Male-Male/Young-Young/Unknown (15%), and 43% of the interactions took place between people who already knew each other. Spanish is the majority language of interaction (71.8%), followed by Catalan (19.7%) and English (8.5%). In half of the sessions, both interlocutors have Spain as their country of origin. The data was acquired using 6 HD tripod-mounted cameras (1280×720 px, 25 fps), 1 lapel microphone per participant, and an omnidirectional microphone on the table. Each participant also wore an egocentric camera (1920×1080 px, 30 fps) around their neck and a heart rate monitor on their wrist. All capturing devices are time-synchronized, and the tripod-mounted cameras are calibrated. Figure 1 illustrates the recording setup and the different views of the UDIVA dataset. Figure 2 illustrates the different contexts (i.e., tasks) of the UDIVA dataset.

Figure 1: Recording environment. We used six tripod-mounted cameras, namely GB: General Back camera, GF: General Frontal camera, HA: individual High Angle cameras, and FC: individual Frontal cameras, plus two egocentric cameras E (one per participant, placed around their neck). a) Position of cameras, general microphone and participants. b) Example of the 8 time-synchronized views.

Figure 2: Examples of the 5 tasks included in the UDIVA dataset from 5 sessions. From left to right: Talk, Lego, Animals, Ghost, Gaze.

Data protection and ethics

The UDIVA dataset is currently stored in a secured server at the Computer Vision Center, Barcelona, Spain. Data collection and storage have been performed in compliance with General Data Protection Regulation (GDPR) and under ethical approval issued by the Universitat de Barcelona Bioethics department. As authors of the dataset, we confirm that we have permission to release it. A Dataset License will be attached to the large-scale dataset contents. Participants signed a consent form prior to the start of the recordings in which they granted their consent to share the data with the community for non-commercial purposes; therefore, only users belonging to academic or research organisations will be able to request access to the dataset. Users may only use the dataset after the Dataset License has been signed and returned to the dataset administrators. Users may not transfer, distribute or broadcast the dataset or portions thereof in any way. Users may use portions or the totality of the dataset provided they acknowledge such usage in their publications by citing the dataset release paper.

UDIVA v0.5 Dataset

The UDIVA v0.5 dataset is a preliminary version of the UDIVA dataset, including a subset of the participants, sessions, synchronized views, and annotations of the complete UDIVA dataset. It is composed of 145 dyadic interaction sessions, each divided into 4 different tasks: Talk, Lego, Ghost, and Animals. These sessions were performed by 134 participants (ranging from 17 to 75 years old, 55.2% male), who could participate in up to 5 sessions with different partners. Spanish is the majority language of interaction (73.1%), followed by Catalan (17.25%) and English (9.65%).

The UDIVA v0.5 dataset consists of the subset of recordings and metadata used for the evaluation of the context-aware personality inference method presented here. In addition to that data, we also provide the transcripts of the session conversations and a set of automatically extracted annotations, namely:

  • Face landmarks: 68 facial fiducials were extracted using the 3DDFA_v2 algorithm, along with the detection confidence provided by its face detector (Faceboxes). Additionally, each frame's landmarks were smoothed by averaging them with those of the immediately previous and next frames (see the sketch after this list).

  • Body landmarks: full-body joints and detection confidences were extracted using the MeTRAbs method. The x and y coordinates of the 3D landmarks provided by MeTRAbs were matched to the 2D landmarks by finding the 3D transformation that solves the corresponding least squares problem (click here to download the Python script; a rough sketch is also given after this list).

  • Hand landmarks: hand landmarks and detection confidences were extracted with FrankMocap. Additionally, several post-processing steps were applied to improve their quality:

    1. Hand detections needed for the landmark extraction were tracked with SiamRPN++ when temporal gaps or large spatial gaps were found.

    2. Body pose and hand landmarks were leveraged to ensure that only the hands of the person of interest were detected.

    3. For the errors identified, a second landmark extraction was run.

  • 3D eye gaze vectors: gaze vectors were extracted with ETH-XGaze, based on the previously extracted face landmarks.
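
As a rough illustration of the smoothing applied to the face landmarks, the following sketch averages each frame's landmarks with those of the immediately previous and next frames. It is only a sketch: the array layout (one 68×2 landmark set per frame) and the function name are our own assumptions, not the released file format.

    import numpy as np

    def smooth_face_landmarks(landmarks: np.ndarray) -> np.ndarray:
        """Average each frame's 68 face landmarks with the immediately
        previous and next frames. `landmarks` has shape (num_frames, 68, 2);
        the first and last frames only use the neighbours that exist."""
        smoothed = landmarks.astype(np.float64).copy()
        for t in range(landmarks.shape[0]):
            lo = max(t - 1, 0)
            hi = min(t + 1, landmarks.shape[0] - 1)
            smoothed[t] = landmarks[lo:hi + 1].mean(axis=0)
        return smoothed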
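
Similarly, the 3D-to-2D matching described for the body landmarks can be posed as a linear least squares problem. The snippet below fits an affine map from the MeTRAbs 3D joints to the 2D image landmarks with np.linalg.lstsq; it is a minimal sketch of one way to set up such a fit, and the Python script linked above remains the authoritative implementation.

    import numpy as np

    def fit_3d_to_2d(joints_3d: np.ndarray, joints_2d: np.ndarray) -> np.ndarray:
        """Least-squares fit of an affine map from 3D joints (N, 3) to 2D
        image landmarks (N, 2). Returns a (4, 2) matrix acting on
        homogeneous 3D points."""
        ones = np.ones((joints_3d.shape[0], 1))
        X = np.hstack([joints_3d, ones])                    # (N, 4) homogeneous points
        A, *_ = np.linalg.lstsq(X, joints_2d, rcond=None)   # minimizes ||X @ A - joints_2d||^2
        return A

    def apply_3d_to_2d(joints_3d: np.ndarray, A: np.ndarray) -> np.ndarray:
        """Map 3D joints to 2D coordinates using the fitted affine map."""
        ones = np.ones((joints_3d.shape[0], 1))
        return np.hstack([joints_3d, ones]) @ A             # (N, 2) matched coordinates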

Talk videos from the validation and test sets underwent visual inspection in order to assess the accuracy of the extracted annotations. For each frame, raters manually checked the face, body and hand landmarks (the gaze vectors were not assessed):

  • Face landmarks were discarded (valid flag set to False) when either the face orientation or the position of the landmarks was even slightly wrong.

  • Body landmarks were discarded (valid flag set to False) when either the overall pose was considered incorrect or one side of the body was strongly displaced from the real joint positions.

  • Hand landmarks:

    • Landmarks were discarded (valid flag set to False) when fingers were strongly displaced. Mild finger displacements were tolerated when the overall orientation and hand placement were correct.

    • Landmarks of hands hidden under the table (false positives) were discarded.

    • Mismatched hands with switched left/right labels were swapped.

    • For frames within periods of time (t0 to t) in which the participant did not move their hands but the predicted landmarks were wrong (e.g., due to hand interactions), landmarks were interpolated using those from frames t0-1 and t+1, provided that the rater considered such interpolation would yield landmarks fulfilling our accuracy standards (see the sketch below).
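
The interpolation mentioned in the last rule can be sketched as follows. This is our own illustration of the idea (the array layout and function name are assumptions, not the released annotation format): landmarks for the frozen span t0..t are linearly interpolated between the last accepted frame t0-1 and the first accepted frame t+1.

    import numpy as np

    def interpolate_hand_span(landmarks: np.ndarray, t0: int, t: int) -> np.ndarray:
        """Replace the landmarks of frames t0..t (inclusive) by linear
        interpolation between frame t0-1 and frame t+1.
        `landmarks` has shape (num_frames, num_joints, 2) for one hand."""
        fixed = landmarks.astype(np.float64).copy()
        start, end = landmarks[t0 - 1], landmarks[t + 1]
        steps = (t + 1) - (t0 - 1)          # distance between the two anchor frames
        for i, frame in enumerate(range(t0, t + 1), start=1):
            alpha = i / steps               # 0 < alpha < 1 inside the span
            fixed[frame] = (1 - alpha) * start + alpha * end
        return fixed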

Data Structure and Annotations

Detailed information about the UDIVA v0.5 data structure and annotations is given here.

Dataset access

The UDIVA v0.5 dataset is available upon request. Users will have access to the dataset after the Dataset License has been signed and returned to the dataset administrators. Please read the License rules carefully for detailed guidance on completing the license, returning it to the administrators, and using the data properly. For any dataset inquiries, please contact us at UDIVA [at] ub.edu (the udiva [at] cvc.uab.cat email is no longer in use).

News


UDIVA v0.5 dataset description released

The detailed dataset description can now be accessed here.