Workshop description

Chalearn Workshop on Action, Gesture, and Emotion Recognition: Large Scale Multimodal Gesture Recognition and Real versus Fake expressed emotions @ICCV17

Venice, Italy

October 2017

Aims and scope: Many recent breakthroughs in computer vision have come from the availability of large labeled datasets, such as ImageNet [1], which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in detecting and classifying objects in static images. Compared with static image understanding, video-based understanding (such as action/gesture recognition) lags behind. The main problems are:

  1. Large-scale video-based datasets. For image analysis, many large-scale datasets are available, such as ImageNet for object recognition and localization, and CASIA-WEBFACE [2] and Ms-celeb-1M [3] for face recognition. For video analysis, however, one of the key bottlenecks to further progress has been the lack of real-world video datasets with the same scale and diversity as image datasets [4]. Fortunately, several large-scale video datasets have been released in the past two years, such as YouTube-8M [4] and Sports-1M [5]. Hopefully, the availability of these datasets (and the one we describe below) will push the state of the art in video analysis.
  2. Technical developments for video-based recognition. Compared with 2D static images, videos provide more information for recognizing objects and for understanding human actions and interactions with other objects. Improving video understanding can lead to better video search and discovery (much as image understanding helped re-imagine the photo experience). Another key bottleneck is therefore the difficulty of developing techniques for 3D/4D video analysis.
  3. Fake vs true emotion recognition from video sequences. Recognizing deceit and the authenticity of emotional displays is notoriously difficult for human observers because of the subtlety or short duration of discriminative facial responses. Applications are numerous, from detecting deceptive behavior in police investigations to improving border control by understanding the incongruity between what is expressed and what is experienced. For this challenge a new database, the SASE-FE database, has been prepared and labelled; it consists of 643 different videos recorded with a high-resolution GoPro-Hero camera. The challenges are recognition of fake vs true emotion in general, and recognition of fake vs true emotion within a specific emotion, e.g. fake vs true surprise.

As one of the important branches of video analysis of humans (named Looking at People), the recognition of gestures and human actions has become a research area of great interest, with many potential application domains including human-computer interfaces (HCI), virtual reality, augmented reality and sign language interpretation. We propose a workshop (with associated competitions) on action, gesture and emotion recognition. The workshop aims at compiling the latest efforts and research advances from the scientific community in enhancing traditional computer vision and pattern recognition algorithms with action/gesture analysis at both the learning and prediction stages. It also aims at compiling the state of the art in all aspects of action, gesture and emotion analysis, including methods for modeling and analyzing action, behavior, gesture, body, hand, face, emotion and multimodality.

Workshop topics and guidelines: The scope of the workshop comprises all aspects of video analysis and learning machines for gesture and action recognition, including but not limited to the following topics:

  • Group interactions and human-object interaction in videos
  • Early action/gesture detection/recognition
  • End-to-end action/gesture recognition
  • Multimodal emotion recognition
  • Fake vs true emotion recognition
  • Temporal domain based human behaviour analysis
  • One-shot/zero-shot learning for action/gesture recognition
  • Fusion-based methods for action/gesture recognition
  • Object detection and recognition in videos
  • Learning approaches for a large number of gesture/action classes
  • 3D hand capture and reconstruction
  • Background/foreground modeling for action and gesture recognition
  • Simultaneous gesture/action spotting and recognition
  • Applications of gesture and action recognition, such as smart surveillance systems and human-computer interaction
  • Applications of hand pose estimation in AR/VR


Submissions can be made through the CMT web page:




[2] Yi D, Lei Z, Liao S, et al. Learning face representation from scratch[J]. arXiv preprint arXiv:1411.7923, 2014.

[3] Guo Y, Zhang L, Hu Y, et al. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 87-102.

[4] Abu-El-Haija S, Kothari N, Lee J, et al. Youtube-8m: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.

[5] Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.

