Results 2018

Results of the 2018 OMG-Emotion Recognition Challenge


Each entry lists the date, team, submission number, a description of the submission, repository and paper links, the CCC scores for arousal and valence, and the input modality.
05.2018 | ADSC | Submission 1
We only take the video frames as input. First, as preprocessing, we apply MTCNN to crop and align the faces, from which we randomly pick 16 images (kept in temporal order) per utterance as input. The base network, SphereFace, then outputs their features, which are passed to a bidirectional LSTM. Finally, after averaging the LSTM output features, a fully connected layer and a tanh layer regress the arousal and valence values. We use an MSE loss.
Repository: Link | Paper: Link | CCC Arousal: 0.244058303158 | CCC Valence: 0.437829242686 | Modality: Vision
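A minimal PyTorch-style sketch of the pipeline described in the ADSC entry above, assuming a pretrained SphereFace network is available as a frozen per-frame feature extractor (passed in here as `backbone`); the 512-dimensional embedding and 256-unit LSTM are illustrative values, not the team's reported configuration.

```python
import torch
import torch.nn as nn

class FrameSequenceRegressor(nn.Module):
    """BiLSTM over per-frame face embeddings, averaged, then FC + tanh -> (arousal, valence)."""
    def __init__(self, backbone, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.backbone = backbone              # e.g. a pretrained SphereFace feature extractor
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden_dim, 2), nn.Tanh())

    def forward(self, frames):                # frames: (batch, 16, C, H, W) aligned face crops
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # per-frame embeddings
        seq, _ = self.lstm(feats)             # (batch, 16, 2 * hidden_dim)
        return self.head(seq.mean(dim=1))     # average over time, regress arousal/valence

# Training follows the description above with a plain MSE loss:
# loss = nn.MSELoss()(model(frames), targets)
```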
05.2018 | ADSC | Submission 2
We keep the same network as submission #1 for video and add another network for audio. First, as preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Then 4 randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, a fully connected layer and a tanh layer regress the arousal and valence values. We use CCC as the joint training loss.
Repository: Link | Paper: Link | CCC Arousal: 0.236030982713 | CCC Valence: 0.442048822313 | Modality: Vision + Audio
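Several entries on this page (including this one) train directly against the evaluation metric. Below is a generic sketch of a CCC-based loss (1 minus the concordance correlation coefficient) in PyTorch; it illustrates the idea rather than any team's exact implementation.

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - concordance correlation coefficient, computed over a batch of scalar predictions."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1 - ccc

# Joint loss over both affect dimensions, e.g.:
# loss = ccc_loss(out[:, 0], arousal) + ccc_loss(out[:, 1], valence)
```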
05.2018 | audEERING | Submission 1
Valence: deep video features from VGGface with a (B)LSTM for sequence prediction. Arousal: openSMILE features with a (B)LSTM for sequence prediction. In both cases, samples are shuffled within the train and dev folds.
Repository: Link | Paper: Link | CCC Arousal: 0.2766972457 | CCC Valence: 0.257974288965 | Modality: Arousal: Audio, Valence: Video
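A rough sketch of the arousal branch as described: a bidirectional LSTM mapping a sequence of openSMILE acoustic feature vectors to a single affect value. The feature dimension (`feat_dim=130`) and hidden size are placeholders, and the openSMILE feature extraction itself happens outside this code.

```python
import torch
import torch.nn as nn

class AcousticBLSTMRegressor(nn.Module):
    """Bidirectional LSTM over a sequence of openSMILE feature vectors -> one affect value."""
    def __init__(self, feat_dim=130, hidden_dim=64):   # feat_dim depends on the openSMILE config used
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):            # x: (batch, time, feat_dim), one row per analysis frame
        seq, _ = self.blstm(x)
        return self.out(seq[:, -1])  # last time step as the utterance-level summary
```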
05.2018 | audEERING | Submission 2
We keep the same network as submission #1 for video and add another network for audio. First, as preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Then 4 randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, a fully connected layer and a tanh layer regress the arousal and valence values. We use CCC as the joint training loss.
Repository: Link | Paper: Link | CCC Arousal: 0.286322843804 | CCC Valence: 0.368516561137 | Modality: Arousal: Audio, Valence: Audio + Video
05.2018 | audEERING | Submission 3
We keep the same network as submission #1 for video and add another network for audio. First, as preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Then 4 randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, a fully connected layer and a tanh layer regress the arousal and valence values. We use CCC as the joint training loss.
Repository: Link | Paper: Link | CCC Arousal: 0.292541922664 | CCC Valence: 0.361123256087 | Modality: Audio
05.2018 | EMO-INESC | Submission 1
The implemented methodology is an ensemble of several models from two distinct modalities, namely video and text.
Repository: Link | Paper: Link | CCC Arousal: 0.14402485704 | CCC Valence: 0.344500137514 | Modality: Vision + Text
05.2018 | ExCouple | Submission 1
The proposed model is for the audio modality. All videos in the OMG-Emotion dataset are converted to WAV files. The process uses semi-supervised learning for emotion recognition: a GAN is trained without supervision on another database (IEMOCAP), and part of the GAN's autoencoder is reused as the audio representation. The audio spectrogram is extracted in 1-second windows at 16 kHz and serves as input to the representation model. This audio representation is then fed to a convolutional network and a dense layer with tanh activation that predicts the arousal and valence values. To combine the 1-second segments of each utterance, the median of the predicted values is taken.
Repository: Link | Paper: Link | CCC Arousal: 0.182249525233 | CCC Valence: 0.211179832035 | Modality: Audio
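A minimal sketch of the window-splitting and median aggregation described above, assuming 16 kHz audio; `predict_window` stands in for the team's spectrogram, GAN-derived representation, and convolutional regressor, which are not reproduced here.

```python
import numpy as np

def utterance_prediction(waveform, sr, predict_window, win_seconds=1.0):
    """Split an utterance into fixed 1-second windows, predict (arousal, valence) per window,
    and aggregate the per-window predictions with the median."""
    win = int(sr * win_seconds)
    windows = [waveform[i:i + win] for i in range(0, len(waveform) - win + 1, win)]
    if not windows:                                           # utterance shorter than one window
        windows = [waveform]
    preds = np.array([predict_window(w) for w in windows])    # shape: (n_windows, 2)
    return np.median(preds, axis=0)                           # per-utterance (arousal, valence)
```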
05.2018 | GammaLab | Submission 1
Prediction with a single model that uses multi-modal features.
Repository: Link | Paper: Link | CCC Arousal: 0.345427888879 | CCC Valence: 0.484332302684 | Modality: Vision + Audio
05.2018 | GammaLab | Submission 2
Prediction with an ensemble model composed of two different single models.
Repository: Link | Paper: Link | CCC Arousal: 0.355749630594 | CCC Valence: 0.496467636791 | Modality: Vision + Audio
05.2018 | GammaLab | Submission 3
Prediction with an ensemble model composed of three different single models.
Repository: Link | Paper: Link | CCC Arousal: 0.361186170656 | CCC Valence: 0.498808067516 | Modality: Vision + Audio
05.2018 | HKUST-NISL2018 | Submission 1
Early Fusion 0.
Link | CCC Arousal: 0.276576867376 | CCC Valence: 0.359358612774 | Modality: Vision + Audio + Text
05.2018 | HKUST-NISL2018 | Submission 2
Early Fusion 1.
Repository: Link | Paper: Link | CCC Arousal: 0.251271551784 | CCC Valence: 0.283373526753 | Modality: Vision + Audio + Text
05.2018 | HKUST-NISL2018 | Submission 3
Early Fusion 2.
Repository: Link | Paper: Link | CCC Arousal: 0.206451070326 | CCC Valence: 0.284609079282 | Modality: Vision + Audio + Text
05.2018 | iBug | Submission 1
We took the convolutional and pooling layers of VGG-FACE followed by a fully connected layer with 4096 units. Three RNNs (each with 2 hidden layers of 128 units) were stacked on top of it: i) the output of the last convolutional layer of VGG-FACE was given as input to the first RNN, ii) the output of the last pooling layer was given as input to the second RNN, and iii) the output of the fully connected layer was given as input to the last RNN. The outputs of the three RNNs were then concatenated and passed to the output layer, which produced the final estimates.
Repository: Link | Paper: Link | CCC Arousal: 0.118797460723 | CCC Valence: 0.389332186482 | Modality: Vision
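A compact sketch of the three-branch recurrent head described above. The VGG-FACE backbone is treated as a black box that exposes its last convolutional, last pooling, and fully connected activations per frame; GRUs stand in for the unspecified RNN type, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MultiLevelRNNHead(nn.Module):
    """Three GRUs, one per VGG-FACE feature level, concatenated into a single regression output."""
    def __init__(self, conv_dim, pool_dim, fc_dim=4096, hidden=128):
        super().__init__()
        self.rnn_conv = nn.GRU(conv_dim, hidden, num_layers=2, batch_first=True)
        self.rnn_pool = nn.GRU(pool_dim, hidden, num_layers=2, batch_first=True)
        self.rnn_fc = nn.GRU(fc_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(3 * hidden, 2)                  # arousal and valence

    def forward(self, conv_feats, pool_feats, fc_feats):     # each: (batch, time, dim)
        h1, _ = self.rnn_conv(conv_feats)
        h2, _ = self.rnn_pool(pool_feats)
        h3, _ = self.rnn_fc(fc_feats)
        fused = torch.cat([h1[:, -1], h2[:, -1], h3[:, -1]], dim=-1)
        return self.out(fused)
```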
05.2018 | iBug | Submission 2
We took the convolutional and pooling layers of VGG-FACE followed by a fully connected layer with 4096 units. The outputs of i) the last convolutional layer of VGG-FACE, ii) the last pooling layer, and iii) the fully connected layer were concatenated and given as input to a 2-layer RNN (128 units per layer) stacked on top. An output layer then produced the final estimates.
Repository: Link | Paper: Link | CCC Arousal: 0.123877229006 | CCC Valence: 0.390609470508 | Modality: Vision
05.2018 | iBug | Submission 3
-
Repository: Link | Paper: Link | CCC Arousal: 0.130682478094 | CCC Valence: 0.400427658916 | Modality: Vision
05.2018 | Jakobs Xlab | Submission 1
In the first step, we use the Facial Action Coding System (FACS), a fully standardized classification system that codes facial expressions based on the anatomic features of human faces. With FACS, any facial expression can be decomposed into a combination of elementary components called Action Units (AUs). In particular, we use the automated software Emotient FACET, a computer-vision program that provides frame-based estimates of the likelihood of 20 AUs.
Repository: Link | Paper: Link | CCC Arousal: 0.206481621848 | CCC Valence: 0.335819579381 | Modality: Vision
05.2018 | Jakobs Xlab | Submission 2
In the second step, we leverage an Echo State Network, a variant of recurrent neural network, to learn the mapping between facial expressions and valence-arousal values. The hyperparameters of the network are tuned with 5-fold cross-validation.
Repository: Link | Paper: Link | CCC Arousal: 0.209107772472 | CCC Valence: 0.346228315976 | Modality: Vision
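For readers unfamiliar with Echo State Networks, a minimal NumPy sketch of the general idea (a fixed random reservoir with a trained ridge-regression readout); reservoir size, spectral radius, and ridge penalty here are illustrative, not the tuned values from the submission.

```python
import numpy as np

class TinyESN:
    """Minimal Echo State Network: fixed random reservoir + ridge-regression readout."""
    def __init__(self, n_in, n_res=200, spectral_radius=0.9, ridge=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        self.W = W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))  # rescale reservoir
        self.ridge, self.W_out = ridge, None

    def _states(self, X):                       # X: (time, n_in) -> reservoir states (time, n_res)
        h, states = np.zeros(len(self.W)), []
        for x in X:
            h = np.tanh(self.W_in @ x + self.W @ h)
            states.append(h)
        return np.array(states)

    def fit(self, X, Y):                        # Y: (time, n_out), e.g. per-frame arousal/valence
        S = self._states(X)
        A = S.T @ S + self.ridge * np.eye(S.shape[1])
        self.W_out = np.linalg.solve(A, S.T @ Y)
        return self

    def predict(self, X):
        return self._states(X) @ self.W_out
```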
05.2018 | UMONS | Submission 1
Monomodal feature extraction.
Repository: Link | Paper: Link | CCC Arousal: 0.143405696862 | CCC Valence: 0.251250045128 | Modality: Vision + Audio + Text
05.2018 | UMONS | Submission 2
Contextual Multimodal.
Repository: Link | Paper: Link | CCC Arousal: 0.175848359811 | CCC Valence: 0.262762020192 | Modality: Vision + Audio + Text
05.2018 | WCG-WZ | Submission 1
Our models include a CNN-Face model, a CNN-Visual model, an LSTM-Visual model, and an SVR-Audio model, combined with SMLR (see the sketch after this entry).
CNN-Face/Visual model: First we extract the face from each video to construct a face video, and use Xception to extract features from each frame. We then average the features across frames. These features are passed through a three-layer multi-layer perceptron (MLP) for regression. The hidden layer has 1024 nodes with ReLU activation, and the output layer has a single node with sigmoid activation for arousal and linear activation for valence. The difference between the Face and Visual models is that the Visual model uses the raw video frames instead of the face video.
LSTM-Visual model: For each utterance, we down-sampled 20 frames uniformly in time (if an utterance had fewer than 20 frames, the first frame was repeated to make up 20 frames), and then used InceptionV3 to obtain a 20 × 2048 feature matrix. Next we applied a multi-layer long short-term memory (LSTM) network to extract the time-domain information, followed by an MLP with 512 hidden nodes and one output node for regression.
SVR-Audio model: We extracted 76 features for each video and used RReliefF to select features. The selected features are then used to train an SVR. The features we extracted can be found in the GitHub repository or the paper.
SMLR: SMLR first uses a spectral approach to estimate the accuracies of the base regression models on the testing dataset, and then uses a weighted average to combine the base regression models (the weights are the accuracies of the base models) to obtain the final prediction.
Repository: Link | Paper: Link | CCC Arousal: 0.149274843823 | CCC Valence: 0.161670034738 | Modality: Vision + Audio
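A minimal sketch of the final SMLR combination step only: SMLR's spectral estimation of each base model's accuracy is part of the method itself and is not reproduced here; the function below just shows how accuracy-weighted averaging combines the base regressors.

```python
import numpy as np

def combine_base_models(base_predictions, estimated_accuracies):
    """Weighted average of base regression models' outputs, with weights proportional to
    each model's estimated accuracy (SMLR estimates these with a spectral method, not shown).

    base_predictions: (n_models, n_samples) array of per-model predictions
    estimated_accuracies: (n_models,) array of non-negative accuracy estimates
    """
    w = np.asarray(estimated_accuracies, dtype=float)
    w = w / w.sum()                                        # normalise the weights
    return w @ np.asarray(base_predictions, dtype=float)   # (n_samples,) combined prediction
```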