Results 2018
Results of the 2018 OMG-Emotion Recognition Challenge
Date | Team | Submission | Repository | Paper | CCC Arousal | CCC Valence | Modality |
---|---|---|---|---|---|---|---|
05.2018 | ADSC | 1: We take only the video frames as input. As preprocessing, we apply MTCNN to crop and align faces, from which we randomly pick 16 images (in temporal order) per utterance as input. The base network SphereFace then outputs their features, followed by a bidirectional LSTM. Finally, after averaging the LSTM output features, an FC layer and a tanh layer regress the arousal-valence values. We use an MSE loss. | Link | Link | 0.244058303158 | 0.437829242686 | Vision |
05.2018 | ADSC | 2: We keep the same network as submission 1 for video and add another network for audio. As preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Four randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, an FC layer and a tanh layer regress the arousal-valence values. We use CCC as the joint training loss (see the CCC sketch after the table). | Link | Link | 0.236030982713 | 0.442048822313 | Vision + Audio |
05.2018 | audEERING | 1: Valence: deep video features from VGGface with a (B)LSTM for sequence prediction; samples are shuffled within the train and dev folds. Arousal: openSMILE features with a (B)LSTM for sequence prediction; samples are shuffled within the train and dev folds. | Link | Link | 0.2766972457 | 0.257974288965 | Arousal: Audio, Valence: Video |
05.2018 | audEERING | 2: We keep the same network as submission 1 for video and add another network for audio. As preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Four randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, an FC layer and a tanh layer regress the arousal-valence values. We use CCC as the joint training loss. | Link | Link | 0.286322843804 | 0.368516561137 | Arousal: Audio, Valence: Audio + Video |
05.2018 | audEERING | 3: We keep the same network as submission 1 for video and add another network for audio. As preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Four randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, an FC layer and a tanh layer regress the arousal-valence values. We use CCC as the joint training loss. | Link | Link | 0.292541922664 | 0.361123256087 | Audio |
05.2018 | EMO-INESC | 1: The implemented methodology is an ensemble of several models from two distinct modalities, namely video and text. | Link | Link | 0.14402485704 | 0.344500137514 | Vision + Text |
05.2018 | ExCouple | 1: The proposed model covers the audio modality. All videos in the OMG-Emotion dataset are converted to WAV files. We use semi-supervised learning for emotion recognition: a GAN is trained with unsupervised learning on another database (IEMOCAP), and part of the GAN autoencoder is used for the audio representation. Audio spectrograms are extracted in 1-second windows at 16 kHz and serve as input to the audio representation model. This representation feeds a convolutional network and a dense layer with tanh activation that predicts the arousal and valence values. To combine the 1-second audio segments of each utterance, the median of the predicted values is taken. | Link | Link | 0.182249525233 | 0.211179832035 | Audio |
05.2018 | GammaLab | 1: Prediction with a single model that uses multi-modal features. | Link | Link | 0.345427888879 | 0.484332302684 | Vision + Audio |
05.2018 | GammaLab | 2: Prediction with an ensemble composed of two different single models. | Link | Link | 0.355749630594 | 0.496467636791 | Vision + Audio |
05.2018 | GammaLab | 3: Prediction with an ensemble composed of three different single models. | Link | Link | 0.361186170656 | 0.498808067516 | Vision + Audio |
05.2018 | HKUST-NISL2018 | 1: Early Fusion 0. | Link | | 0.276576867376 | 0.359358612774 | Vision + Audio + Text |
05.2018 | HKUST-NISL2018 | 2: Early Fusion 1. | Link | Link | 0.251271551784 | 0.283373526753 | Vision + Audio + Text |
05.2018 | HKUST-NISL2018 | 3: Early Fusion 2. | Link | Link | 0.206451070326 | 0.284609079282 | Vision + Audio + Text |
05.2018 | iBug | 1: We took the convolutional and pooling layers of VGG-FACE followed by a fully connected layer with 4096 units. Three RNNs (each with 2 hidden layers of 128 units) were stacked on top: (i) the output of the last convolutional layer of VGG-FACE was given as input to the first RNN, (ii) the output of the last pooling layer to the second RNN, and (iii) the output of the fully connected layer to the last RNN. The outputs of the RNNs were then concatenated and passed to the output layer that gave the final estimates. | Link | Link | 0.118797460723 | 0.389332186482 | Vision |
05.2018 | iBug | 2: We took the convolutional and pooling layers of VGG-FACE followed by a fully connected layer with 4096 units. The outputs of (i) the last convolutional layer of VGG-FACE, (ii) the last pooling layer of VGG-FACE, and (iii) the fully connected layer were concatenated and given as input to a 2-layer RNN (each layer with 128 units) stacked on top, followed by the output layer that gave the final estimates. | Link | Link | 0.123877229006 | 0.390609470508 | Vision |
05.2018 | iBug | 3: - | Link | Link | 0.130682478094 | 0.400427658916 | Vision |
05.2018 | Jakobs Xlab | 1: In the first step, we use the Facial Action Coding System (FACS), a fully standardized classification system that codes facial expressions based on anatomic features of human faces. With FACS, any facial expression can be decomposed into a combination of elementary components called Action Units (AUs). In particular, we use the automated software Emotient FACET, a computer vision program that provides frame-based estimates of the likelihood of 20 AUs. | Link | Link | 0.206481621848 | 0.335819579381 | Vision |
05.2018 | Jakobs Xlab | 2: In the second step, we leverage an Echo State Network, a variant of recurrent neural network, to learn the mapping between facial expressions and valence-arousal values. The hyperparameters of the network are tuned with 5-fold cross-validation. | Link | Link | 0.209107772472 | 0.346228315976 | Vision |
05.2018 | UMONS | 1: Monomodal feature extraction. | Link | Link | 0.143405696862 | 0.251250045128 | Vision + Audio + Text |
05.2018 | UMONS | 2: Contextual Multimodal. | Link | Link | 0.175848359811 | 0.262762020192 | Vision + Audio + Text |
05.2018 | WCG-WZ | 1: Our models include a CNN-Face model, a CNN-Visual model, an LSTM-Visual model, and an SVR-Audio model. CNN-Face/Visual model: we first extract the face from each video to construct a face video and use Xception to extract features from each frame. The features are averaged over frames and passed through a three-layer multi-layer perceptron (MLP) for regression; the hidden layer has 1024 nodes with ReLU activation, and the output layer has a single node with sigmoid activation for arousal and linear activation for valence. The Visual variant differs from the Face variant in that it uses raw video frames instead of the face video. LSTM-Visual model: for each utterance, we down-sample 20 frames uniformly in time (if an utterance has fewer than 20 frames, the first frame is repeated to make up 20) and use InceptionV3 to obtain a 20 × 2048 feature matrix. A multi-layer long short-term memory (LSTM) network then extracts the temporal information, followed by an MLP with 512 hidden nodes and one output node for regression. SVR-Audio model: we extract 76 features per video and use RReliefF for feature selection; the selected features are used to train an SVR. The extracted features are listed in the GitHub repository and the paper. SMLR: SMLR first uses a spectral approach to estimate the accuracies of the base regression models on the test set and then combines them with a weighted average (the weights are the accuracies of the base models) to obtain the final prediction. | Link | Link | 0.149274843823 | 0.161670034738 | Vision + Audio |
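
For reference, the concordance correlation coefficient (CCC) reported in the two score columns above, and used by several teams (e.g. ADSC submission 2) as a training objective, can be computed as below. This is a minimal NumPy sketch, not any team's implementation; the function name and the small stabilizing constant are our own choices.

```python
import numpy as np

def ccc(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D arrays."""
    pred_mean, gold_mean = predictions.mean(), labels.mean()
    pred_var, gold_var = predictions.var(), labels.var()
    covariance = np.mean((predictions - pred_mean) * (labels - gold_mean))
    return 2.0 * covariance / (pred_var + gold_var + (pred_mean - gold_mean) ** 2 + 1e-8)
```

Used as a loss, one typically minimizes `1 - ccc(predictions, labels)`, so that maximizing agreement with the annotations minimizes the loss.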
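
Several of the vision pipelines above (e.g. ADSC's SphereFace features with a bidirectional LSTM, or WCG-WZ's InceptionV3 features with an LSTM) follow the same pattern: a fixed number of per-frame CNN features goes through a recurrent layer, is pooled over time, and is regressed to arousal and valence through a tanh output. The PyTorch sketch below shows only that shared pattern under assumed sizes (feature dimension 512, hidden size 128); it is not any team's released code.

```python
import torch
import torch.nn as nn

class SequenceRegressor(nn.Module):
    """BiLSTM over pre-extracted per-frame features -> mean pooling -> arousal/valence."""

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 2)  # two outputs: arousal and valence

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim), e.g. 16 frames per utterance
        lstm_out, _ = self.lstm(frame_features)
        pooled = lstm_out.mean(dim=1)          # average the LSTM outputs over time
        return torch.tanh(self.head(pooled))   # predictions bounded to [-1, 1]

# Example: SequenceRegressor()(torch.randn(4, 16, 512)) has shape (4, 2)
```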
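
ExCouple describe splitting each utterance into 1-second spectrogram windows at 16 kHz, predicting arousal and valence per window, and taking the median over windows. A hedged sketch of that pre- and post-processing, assuming the WAV files have already been extracted; `predict_window` stands in for a trained per-window model and is not part of the original description.

```python
import numpy as np
import librosa

def utterance_prediction(wav_path: str, predict_window) -> np.ndarray:
    """Predict (arousal, valence) per 1-second window at 16 kHz, then take the median."""
    audio, sr = librosa.load(wav_path, sr=16000)
    window = sr  # one second of samples
    preds = []
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        if len(chunk) < window:                     # zero-pad the trailing partial window
            chunk = np.pad(chunk, (0, window - len(chunk)))
        spectrogram = np.abs(librosa.stft(chunk))   # magnitude spectrogram as model input
        preds.append(predict_window(spectrogram))   # stand-in for the trained audio model
    return np.median(np.array(preds), axis=0)
```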
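
Finally, both the GammaLab ensembles and WCG-WZ's SMLR step combine several base regressors, SMLR doing so with a weighted average whose weights are the estimated accuracies of the base models. The sketch below shows only that final combination step; the spectral estimation of the weights is not reproduced, and `weights` stands in for whatever accuracy estimates it yields.

```python
import numpy as np

def weighted_ensemble(base_predictions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted average of base regressors.

    base_predictions: (num_models, num_samples) predictions of each base model.
    weights:          (num_models,) estimated accuracies of the base models.
    """
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ base_predictions  # (num_samples,) combined prediction
```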