Results 2018
Results of the 2018 OMG-Emotion Recognition Challenge
Date | Team | Submission | Repository | Paper | CCC Arousal | CCC Valence | Modality |
---|---|---|---|---|---|---|---|
05.2018 | ADSC | 1: We take only the video frames as input. As preprocessing, we apply MTCNN to crop and align faces, from which we randomly pick 16 images (in temporal order) per utterance as input. The base network SphereFace then outputs their features, followed by a bidirectional LSTM. Finally, after averaging the LSTM output features, an FC layer and a tanh layer regress the arousal-valence values. We use an MSE loss. | Link | Link | 0.244058303158 | 0.437829242686 | Vision |
05.2018 | ADSC | 2: We keep the same network as submission 1 for video and add another network for audio. As preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Four randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, an FC layer and a tanh layer regress the arousal-valence values. We use CCC as the joint training loss (see the CCC sketch after the table). | Link | Link | 0.236030982713 | 0.442048822313 | Vision + Audio |
05.2018 | audEERING | 1: Valence: deep video features from VGGface with a (B)LSTM for sequence prediction; samples are shuffled within the train and dev folds. Arousal: openSMILE features with a (B)LSTM for sequence prediction; samples are shuffled within the train and dev folds. | Link | Link | 0.2766972457 | 0.257974288965 | Arousal: Audio, Valence: Video |
05.2018 | audEERING | 2: We keep the same network as submission 1 for video and add another network for audio. As preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Four randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, an FC layer and a tanh layer regress the arousal-valence values. We use CCC as the joint training loss. | Link | Link | 0.286322843804 | 0.368516561137 | Arousal: Audio, Valence: Audio + Video |
05.2018 | audEERING | 3: We keep the same network as submission 1 for video and add another network for audio. As preprocessing, we extract the WAV file of every utterance and compute STFT maps every 10 ms. Four randomly chosen maps are fed to the base network, a modified VGG-16. After averaging the output features, we concatenate the audio feature with the video features. Finally, an FC layer and a tanh layer regress the arousal-valence values. We use CCC as the joint training loss. | Link | Link | 0.292541922664 | 0.361123256087 | Audio |
05.2018 | EMO-INESC | 1: The implemented methodology is an ensemble of several models from two distinct modalities, namely video and text. | Link | Link | 0.14402485704 | 0.344500137514 | Vision + Text |
05.2018 | ExCouple | 1: The proposed model covers the audio modality. All videos in the OMG-Emotion dataset are converted to WAV files. We use semi-supervised learning for emotion recognition: a GAN is trained with unsupervised learning on another database (IEMOCAP), and part of the GAN autoencoder is used for the audio representation. Audio spectrograms are extracted in 1-second windows at 16 kHz and serve as input to the audio representation model. This representation feeds a convolutional network and a dense layer with tanh activation that predicts the arousal and valence values. To combine the 1-second audio segments of each utterance, the median of the predicted values is taken. | Link | Link | 0.182249525233 | 0.211179832035 | Audio |
05.2018 | GammaLab | 1: Prediction with a single model that uses multi-modal features. | Link | Link | 0.345427888879 | 0.484332302684 | Vision + Audio |
05.2018 | GammaLab | 2: Prediction with an ensemble composed of two different single models. | Link | Link | 0.355749630594 | 0.496467636791 | Vision + Audio |
05.2018 | GammaLab | 3: Prediction with an ensemble composed of three different single models. | Link | Link | 0.361186170656 | 0.498808067516 | Vision + Audio |
05.2018 | HKUST-NISL2018 | 1: Early Fusion 0. | Link | | 0.276576867376 | 0.359358612774 | Vision + Audio + Text |
05.2018 | HKUST-NISL2018 | 2: Early Fusion 1. | Link | Link | 0.251271551784 | 0.283373526753 | Vision + Audio + Text |
05.2018 | HKUST-NISL2018 | 3: Early Fusion 2. | Link | Link | 0.206451070326 | 0.284609079282 | Vision + Audio + Text |
05.2018 | iBug | 1: We took the convolutional and pooling layers of VGG-FACE followed by a fully connected layer with 4096 units. Three RNNs (each with 2 hidden layers of 128 units) were stacked on top: (i) the output of the last convolutional layer of VGG-FACE was given as input to the first RNN, (ii) the output of the last pooling layer to the second RNN, and (iii) the output of the fully connected layer to the last RNN. The outputs of the RNNs were then concatenated and passed to the output layer that gave the final estimates. | Link | Link | 0.118797460723 | 0.389332186482 | Vision |
05.2018 | iBug | 2: We took the convolutional and pooling layers of VGG-FACE followed by a fully connected layer with 4096 units. The outputs of (i) the last convolutional layer of VGG-FACE, (ii) the last pooling layer of VGG-FACE, and (iii) the fully connected layer were concatenated and given as input to a 2-layer RNN (each layer with 128 units) stacked on top, followed by the output layer that gave the final estimates. | Link | Link | 0.123877229006 | 0.390609470508 | Vision |
05.2018 | iBug | 3: - | Link | Link | 0.130682478094 | 0.400427658916 | Vision |
05.2018 | Jakobs Xlab | 1: In the first step, we use the Facial Action Coding System (FACS), a fully standardized classification system that codes facial expressions based on anatomic features of human faces. With FACS, any facial expression can be decomposed into a combination of elementary components called Action Units (AUs). In particular, we use the automated software Emotient FACET, a computer vision program that provides frame-based estimates of the likelihood of 20 AUs. | Link | Link | 0.206481621848 | 0.335819579381 | Vision |
05.2018 | Jakobs Xlab | 2: In the second step, we leverage an Echo State Network, a variant of recurrent neural network, to learn the mapping between facial expressions and valence-arousal values. The hyperparameters of the network are tuned with 5-fold cross-validation. | Link | Link | 0.209107772472 | 0.346228315976 | Vision |
05.2018 | UMONS | 1: Monomodal feature extraction. | Link | Link | 0.143405696862 | 0.251250045128 | Vision + Audio + Text |
05.2018 | UMONS | 2: Contextual Multimodal. | Link | Link | 0.175848359811 | 0.262762020192 | Vision + Audio + Text |
05.2018 | WCG-WZ | 1: Our models include a CNN-Face model, a CNN-Visual model, an LSTM-Visual model, and an SVR-Audio model. CNN-Face/Visual model: we first extract the face from each video to construct a face video and use Xception to extract features from each frame. The features are averaged over frames and passed through a three-layer multi-layer perceptron (MLP) for regression; the hidden layer has 1024 nodes with ReLU activation, and the output layer has a single node with sigmoid activation for arousal and linear activation for valence. The Visual variant differs from the Face variant in that it uses raw video frames instead of the face video. LSTM-Visual model: for each utterance, we down-sample 20 frames uniformly in time (if an utterance has fewer than 20 frames, the first frame is repeated to make up 20) and use InceptionV3 to obtain a 20 × 2048 feature matrix. A multi-layer long short-term memory (LSTM) network then extracts the temporal information, followed by an MLP with 512 hidden nodes and one output node for regression. SVR-Audio model: we extract 76 features per video and use RReliefF for feature selection; the selected features are used to train an SVR. The extracted features are listed in the GitHub repository and the paper. SMLR: SMLR first uses a spectral approach to estimate the accuracies of the base regression models on the test set and then combines them with a weighted average (the weights are the accuracies of the base models) to obtain the final prediction. | Link | Link | 0.149274843823 | 0.161670034738 | Vision + Audio |
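
For reference, the concordance correlation coefficient (CCC) reported in the two score columns above, and used by several teams (e.g. ADSC submission 2) as a training objective, can be computed as below. This is a minimal NumPy sketch, not any team's implementation; the function name and the small stabilizing constant are our own choices.

```python
import numpy as np

def ccc(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D arrays."""
    pred_mean, gold_mean = predictions.mean(), labels.mean()
    pred_var, gold_var = predictions.var(), labels.var()
    covariance = np.mean((predictions - pred_mean) * (labels - gold_mean))
    return 2.0 * covariance / (pred_var + gold_var + (pred_mean - gold_mean) ** 2 + 1e-8)
```

Used as a loss, one typically minimizes `1 - ccc(predictions, labels)`, so that maximizing agreement with the annotations minimizes the loss.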
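
Several of the vision pipelines above (e.g. ADSC's SphereFace features with a bidirectional LSTM, or WCG-WZ's InceptionV3 features with an LSTM) follow the same pattern: a fixed number of per-frame CNN features goes through a recurrent layer, is pooled over time, and is regressed to arousal and valence through a tanh output. The PyTorch sketch below shows only that shared pattern under assumed sizes (feature dimension 512, hidden size 128); it is not any team's released code.

```python
import torch
import torch.nn as nn

class SequenceRegressor(nn.Module):
    """BiLSTM over pre-extracted per-frame features -> mean pooling -> arousal/valence."""

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 2)  # two outputs: arousal and valence

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim), e.g. 16 frames per utterance
        lstm_out, _ = self.lstm(frame_features)
        pooled = lstm_out.mean(dim=1)          # average the LSTM outputs over time
        return torch.tanh(self.head(pooled))   # predictions bounded to [-1, 1]

# Example: SequenceRegressor()(torch.randn(4, 16, 512)) has shape (4, 2)
```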
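
ExCouple describe splitting each utterance into 1-second spectrogram windows at 16 kHz, predicting arousal and valence per window, and taking the median over windows. A hedged sketch of that pre- and post-processing, assuming the WAV files have already been extracted; `predict_window` stands in for a trained per-window model and is not part of the original description.

```python
import numpy as np
import librosa

def utterance_prediction(wav_path: str, predict_window) -> np.ndarray:
    """Predict (arousal, valence) per 1-second window at 16 kHz, then take the median."""
    audio, sr = librosa.load(wav_path, sr=16000)
    window = sr  # one second of samples
    preds = []
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        if len(chunk) < window:                     # zero-pad the trailing partial window
            chunk = np.pad(chunk, (0, window - len(chunk)))
        spectrogram = np.abs(librosa.stft(chunk))   # magnitude spectrogram as model input
        preds.append(predict_window(spectrogram))   # stand-in for the trained audio model
    return np.median(np.array(preds), axis=0)
```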
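
Finally, both the GammaLab ensembles and WCG-WZ's SMLR step combine several base regressors, SMLR doing so with a weighted average whose weights are the estimated accuracies of the base models. The sketch below shows only that final combination step; the spectral estimation of the weights is not reproduced, and `weights` stands in for whatever accuracy estimates it yields.

```python
import numpy as np

def weighted_ensemble(base_predictions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted average of base regressors.

    base_predictions: (num_models, num_samples) predictions of each base model.
    weights:          (num_models,) estimated accuracies of the base models.
    """
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ base_predictions  # (num_samples,) combined prediction
```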