TOWARDS BUILDING GENERALIZABLE SPEECH EMOTION RECOGNITION MODELS
Abstract:
Detecting the mental state of a person has implications in psychiatry, medicine, psychology, and human-computer interaction systems, among others. It includes (but is not limited to) a wide variety of problems such as emotion detection, valence-affect-dominance state prediction, mood detection, and detection of clinical depression. In this thesis we focus primarily on emotion recognition. Like any recognition system, building an emotion recognition model consists of the following two steps:
- Extraction of meaningful features that would help in classification
- Development of an appropriate classifier
The non-invasive nature of speech and the ease with which it can be collected have made it a popular candidate for feature extraction. However, an ideal system should be agnostic to speaker and channel effects. While feature normalization schemes can counter these problems to some extent, we still see a drastic drop in performance when the training and test datasets are mismatched. In this dissertation we explore some novel ways towards building models that are more robust to speaker and domain differences.
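As one example of the feature normalization schemes mentioned above, the sketch below applies per-speaker z-normalization. The array shapes and the hypothetical `speaker_ids` grouping are illustrative assumptions; this is not the specific scheme used in the dissertation.

```python
import numpy as np

def speaker_z_normalize(features, speaker_ids):
    """Normalize each speaker's feature vectors to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)     # shape: (n_utterances, n_features), assumed
    speaker_ids = np.asarray(speaker_ids)
    normalized = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean = features[mask].mean(axis=0)
        std = features[mask].std(axis=0) + 1e-8      # avoid division by zero
        normalized[mask] = (features[mask] - mean) / std
    return normalized
```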
Training discriminative classifiers involves learning a conditional distribution p(y_i|x_i), given a set of feature vectors x_i and the corresponding labels y_i, i = 1, ..., N. For a classifier to be generalizable and not overfit to training data, the resulting conditional distribution p(y_i|x_i) is desired to be smoothly varying over the inputs x_i. Adversarial training procedures enforce this smoothness using manifold regularization techniques. Manifold regularization makes the model's output distribution more robust to local perturbations added to a datapoint x_i. In the first part of the dissertation, we investigate two training procedures: (i) adversarial training, where we determine the perturbation direction based on the given labels for the training data, and (ii) virtual adversarial training, where we determine the perturbation direction based only on the output distribution of the training data. We demonstrate the efficacy of adversarial training procedures by performing a k-fold cross-validation experiment on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and a cross-corpus performance analysis on three separate corpora. We compare their performance to that of models using other regularization schemes, such as L1/L2 regularization and a graph-based manifold regularization scheme. Results show improvement over a purely supervised approach, as well as better generalization capability in cross-corpus settings.
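To make the two training procedures concrete, the following is a minimal PyTorch sketch of the two perturbation-based losses: adversarial training, which picks the perturbation direction from the supervised loss gradient, and virtual adversarial training, which uses only the model's output distribution. The `model`, `x`, and `y` placeholders, the L2-normalized perturbation, and the hyperparameter values are illustrative assumptions, not the exact configuration used in the dissertation.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, x, y, epsilon=0.01):
    """Adversarial training: perturb x along the gradient of the labeled loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    x_adv = x + epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    return F.cross_entropy(model(x_adv.detach()), y)

def virtual_adversarial_loss(model, x, epsilon=0.01, xi=1e-6, n_power=1):
    """Virtual adversarial training: the perturbation direction is found from
    the model's own output distribution, so no labels are needed."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)
    d = torch.randn_like(x)
    for _ in range(n_power):  # power iteration approximating the most sensitive direction
        d = xi * d / (d.norm(dim=-1, keepdim=True) + 1e-12)
        d.requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x + d), dim=-1), p, reduction="batchmean")
        d, = torch.autograd.grad(kl, d)
    r_vadv = epsilon * d / (d.norm(dim=-1, keepdim=True) + 1e-12)
    return F.kl_div(F.log_softmax(model(x + r_vadv), dim=-1), p, reduction="batchmean")
```

In practice either loss would be added, with a weighting coefficient, to the standard supervised cross-entropy loss during training.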
Our second approach to better discriminate between emotions leverages multi-modal learning and automatic speech recognition (ASR) systems toward improving the generalizability of an emotion recognition model that requires only speech as input. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence level. Text analysis has been shown to be helpful for sentiment classification. We compared classification accuracies obtained from an audio-only model, a text-only model, and a multi-modal system leveraging both by performing a cross-validation analysis on the IEMOCAP dataset. Confusion matrices show that it is valence-level detection that improves when textual information is incorporated. In the second stage of experiments, we used three ASR application programming interfaces (APIs) to obtain the transcriptions. We compare the performance of the multi-modal systems using the ASR transcriptions with each other and with that of a system using ground-truth transcriptions. This is followed by a cross-corpus study.
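As an illustration of the kind of multi-modal system described above, here is a minimal feature-level fusion sketch in PyTorch. The feature dimensions, the use of a fixed-size text embedding for the transcript, and the simple concatenation-based fusion are assumptions for illustration; the dissertation's actual architecture may differ.

```python
import torch
import torch.nn as nn

class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, acoustic_dim=384, text_dim=300, hidden=128, n_classes=4):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(acoustic_dim, hidden), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Concatenate the two modality embeddings before the final classifier.
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, acoustic_feats, text_feats):
        a = self.audio_branch(acoustic_feats)
        t = self.text_branch(text_feats)   # e.g. averaged word embeddings of an ASR transcript
        return self.classifier(torch.cat([a, t], dim=-1))
```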
In the third part of the study we investigate the generalizability of generative adversarial network (GAN) based models. GANs have gained a lot of attention from the machine learning community due to their ability to learn and mimic an input data distribution. GANs consist of a discriminator and a generator working in tandem, playing a min-max game to learn a target underlying data distribution when fed with data points sampled from a simpler distribution (such as a uniform or Gaussian distribution). Once trained, they allow synthetic generation of examples sampled from the target distribution. We investigate the applicability of GANs for obtaining lower-dimensional representations from the higher-dimensional feature vectors pertinent to emotion recognition. We also investigate their ability to generate synthetic higher-dimensional feature vectors using points sampled from a lower-dimensional prior. Specifically, we investigate two setups: (i) when the lower-dimensional prior from which synthetic feature vectors are generated is pre-defined, and (ii) when the distribution of the lower-dimensional prior is learned from training data. We define the metrics used to measure and analyze the performance of these generative models in different train/test conditions. We perform a cross-validation analysis followed by a cross-corpus study.
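A hedged sketch of setup (i) is given below: a generator maps samples from a pre-defined low-dimensional Gaussian prior to synthetic high-dimensional feature vectors, while a discriminator learns to tell real feature vectors from generated ones. The layer sizes and dimensions are illustrative assumptions; in setup (ii) the prior would instead be learned from the training data.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps low-dimensional prior samples z to synthetic feature vectors."""
    def __init__(self, z_dim=16, feat_dim=384, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Outputs a real/fake logit for a (real or generated) feature vector."""
    def __init__(self, feat_dim=384, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

# Min-max game: the discriminator maximizes log D(x) + log(1 - D(G(z))),
# while the generator minimizes log(1 - D(G(z))).
# Example: sample synthetic feature vectors from the pre-defined prior.
G = Generator()
z = torch.randn(8, 16)        # points sampled from the Gaussian prior
fake_features = G(z)          # synthetic 384-dimensional feature vectors
```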
Finally, we make an attempt towards understanding the relation between two different sub-problems encompassed under mental state detection, namely depression detection and emotion recognition. We propose approaches that can be investigated to build better depression detection models by leveraging our ability to recognize emotions accurately.