Generalizable Depression Detection and Severity Prediction Using Articulatory Representations of Speech

Files

Seneviratne_umd_0117E_22339.pdf (6.36 MB)
(RESTRICTED ACCESS)

Publication or External Link

Date

2022

Citation

Abstract

Major Depressive Disorder (MDD) is a mental health disorder that has taken a massive toll on society, both socially and financially. Timely diagnosis of MDD is crucial to minimize serious consequences such as suicide. Hence, automated solutions that can reliably detect and predict the severity of MDD can play a pivotal role in helping healthcare professionals provide timely treatment. MDD is known to affect speech. Leveraging the changes in speech characteristics that occur with depression, many vocal biomarkers are being developed to detect the disorder. However, changes in articulatory coordination associated with depression remain under-explored. Speech articulation is a complex activity that requires finely timed coordination across articulators. In a depressed state involving psychomotor slowing, this coordination changes and in turn modifies the perceived speech signal.

In this work, we use a direct representation of articulation known as vocal tract variables (TVs) to capture the coordination between articulatory gestures. TVs define the constriction degree and location of the articulators (tongue, jaw, lips, velum, and glottis). Previously, the correlation structure of formants or mel-frequency cepstral coefficients (MFCCs) was used as a proxy for the underlying articulatory coordination. We compute articulatory coordination features (ACFs), which capture the correlations among time-series channels at different time delays and are therefore rich in information about the underlying coordination of speech production. Using the rank-ordered eigenspectra obtained from TV-based ACFs, we show that depressed speech exhibits simpler coordination relative to the speech of the same subjects when in remission, which is in line with previous findings. In a preliminary study using a small subset of speech from subjects who transitioned from being severely depressed to being in remission, we show that TV-based ACFs outperform formant-based ACFs in binary depression classification. We also show that depressed speech has reduced variability, in the form of reduced coarticulation and undershoot. To validate this, we present a comprehensive acoustic analysis and the results of a speech-in-noise perception study comparing the intelligibility of depressed and non-depressed speech. Our results indicate that depressed speech is at least as intelligible as non-depressed speech.
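The following is a minimal sketch of how delay-based ACFs and their rank-ordered eigenspectra can be computed from a multichannel TV signal. It assumes the TVs arrive as a (channels x frames) array at a fixed frame rate; the delay scales, channel count, and normalization used in the dissertation may differ, so treat the numbers here as illustrative.

```python
import numpy as np

def channel_delay_matrix(signals, delays):
    """Stack time-delayed copies of each channel into one matrix.

    signals: array of shape (n_channels, n_frames), e.g. TV trajectories.
    delays: iterable of frame delays (illustrative values below).
    """
    n_channels, n_frames = signals.shape
    usable = n_frames - max(delays)
    rows = [signals[c, d:d + usable] for c in range(n_channels) for d in delays]
    return np.vstack(rows)

def acf_eigenspectrum(signals, delays):
    """Correlation matrix of the channel-delay matrix and its
    rank-ordered (descending) eigenspectrum."""
    delayed = channel_delay_matrix(signals, delays)
    corr = np.corrcoef(delayed)               # (C*D) x (C*D) correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # largest eigenvalue first
    return corr, eigvals

# Example: 6 TV channels, 500 frames of synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tvs = rng.standard_normal((6, 500))
    corr, spectrum = acf_eigenspectrum(tvs, delays=range(0, 50, 5))
    print(corr.shape, spectrum[:5])
```

A flatter eigenspectrum indicates more independent dimensions of variation, while a spectrum dominated by a few large eigenvalues corresponds to the simpler coordination reported for depressed speech.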

The next stage of our work focuses on developing deep learning based models that use TV-based ACFs to detect depression, and attempts to overcome the limitations of existing work. We combine two speech depression databases with different characteristics, which helps increase generalizability, a key objective of this research. Moreover, we segment audio recordings prior to feature extraction to obtain the data volume required to train deep neural networks. We reduce the dimensionality of conventional stacked ACFs spanning multiple delay scales by using refined ACFs, carefully curated to remove redundancies, and by exploiting the strengths of dilated Convolutional Neural Networks (CNNs). We show that models trained on TV-based ACFs are more generalizable than their proxy counterparts. We then develop a multi-stage convolutional recurrent neural network that performs classification at the session level, and derive the constraints under which this segment-to-session approach can boost classification performance. We extend our models to perform depression severity level classification, where TV-based ACFs again outperform the other feature sets.
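As a rough illustration of the two ideas in this stage, the sketch below pairs a dilated 1-D CNN over segment-level ACF inputs with a simple segment-to-session aggregation rule. The layer sizes, input dimensions, and averaging rule are assumptions for illustration only; the dissertation's multi-stage convolutional recurrent architecture is more elaborate.

```python
import torch
import torch.nn as nn

class DilatedACFClassifier(nn.Module):
    """Sketch: dilated CNN over segment-level ACF features.

    Input per segment: (batch, n_acf_channels, n_delays). Sizes are
    hypothetical placeholders, not the dissertation's exact design.
    """
    def __init__(self, in_channels=6, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # pool over the delay axis
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        h = self.features(x).squeeze(-1)    # (batch, 32)
        return self.classifier(h)           # segment-level logits

def session_prediction(segment_logits):
    """Segment-to-session aggregation: average segment posteriors,
    then take the argmax as the session-level decision."""
    probs = torch.softmax(segment_logits, dim=-1)
    return probs.mean(dim=0).argmax().item()

# Example: 10 segments from one session, 6 ACF channels, 40 delays each
model = DilatedACFClassifier()
segments = torch.randn(10, 6, 40)
print(session_prediction(model(segments)))
```

Dilated convolutions widen the receptive field over the delay axis without adding parameters, which is one way to keep the refined, lower-dimensional ACF inputs compact.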

Language patterns and semantics can also reveal vital information about a person's mental state. We develop a multimodal depression classifier that combines TV-based ACFs with hierarchical attention based text embeddings. The fusion strategy of the proposed architecture allows data from each modality to be segmented independently (overlapping segments for audio, sentences for text), in the way best suited to that modality, when performing segment-to-session classification. The multimodal classifier clearly outperforms the unimodal classifiers. Finally, we develop a multimodal system to predict the depression severity score, a more challenging regression problem due to the quasi-numerical nature of the scores. The multimodal regressor achieves the lowest root mean squared error, demonstrating the synergy of combining modalities such as audio and text. We conclude with an exhaustive error analysis that reveals potential improvements for future work.
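A minimal late-fusion sketch of the severity regressor is shown below, assuming each modality has already been encoded into segment-level embeddings (overlapping audio segments on one side, sentences on the other). The embedding dimensions, pooling choice, and head sizes are illustrative assumptions rather than the dissertation's exact fusion architecture.

```python
import torch
import torch.nn as nn

class LateFusionRegressor(nn.Module):
    """Sketch: segment-to-session late fusion for severity prediction.

    Each modality is segmented and pooled independently into a session
    embedding; the concatenated embeddings feed a regression head.
    """
    def __init__(self, audio_dim=128, text_dim=256, hidden=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, audio_segments, text_sentences):
        # Pool each modality over its own segmentation of the session.
        a = torch.relu(self.audio_proj(audio_segments)).mean(dim=0)
        t = torch.relu(self.text_proj(text_sentences)).mean(dim=0)
        return self.head(torch.cat([a, t], dim=-1))  # predicted severity score

# Example: 12 audio-segment embeddings and 8 sentence embeddings per session
model = LateFusionRegressor()
score = model(torch.randn(12, 128), torch.randn(8, 256))
print(score.item())
```

Pooling each modality separately is what lets audio and text use different segmentations of the same session before fusion.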

The work in this dissertation takes a step towards the betterment of humanity by developing technologies that improve the performance of speech-based depression assessment, utilizing the strengths of ACFs derived from direct articulatory representations.

Notes

Rights