TOWARDS EXTENDING ACOUSTIC-TO-ARTICULATORY SPEECH INVERSION AND LEARNING ARTICULATORY REPRESENTATIONS
Abstract
Acoustic-to-articulatory speech inversion involves the challenging task of deducing, from the acoustic signal, the kinematic state of various constriction synergies, including the lips, tongue tip, tongue body, velum, and glottis, described by their respective constriction degree and location coordinates. These coordinates are referred to as vocal tract variables (TVs). The development of Speech Inversion (SI) systems has gained attention in recent years, mainly due to their potential in a wide range of speech applications such as Automatic Speech Recognition (ASR), speech synthesis, speech therapy, and mental health assessment.
Over the past few years, models based on deep neural networks (DNNs) have propelled the development of SI systems to new heights. However, current SI systems still struggle with the lack of sufficiently large articulatory datasets, speaker dependence, poor performance on noisy speech, and a lack of generalizability across different articulatory datasets. Moreover, one of the major drawbacks of the existing articulatory datasets is the lack of ground-truth data capturing the velar and glottal activity of speech. With this work, we try to address some of the aforementioned challenges pertaining to the development of effective SI systems. Our experiments are based on two publicly available articulatory datasets: the University of Wisconsin X-ray microbeam (XRMB) dataset and the HPRC dataset. We show that using appropriate audio augmentation techniques to synthetically create data can further improve the performance of SI systems on both clean and noisy speech. We also show that using multi-task learning frameworks to carry out an auxiliary but related task can improve TV prediction. A key improvement came about when the SI systems were forced to learn source features (aperiodicity, periodicity, and pitch) as additional targets. Moreover, using self-supervised speech representations (HuBERT) and fine-tuning them on the downstream task of speech inversion resulted in improved performance.
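The abstract does not say which augmentation techniques were used. As one illustration, a common way to create noisy training copies of clean speech is to add a noise recording scaled to a chosen signal-to-noise ratio (SNR). The sketch below assumes that approach; the function name `mix_at_snr`, the synthetic sine "speech", and the 10 dB target are hypothetical choices made for the example, not details from the dissertation.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power down to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Synthetic stand-ins for a speech utterance and a shorter noise clip (assumed data).
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 220 * t)
noise = rng.standard_normal(sample_rate // 2)

noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Check the SNR that was actually achieved.
residual = noisy - clean
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

Because additive noise does not change the timing of the utterance, the articulatory targets paired with the clean recording can be reused unchanged for the noisy copy.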
With the aim of extending current SI systems to estimate velar and glottal activity, data from an ongoing data collection effort was used to derive and validate two parameters: nasalance, to capture velar constriction degree, and the electroglottography (EGG) envelope, to capture voicing. A separate speaker-independent SI system was subsequently trained to estimate the derived parameters and is one of the first systems to do so. This SI system, along with the conventional SI systems (trained to estimate lip and tongue TVs), provides a framework to estimate a complete articulatory representation of speech in a speaker-independent fashion.
While improving and extending the current SI frameworks, we also explored an unsupervised learning algorithm, inspired by sensorimotor interactions in the human brain, to perform audio and speech inversion. The proposed “MirrorNet”, a constrained autoencoder architecture, is first used to learn, in an unsupervised manner, the controls of an off-the-shelf audio synthesizer (DIVA) to produce melodies from their auditory spectrograms alone. The results demonstrate how the MirrorNet discovers the synthesizer parameters needed to generate melodies that closely resemble the originals, including unseen melodies, and even determines the best set of parameters to approximate renditions of complex piano melodies generated by a different synthesizer. To extend the same idea to learning vocal tract controls for speech, we developed a DNN-based articulatory synthesizer (articulatory-to-acoustic forward mapping) to be incorporated as the motor plant of the MirrorNet. The MirrorNet with this motor plant, once initialized with a minimal amount of ground-truth data (~30 minutes of speech), can learn the articulatory representations (6 TVs + source features) with significantly better accuracy. Overall, this highlights the effectiveness and power of the MirrorNet’s learning algorithm in solving the conventional acoustic-to-articulatory speech inversion problem with minimal use of ground-truth articulatory data.
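The idea of an autoencoder constrained by a synthesizer can be illustrated with a deliberately small toy. This is not the MirrorNet architecture, which the abstract does not detail. The sketch assumes the synthesizer is a known, fixed linear map (`synth`) and trains only a linear encoder on reconstruction error. The hidden controls are never shown to the training loop, yet the encoder recovers them because the only way to reconstruct the input is to pass the right controls through the fixed synthesizer. All names, dimensions, and the learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_controls, n_features, n_examples = 3, 8, 500

# Hypothetical "synthesizer": a fixed linear map with orthonormal columns.
synth, _ = np.linalg.qr(rng.standard_normal((n_features, n_controls)))

# Hidden controls generate the observed "spectra"; training never sees them.
true_controls = rng.standard_normal((n_controls, n_examples))
spectra = synth @ true_controls

# Linear encoder trained by plain gradient descent on reconstruction error.
encoder = np.zeros((n_controls, n_features))
learning_rate = 0.1
for _ in range(300):
    reconstruction = synth @ (encoder @ spectra)
    error = reconstruction - spectra
    # Gradient of the mean squared reconstruction error w.r.t. the encoder weights.
    grad = 2.0 * synth.T @ error @ spectra.T / n_examples
    encoder -= learning_rate * grad

# Compare the encoder output with the controls that were hidden during training.
estimated_controls = encoder @ spectra
max_control_error = np.max(np.abs(estimated_controls - true_controls))
```

In this toy the synthesizer is invertible on its outputs, so recovery is exact. A real synthesizer or vocal tract model is nonlinear, which is consistent with the abstract's note that the speech version is first initialized with a small amount of ground-truth data.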
To assess the practical utility of articulatory representations in real-world scenarios, we employed articulatory coordination features derived from TVs to detect and analyze articulatory-level alterations in the speech of individuals with schizophrenia. We show that schizophrenia subjects who have strong positive symptoms (e.g. hallucinations and delusions) and are markedly ill exhibit a more complex articulatory coordination pattern in facial and speech gestures than healthy controls. This distinction in coordination pattern is used to train a multimodal convolutional neural network (CNN) that uses video and audio data to distinguish schizophrenia subjects from healthy controls. Furthermore, we used TVs estimated by the best-performing SI system to detect mispronunciation of /ɹ/, a common speech sound disorder in children. The classification model trained with TVs outperformed the state-of-the-art hand-crafted, age-and-sex normalized formants.
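The abstract does not define how the articulatory coordination features are computed. One construction that could serve this purpose, assumed here purely for illustration, is to stack time-delayed copies of each TV channel, compute the correlation matrix across all channel-delay pairs, and summarize it by its eigenvalue spectrum. The function name, the delay values, and the random stand-in data below are hypothetical.

```python
import numpy as np

def delay_embedded_eigenspectrum(tvs, delays=(0, 2, 4)):
    """Correlate time-delayed copies of each channel and return (corr, eigenvalues)."""
    tvs = np.asarray(tvs, dtype=float)
    n_channels, n_frames = tvs.shape
    usable = n_frames - max(delays)
    # One row per (channel, delay) pair, all trimmed to a common length.
    rows = [tvs[c, d:d + usable] for c in range(n_channels) for d in delays]
    stacked = np.vstack(rows)
    corr = np.corrcoef(stacked)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(corr)[::-1]
    return corr, eigvals

# Random stand-in for 6 TV trajectories over 300 frames (assumed data).
rng = np.random.default_rng(0)
tvs = rng.standard_normal((6, 300))
corr, eigvals = delay_embedded_eigenspectrum(tvs)
```

With 6 channels and 3 delays the correlation matrix is 18 by 18, and its eigenvalues sum to 18 because the diagonal of a correlation matrix is all ones. How that total is spread across the eigenvalues indicates how many independent modes of movement are present.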
In essence, the work in this dissertation presents steps taken towards developing effective acoustic-to-articulatory speech inversion frameworks, and highlights the importance of utilizing articulatory representations in real-world applications.