Demystifying Monaural Speech Segregation Models

dc.contributor.advisor: Shamma, Shihab A
dc.contributor.author: Parikh, Rahil
dc.contributor.department: Electrical Engineering
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2022-06-15T05:48:05Z
dc.date.available: 2022-06-15T05:48:05Z
dc.date.issued: 2022
dc.description.abstract: The ‘cocktail party problem’ is the task of attending to a source of interest, usually speech, in a complex acoustic environment with concurrent sounds. Despite the apparent ease with which humans group acoustic cues from such an environment and organize them into meaningful percepts, the complexity of this problem has inspired generations of neuroscientists, psychologists, and engineers to develop multidisciplinary solutions, ranging from biologically inspired frameworks to strictly engineering approaches.

In this dissertation we first explore the biologically plausible ‘Temporal Coherence’ algorithm, which performs monaural source segregation based on the timing cues of each speaker. This approach integrates biologically plausible feature extraction and hypotheses of sound-object perception with current trends in deep learning, and it performs speech segregation and de-noising in an unsupervised, online fashion. Our findings suggest that this framework is suitable for de-noising applications but, in its current setting, is unreliable for segregating mixtures of speech.

We then turn to recent advances in deep learning, which have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles these networks learn in order to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art Deep Neural Network (DNN) based models, Conv-TasNet and DPT-Net. We evaluate their performance on mixtures of natural speech versus slightly manipulated inharmonic speech, in which the harmonics are slightly frequency-jittered. We find that performance deteriorates significantly if one source is even slightly harmonically jittered; for example, an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity and instead results in worse performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that these DNN algorithms deviate markedly from the biologically inspired Temporal Coherence algorithm.

Knowing that harmonicity is a critical cue these networks use to group sources, we then perform a thorough investigation of Conv-TasNet and DPT-Net to analyze how they carry out a harmonic analysis of the input mixture. We conduct ablation studies in which we apply low-pass, high-pass, and band-stop filters of varying pass-bands to empirically determine which harmonics are most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities into synthetic mixtures. We find that end-to-end networks are highly unstable and perform poorly when confronted with deformations that are imperceptible to humans. Replacing the encoder in these networks with a spectrogram leads to lower overall performance but much higher stability. This work helps us understand what information these networks rely on for speech segregation and exposes two sources of generalization errors. It also pinpoints the encoder as the part of the network responsible for these errors, allowing for a redesign informed by expert knowledge or transfer learning. The work in this dissertation helps demystify end-to-end speech segregation networks and takes a step towards solving the cocktail party problem.
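
To make the abstract's central manipulation concrete, the sketch below builds a harmonic complex and an inharmonic counterpart whose partials are jittered by a few percent, mirroring the kind of perturbation described above. It is a simplified illustration rather than the dissertation's speech-resynthesis pipeline: numpy, the harmonic_complex helper, and all parameter values are assumptions chosen for clarity.

    # Illustrative only: synthesize a harmonic tone and a jittered ("inharmonic")
    # counterpart; the dissertation manipulates resynthesized speech, not pure tones.
    import numpy as np

    def harmonic_complex(f0, n_harmonics, sr, duration, jitter=0.0, seed=0):
        """Sum of sinusoids near k*f0; each partial is shifted by up to +/- jitter."""
        rng = np.random.default_rng(seed)
        t = np.arange(int(sr * duration)) / sr
        x = np.zeros_like(t)
        for k in range(1, n_harmonics + 1):
            # jitter=0.03 corresponds to the 3% perturbation level cited above
            fk = k * f0 * (1.0 + rng.uniform(-jitter, jitter))
            x += np.cos(2 * np.pi * fk * t) / k  # 1/k roll-off, loosely speech-like
        return x / np.max(np.abs(x))

    sr = 16000
    natural = harmonic_complex(f0=120.0, n_harmonics=30, sr=sr, duration=1.0)
    inharmonic = harmonic_complex(f0=120.0, n_harmonics=30, sr=sr, duration=1.0, jitter=0.03)

Mixtures built from natural versus jittered sources of this kind would then be scored with a separation metric such as SI-SDR to quantify the degradation the abstract reports.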
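
The filter-ablation probes can be sketched in the same spirit: notch out one frequency band at a time and re-score the model on the ablated mixture. The snippet below is a minimal sketch assuming scipy; the band edges, the random placeholder mixture, and the band_stop helper are hypothetical, not the dissertation's actual stimuli or code.

    # Hypothetical band-stop sweep for a filter-ablation study (not the original code).
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def band_stop(x, sr, lo_hz, hi_hz, order=6):
        """Attenuate the [lo_hz, hi_hz] band of signal x sampled at sr."""
        sos = butter(order, [lo_hz, hi_hz], btype="bandstop", fs=sr, output="sos")
        return sosfiltfilt(sos, x)

    sr = 16000
    mixture = np.random.randn(sr)  # placeholder for a two-speaker mixture
    # Each ablated mixture would be passed through the separation model and the
    # drop in separation quality recorded to locate the most critical harmonics.
    ablated = {(lo, hi): band_stop(mixture, sr, lo, hi)
               for (lo, hi) in [(80, 300), (300, 600), (600, 1200), (1200, 2400)]}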
dc.identifier: https://doi.org/10.13016/c3qv-divg
dc.identifier.uri: http://hdl.handle.net/1903/28808
dc.language.iso: en
dc.subject.pqcontrolled: Electrical engineering
dc.subject.pqcontrolled: Computer science
dc.subject.pqcontrolled: Artificial intelligence
dc.subject.pquncontrolled: Adversarial Inputs
dc.subject.pquncontrolled: Cocktail Party Problem
dc.subject.pquncontrolled: Computational Auditory Scene Analysis
dc.subject.pquncontrolled: Speech Segregation
dc.subject.pquncontrolled: Temporal Coherence
dc.title: Demystifying Monaural Speech Segregation Models
dc.type: Thesis

Files

Original bundle
Name: Parikh_umd_0117N_22514.pdf
Size: 19.66 MB
Format: Adobe Portable Document Format