Demystifying Monaural Speech Segregation Models

dc.contributor.advisor: Shamma, Shihab A
dc.contributor.author: Parikh, Rahil
dc.contributor.department: Electrical Engineering
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2022-06-15T05:48:05Z
dc.date.available: 2022-06-15T05:48:05Z
dc.date.issued: 2022
dc.description.abstract: The ‘cocktail party problem’ is the task of attending to a source of interest, usually speech, in a complex acoustic environment with concurrent sounds. Despite the apparent ease with which humans group acoustic cues from such an environment and organize them into meaningful percepts, the complexity of this problem has inspired generations of neuroscientists, psychologists, and engineers to develop multidisciplinary solutions, ranging from biologically inspired frameworks to strictly engineering approaches.

In this dissertation we first explore the biologically plausible ‘Temporal Coherence’ algorithm, which performs monaural source segregation based on the timing cues of each speaker. This approach integrates biologically plausible feature extraction and hypotheses of sound-object perception with current trends in deep learning, and it performs speech segregation and de-noising in an unsupervised, online fashion. Our findings suggest that this framework is suitable for de-noising applications but, in its current setting, is unreliable for segregating mixtures of speech.

We then turn to recent advances in deep learning, which have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles these networks learn in order to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art Deep Neural Network (DNN) based models, Conv-TasNet and DPT-Net. We evaluate their performance on mixtures of natural speech versus slightly manipulated inharmonic speech, in which the harmonics are slightly frequency-jittered. We find that performance deteriorates significantly if one source is even slightly harmonically jittered; for example, an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity and instead results in worse performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that these DNN algorithms deviate markedly from the biologically inspired Temporal Coherence algorithm.

Knowing that harmonicity is a critical cue these networks use to group sources, we then perform a thorough investigation of Conv-TasNet and DPT-Net to analyze how they carry out a harmonic analysis of the input mixture. We conduct ablation studies in which we apply low-pass, high-pass, and band-stop filters of varying pass-bands to empirically determine which harmonics are most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities into synthetic mixtures. We find that end-to-end networks are highly unstable and perform poorly when confronted with deformations that are imperceptible to humans. Replacing the encoder in these networks with a spectrogram leads to lower overall performance but much higher stability. This work helps us understand what information these networks rely on for speech segregation and exposes two sources of generalization errors. It also pinpoints the encoder as the part of the network responsible for these errors, allowing for a redesign informed by expert knowledge or transfer learning. The work in this dissertation helps demystify end-to-end speech segregation networks and takes a step towards solving the cocktail party problem.
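
To make the abstract's central manipulation concrete, the sketch below builds a harmonic complex and an inharmonic counterpart whose partials are jittered by a few percent, mirroring the kind of perturbation described above. It is a simplified illustration rather than the dissertation's speech-resynthesis pipeline: numpy, the harmonic_complex helper, and all parameter values are assumptions chosen for clarity.

    # Illustrative only: synthesize a harmonic tone and a jittered ("inharmonic")
    # counterpart; the dissertation manipulates resynthesized speech, not pure tones.
    import numpy as np

    def harmonic_complex(f0, n_harmonics, sr, duration, jitter=0.0, seed=0):
        """Sum of sinusoids near k*f0; each partial is shifted by up to +/- jitter."""
        rng = np.random.default_rng(seed)
        t = np.arange(int(sr * duration)) / sr
        x = np.zeros_like(t)
        for k in range(1, n_harmonics + 1):
            # jitter=0.03 corresponds to the 3% perturbation level cited above
            fk = k * f0 * (1.0 + rng.uniform(-jitter, jitter))
            x += np.cos(2 * np.pi * fk * t) / k  # 1/k roll-off, loosely speech-like
        return x / np.max(np.abs(x))

    sr = 16000
    natural = harmonic_complex(f0=120.0, n_harmonics=30, sr=sr, duration=1.0)
    inharmonic = harmonic_complex(f0=120.0, n_harmonics=30, sr=sr, duration=1.0, jitter=0.03)

Mixtures built from natural versus jittered sources of this kind would then be scored with a separation metric such as SI-SDR to quantify the degradation the abstract reports.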
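
The filter-ablation probes can be sketched in the same spirit: notch out one frequency band at a time and re-score the model on the ablated mixture. The snippet below is a minimal sketch assuming scipy; the band edges, the random placeholder mixture, and the band_stop helper are hypothetical, not the dissertation's actual stimuli or code.

    # Hypothetical band-stop sweep for a filter-ablation study (not the original code).
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def band_stop(x, sr, lo_hz, hi_hz, order=6):
        """Attenuate the [lo_hz, hi_hz] band of signal x sampled at sr."""
        sos = butter(order, [lo_hz, hi_hz], btype="bandstop", fs=sr, output="sos")
        return sosfiltfilt(sos, x)

    sr = 16000
    mixture = np.random.randn(sr)  # placeholder for a two-speaker mixture
    # Each ablated mixture would be passed through the separation model and the
    # drop in separation quality recorded to locate the most critical harmonics.
    ablated = {(lo, hi): band_stop(mixture, sr, lo, hi)
               for (lo, hi) in [(80, 300), (300, 600), (600, 1200), (1200, 2400)]}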
dc.identifier: https://doi.org/10.13016/c3qv-divg
dc.identifier.uri: http://hdl.handle.net/1903/28808
dc.language.iso: en
dc.subject.pqcontrolled: Electrical engineering
dc.subject.pqcontrolled: Computer science
dc.subject.pqcontrolled: Artificial intelligence
dc.subject.pquncontrolled: Adversarial Inputs
dc.subject.pquncontrolled: Cocktail Party Problem
dc.subject.pquncontrolled: Computational Auditory Scene Analysis
dc.subject.pquncontrolled: Speech Segregation
dc.subject.pquncontrolled: Temporal Coherence
dc.title: Demystifying Monaural Speech Segregation Models
dc.type: Thesis

Files

Original bundle
Name: Parikh_umd_0117N_22514.pdf
Size: 19.66 MB
Format: Adobe Portable Document Format