Demystifying Monaural Speech Segregation Models

The ‘cocktail party problem’ is the task of attending to a source of interest, usually speech, in a complex acoustic environment with concurrent sounds. Despite the apparent ease with which humans group the acoustic cues in such an environment and organize them into meaningful percepts, the complexity of this problem has inspired generations of neuroscientists, psychologists, and engineers to develop multidisciplinary solutions, ranging from biologically-inspired frameworks to strictly engineering approaches.

In this dissertation, we first explore the biologically plausible ‘Temporal Coherence’ algorithm, which performs monaural source segregation based on the timing cues of each speaker. This approach integrates biologically plausible feature extraction and hypotheses of sound-object perception with current trends in deep learning, and it focuses on speech segregation and de-noising in an unsupervised, online fashion. Our findings suggest that this framework is suitable for de-noising applications but, in its current setting, is unreliable for segregating mixtures of speech.
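To give a rough intuition for the temporal-coherence principle (this is only a minimal sketch of the underlying idea, not the dissertation's algorithm; all names are illustrative): feature channels whose envelopes rise and fall together over time are attributed to the same source, which can be captured by a pairwise correlation matrix over channel envelopes.

```python
import numpy as np

def coherence_matrix(envelopes):
    """Pairwise temporal correlation of channel envelopes.

    envelopes: array of shape (n_channels, n_frames).
    Returns an (n_channels, n_channels) matrix; entries near 1 indicate
    channels that co-fluctuate and would be grouped to one source.
    """
    # Remove each channel's mean, then normalize to unit energy.
    E = envelopes - envelopes.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against silent channels
    E = E / norms
    return E @ E.T

# Toy example: channels 0 and 1 share a 4 Hz modulator, channel 2 does not.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
mod = 0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)
env = np.vstack([mod, 2.0 * mod, np.abs(np.sin(2 * np.pi * 7 * t))])
C = coherence_matrix(env)
```

Here `C[0, 1]` is close to 1 (coherent channels, grouped together) while `C[0, 2]` is near 0, which is the kind of contrast a temporal-coherence model exploits to bind channels into a single perceived source.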

We then turn to recent advances in deep learning, which have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles these networks learn in order to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art Deep Neural Network (DNN) based models: Conv-TasNet and DPT-Net. We evaluate their performance on mixtures of natural speech versus inharmonic speech in which the harmonics are slightly frequency-jittered. We find that performance deteriorates significantly if even one source is slightly jittered; e.g., an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity and instead worsens performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that these DNN algorithms deviate markedly from the biologically inspired Temporal Coherence algorithm.
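The dissertation's exact jittering procedure is not reproduced here, but the idea of frequency-jittering harmonics can be sketched on a synthetic harmonic complex (the function and its parameters are illustrative, not the actual stimulus-generation code):

```python
import numpy as np

def harmonic_complex(f0, n_harmonics, duration, sr, jitter=0.0, rng=None):
    """Synthesize a harmonic complex tone with fundamental f0 (Hz).

    With jitter > 0, each harmonic k*f0 is displaced by a random offset
    of up to +/- jitter * f0, producing an inharmonic variant.
    """
    rng = rng or np.random.default_rng(0)
    t = np.arange(int(duration * sr)) / sr
    x = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        # Jittered harmonic frequency: k*f0 plus a small random shift.
        fk = k * f0 + jitter * f0 * rng.uniform(-1.0, 1.0)
        x += np.sin(2 * np.pi * fk * t)
    return x / n_harmonics  # keep amplitude roughly in [-1, 1]

sr = 16000
natural = harmonic_complex(200.0, 10, 0.5, sr)               # harmonic
inharmonic = harmonic_complex(200.0, 10, 0.5, sr, jitter=0.03)  # 3% jitter
```

A 3% jitter of this kind is barely audible to listeners, yet, as reported above, mixtures containing such a source collapse the separation performance of Conv-TasNet.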

Knowing that harmonicity is a critical cue these networks use to group sources, we then perform a thorough investigation of Conv-TasNet and DPT-Net to analyze how they carry out a harmonic analysis of the input mixture. We perform ablation studies, applying low-pass, high-pass, and band-stop filters of varying pass-bands to empirically identify the harmonics most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities into synthetic mixtures. We find that end-to-end networks are highly unstable and perform poorly when confronted with deformations that are imperceptible to humans. Replacing the encoder in these networks with a spectrogram leads to lower overall performance but much higher stability. This work helps us understand what information these networks rely on for speech segregation and exposes two sources of generalization errors. It also pinpoints the encoder as the part of the network responsible for these errors, allowing for a redesign with expert knowledge or transfer learning.
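A minimal sketch of the band-stop ablation idea, assuming a simple FFT-domain zeroing rather than whatever filter design the dissertation actually uses (names are illustrative): remove a frequency band from the mixture, then measure how segregation performance changes.

```python
import numpy as np

def band_stop(x, sr, lo_hz, hi_hz):
    """Crude ideal band-stop filter: zero out the FFT bins between
    lo_hz and hi_hz, then invert back to the time domain."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    X[(freqs >= lo_hz) & (freqs <= hi_hz)] = 0.0
    return np.fft.irfft(X, n=len(x))

# Toy mixture of two sinusoids; ablate the band around 200 Hz only.
sr = 16000
t = np.arange(sr) / sr  # 1 second
mix = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 600 * t)
ablated = band_stop(mix, sr, 150.0, 250.0)
```

Sweeping `lo_hz`/`hi_hz` over the harmonic series of a speaker and re-running the separation network is one way to probe which harmonics the network depends on for grouping.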

The work in this dissertation helps demystify end-to-end speech segregation networks and takes a step towards solving the cocktail party problem.