MODELING ADAPTABILITY MECHANISMS OF SPEECH PERCEPTION

Nika Jurov

Date

2024

Abstract

Speech is a complex, redundant, and variable signal occurring in a noisy and ever-changing world. How do listeners navigate these complex auditory scenes and continuously, effortlessly understand most of the speakers around them? Studies show that listeners can quickly adapt to new situations, accents, and even distorted speech. Although prior research has established that listeners rely more on some speech cues (also called features or dimensions) than others, it is not yet understood how listeners weight them flexibly on a moment-to-moment basis when the input deviates from standard speech.

This thesis computationally explores flexible cue re-weighting as an adaptation mechanism using real speech corpora. The computational framework it relies on is rate-distortion theory. This framework models a channel that is optimized on a trade-off between distortion and rate: on the one hand, the input signal should be reconstructed with minimal error after it passes through the channel; on the other hand, the channel should extract only parsimonious information from the incoming data. Such a channel can be implemented as a neural network, specifically a beta variational autoencoder (β-VAE).
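To make the trade-off concrete, the sketch below shows a minimal β-VAE in PyTorch, where the reconstruction term plays the role of distortion and the β-weighted KL term bounds the channel's rate. This is an illustrative sketch only; the layer sizes, variable names, and the choice of mean-squared error are assumptions for the example, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal beta-VAE channel: the reconstruction term measures
    distortion; the KL term bounds the rate of the latent code."""
    def __init__(self, input_dim=40, latent_dim=8, beta=4.0):
        super().__init__()
        self.beta = beta
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

    def loss(self, x):
        recon, mu, logvar = self(x)
        distortion = F.mse_loss(recon, x, reduction="mean")  # reconstruction error
        # KL divergence of q(z|x) from the standard normal prior (the "rate").
        rate = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return distortion + self.beta * rate
```

Raising beta tightens the rate constraint, forcing the channel to keep only the most informative parts of the signal.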

We use this model to show that two mechanistic components are needed for adaptation: focus and switch. We first show that focusing on a cue mimics human behavior better than cue weights that simply reflect long-term statistics, as has largely been assumed in prior research. Second, we present a new model that can quickly adapt, switching how it weights the features depending on the input at a particular moment. This model's flexibility comes from implementing a cognitive mechanism that has been called "selective attention" with multiple encoders. Each encoder serves as a focus on a different part of the signal, and we can then choose how much to rely on each focus from moment to moment.
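A hypothetical sketch of this focus-and-switch architecture follows: several encoders each act as a focus on the input, and a learned gate switches how much to rely on each one per input. The class name, gating scheme, and dimensions are illustrative assumptions, not the thesis's code.

```python
import torch
import torch.nn as nn

class FocusAndSwitch(nn.Module):
    """Multiple encoders ('focus'), each specializing on part of the
    signal, combined by a per-input softmax gate ('switch')."""
    def __init__(self, input_dim=40, latent_dim=8, n_encoders=2):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                          nn.Linear(64, latent_dim))
            for _ in range(n_encoders)])
        # Gate scores decide how much to rely on each focus for this input.
        self.gate = nn.Linear(input_dim, n_encoders)

    def forward(self, x):
        codes = torch.stack([enc(x) for enc in self.encoders], dim=1)  # (B, K, D)
        weights = torch.softmax(self.gate(x), dim=-1)                  # (B, K)
        return (weights.unsqueeze(-1) * codes).sum(dim=1)              # weighted mix
```

Because the gate is a function of the current input, the reliance on each encoder can shift on a moment-to-moment basis rather than being fixed by long-term statistics.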

Finally, we ask whether cue weighting is informed by the ability to separate noise from speech. To this end, we adapt an adversarial feature-disentanglement training scheme from vision to disentangle speech (noise) features from noise (speech) labels. We show that although this does not yield human-like cue-weighting behavior, disentanglement does have an effect: the model weights spectral information slightly more than temporal information compared to the baselines.
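Adversarial feature disentanglement in vision is commonly implemented with a gradient reversal layer (as in domain-adversarial training). The sketch below shows that standard construction as one plausible reading of the approach, not necessarily the thesis's exact method; the helper names and the commented usage line are hypothetical.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity on the forward pass; flips the gradient's sign on backward,
    so the encoder learns to remove what the adversary can predict."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical usage: features from the speech encoder feed a noise
# classifier through the reversal layer; training the classifier then
# pushes the encoder toward noise-invariant (disentangled) features.
# noise_logits = noise_classifier(grad_reverse(speech_features))
```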

Overall, this thesis explores adaptation computationally and offers a possible mechanistic explanation for "selective attention" through focus and switch mechanisms, grounded in rate-distortion theory. It also argues that cue weighting cannot be determined solely from speech carefully articulated in laboratories or recorded in quiet. Lastly, it explores a way to inform speech models from a cognitive angle, making them more flexible and robust, as human speech perception is.
