Synergy of Acoustic-Phonetics and Auditory Modeling Towards Robust Speech Recognition
Deshmukh, Om Dadaji
Espy-Wilson, Carol Y
This work addresses the problem of enhancing speech signals corrupted by additive noise and improving the performance of automatic speech recognizers in noisy conditions. The enhanced speech signals can also improve the intelligibility of speech in noisy conditions for human listeners with hearing impairment as well as for normal listeners. The original Phase Opponency (PO) model, proposed to detect tones in noise, simulates the processing of the information in neural discharge times. It exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery, along with cross-auditory-nerve-fiber coincidence detection, to extract temporal cues. The Modified Phase Opponency (MPO) model proposed here alters the components of the PO model so that its basic functionality is maintained while the various properties of the model can be analyzed and modified independently of each other. This work presents a detailed mathematical formulation of the MPO model and relates the properties of the narrowband signal to be detected to the properties of the MPO model. The MPO speech enhancement scheme is based on the premise that speech signals are composed of a combination of narrowband signals (i.e., harmonics) with varying amplitudes. The MPO enhancement scheme outperforms many other speech enhancement techniques when evaluated using different objective quality measures. Automatic speech recognition experiments show that replacing noisy speech signals with the corresponding MPO-enhanced speech signals improves recognition accuracy at low SNRs. The amount of improvement varies with the type of corrupting noise.
Perceptual experiments indicate that: (a) there is little perceptual difference between the MPO-processed clean speech signals and the corresponding original clean signals, and (b) the MPO-enhanced speech signals are preferred over the output of the other enhancement methods when the speech signals are corrupted by subway noise, whereas the outputs of the other enhancement schemes are preferred when the speech signals are corrupted by car noise.
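
The phase-opponency idea of detecting a narrowband signal in noise can be illustrated with a toy sketch. It is not the PO or MPO model described in this work: instead of auditory filters and coincidence detection, it assumes a single path delayed by half a period of the target tone, so that the two paths are 180 degrees out of phase at that frequency. The sample rate, tone frequency, noise level, and the helper name `opponent_correlation` are all hypothetical choices made for illustration.

```python
import math
import random


def opponent_correlation(x, delay):
    """Normalized mean product of x[n] and x[n - delay].

    Near -1 when x is dominated by a tone whose half period equals `delay`
    samples (the two paths are in antiphase); near 0 for white noise.
    """
    num = sum(x[n] * x[n - delay] for n in range(delay, len(x)))
    den = sum(v * v for v in x[delay:])
    return num / den if den else 0.0


fs = 8000                       # sample rate in Hz (assumed)
f0 = 500                        # target tone frequency in Hz (assumed)
half_period = fs // (2 * f0)    # 8 samples, i.e. 180 degrees at f0

rng = random.Random(0)          # fixed seed so the run is repeatable
n_samples = 4000
tone = [math.sin(2 * math.pi * f0 * n / fs) for n in range(n_samples)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
noisy_tone = [t + 0.5 * w for t, w in zip(tone, noise)]  # assumed noise level

tone_score = opponent_correlation(tone, half_period)
noise_score = opponent_correlation(noise, half_period)
noisy_tone_score = opponent_correlation(noisy_tone, half_period)

print(f"tone only:    {tone_score:+.3f}")
print(f"noise only:   {noise_score:+.3f}")
print(f"tone + noise: {noisy_tone_score:+.3f}")
```

In this sketch a strongly negative score marks the presence of the tone, while white noise alone scores close to zero. A speech enhancement scheme in this spirit would need one such detector per frequency region, which the toy example does not attempt.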