Discrimination of Speech From Non-Speech Based on Multiscale Spectro-Temporal Modulations
MetadataShow full item record
We describe a content-based audio classification algorithm based on novel multiscale spectrotemporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multi-linear dimensionality reduction technique and classified by a Support Vector Machine (SVM). Generalization of the system to signals in high level of additive noise and reverberation is evaluated and compared to two existing approaches  . The results demonstrate the advantages of the auditory model over the other two systems, especially at low SNRs and high reverberation.