Speech recognition based on phonetic features and acoustic landmarks

Thumbnail Image


umi-umd-2127.pdf (1.48 MB)
No. of downloads: 1450

Publication or External Link






A probabilistic and statistical framework is presented for automatic speech recognition based on a phonetic feature representation of speech sounds. In this acoustic-phonetic approach, the speech recognition problem is hypothesized as a maximization of the joint posterior probability of a set of phonetic features and the corresponding acoustic landmarks. Binary classifiers of the manner phonetic features - syllabic, sonorant and continuant - are applied for the probabilistic detection of speech landmarks. The landmarks include stop bursts, vowel onsets, syllabic peaks, syllabic dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. The classifiers use automatically extracted knowledge based acoustic parameters (APs) that are acoustic correlates of those phonetic features. For isolated word recognition with known and limited vocabulary, the landmark sequences are constrained using a manner class pronunciation graph. Probabilistic decisions on place and voicing phonetic features are then made using a separate set of APs extracted using the landmarks.

The framework exploits two properties of the knowledge-based acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a decision on that feature and (2) invariance of the acoustic cues with respect to context. The probabilistic framework makes the acoustic-phonetic approach to speech recognition suitable for practical recognition tasks as well as compatible with probabilistic pronunciation and language models. Support vector machines (SVMs) are applied for the binary classification tasks because of their two favorable properties - good generalization and the ability to learn from a relatively small amount of high dimensional data. Performance comparable to Hidden Markov Model (HMM) based systems is obtained on landmark detection as well as isolated word recognition. Applications to rescoring of lattices from a large vocabulary continuous speech recognizer are also presented.