Combining Evidence from Unconstrained Spoken Term Frequency Estimation for Improved Speech Retrieval
Date
2008-11-21
Authors
Olsson, James Scott
Advisor
Oard, Douglas W
Abstract
This dissertation considers the problem of information retrieval in speech. Today's speech retrieval
systems generally use a large vocabulary continuous speech
recognition system to first hypothesize the words that were spoken.
Because these systems have a predefined lexicon, words that
fall outside of the lexicon can significantly reduce search quality, as measured
by Mean Average Precision (MAP). This is particularly important because these Out-Of-Vocabulary (OOV)
words are often rare and therefore good discriminators for topically relevant speech segments.
The focus of this dissertation is on handling these out-of-vocabulary query words. The approach
is to combine results from a word-based speech retrieval system with those from vocabulary-independent
ranked utterance retrieval. The goal of ranked utterance retrieval is to rank speech utterances
by the system's confidence that they contain a particular spoken word, which is accomplished by ranking
the utterances by the estimated frequency of the word in the utterance. Several
new approaches for estimating this frequency are considered, which are motivated by the disparity between
reference phoneme sequences and the error-prone sequences hypothesized by the recognizer. The first method learns alternate pronunciations or
degradations from actual recognition hypotheses and incorporates these variants into a new generative estimator for
term frequency. A second method learns transformations of several easily computed features in a discriminative
model for the same task. Both methods significantly improved ranked utterance retrieval in an experimental
validation on new speech.
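As a rough illustration of ranked utterance retrieval, the sketch below ranks utterances by an estimated term frequency. It is not the dissertation's estimator: it assumes, purely for illustration, that each utterance comes with a list of probabilities for candidate occurrences of the query word, and it takes the expected count (their sum) as the frequency estimate. The names `expected_term_frequency`, `rank_utterances`, and the utterance identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def expected_term_frequency(occurrence_probs: List[float]) -> float:
    """Estimate term frequency as the expected number of occurrences.

    Assumption: each value is the probability that one candidate match
    in the utterance is a true occurrence of the query word.
    """
    return sum(occurrence_probs)


def rank_utterances(candidates: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (utterance id, estimated frequency) pairs, highest estimate first."""
    scored = [
        (utt_id, expected_term_frequency(probs))
        for utt_id, probs in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidate-occurrence probabilities for one query word.
candidates = {
    "utt_a": [0.9, 0.2],  # two candidate matches
    "utt_b": [0.4],       # one weak candidate match
    "utt_c": [],          # no candidate matches
}
ranking = rank_utterances(candidates)
```

Under these assumptions, an utterance with several plausible matches outranks one with a single weak match, which in turn outranks one with no matches.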
The best of these ranked utterance retrieval methods is then combined with a word-based speech retrieval system. The combination
approach uses a normalization learned in an additive model, which maps the retrieval status values from each system into estimated probabilities
of relevance that are easily combined. Using this combination, much of the MAP lost because of OOV words is recovered. Evaluated on a
collection of spontaneous, conversational speech, the system recovers 57.5% of the MAP lost on short (title-only) queries and
41.3% on longer (title plus description) queries.
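The combination step can be pictured with a minimal sketch. The abstract says only that retrieval status values from each system are mapped to estimated probabilities of relevance and then combined. Everything else here is an assumption made for illustration: the logistic mapping with hand-picked parameters stands in for the learned normalization, the probabilistic-OR rule stands in for the combination, and a segment missing from one system is treated as having probability 0 under that system. All function names and scores are hypothetical.

```python
import math
from typing import Dict, Tuple


def to_probability(score: float, slope: float, intercept: float) -> float:
    """Map a retrieval status value to (0, 1) with a logistic curve.

    Assumption: a stand-in for the learned normalization; the slope and
    intercept would be fit on training data rather than chosen by hand.
    """
    return 1.0 / (1.0 + math.exp(-(slope * score + intercept)))


def combine(p_word: float, p_utt: float) -> float:
    """Probabilistic OR: relevant if either system's evidence holds."""
    return 1.0 - (1.0 - p_word) * (1.0 - p_utt)


def fuse(
    word_scores: Dict[str, float],
    utt_scores: Dict[str, float],
    word_params: Tuple[float, float],
    utt_params: Tuple[float, float],
) -> Dict[str, float]:
    """Fuse two systems' scores into one probability per speech segment."""
    fused = {}
    for seg in set(word_scores) | set(utt_scores):
        # A segment not returned by a system contributes probability 0.
        p_w = to_probability(word_scores[seg], *word_params) if seg in word_scores else 0.0
        p_u = to_probability(utt_scores[seg], *utt_params) if seg in utt_scores else 0.0
        fused[seg] = combine(p_w, p_u)
    return fused


# Hypothetical scores from a word-based system and a ranked utterance system.
word_scores = {"seg1": 2.0, "seg2": -1.0}
utt_scores = {"seg2": 3.0, "seg3": 0.0}
fused = fuse(word_scores, utt_scores, (1.0, 0.0), (1.0, 0.0))
ranked_segments = sorted(fused, key=fused.get, reverse=True)
```

Mapping both systems onto a common probability scale is what makes their outputs comparable; raw retrieval status values from different systems generally are not.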