Institute for Systems Research
Permanent URI for this community: http://hdl.handle.net/1903/4375
Search Results
Item Robust Speech Recognition by Topology Preserving Adaptation (2000) Sonmez, Kemal S.; Baras, John S.; ISR
The performance degradation as a result of acoustical environment mismatch remains an important practical problem in speech recognition. The problem carries a greater significance in applications over telecommunication channels, especially with the wider use of personal communications systems such as cellular phones, which invariably present challenging acoustical conditions. Such conditions are difficult to model analytically for a general speech representation, and most existing data-driven models require simultaneous ("stereo") recordings of training and testing environments, impractical to collect in most cases of interest.
In this dissertation, we propose an invariance principle for non-parametric speech representations in acoustical environments. We stipulate that the topology of the codevectors in a vector quantization (VQ) codebook, as defined in terms of class posterior distributions, will be preserved in a certain information-theoretic sense, and make this invariance principle our basis in deriving normalization algorithms that correct for the acoustical mismatch between environments.
We develop topology preserving algorithms in two frameworks, constrained distortion minimization (VQ with a topology preservation constraint) and information geometry (alternating minimization with a topology preservation constraint), and show their equivalence. Finally, we report results on the Wall Street Journal data, the Spoken Speed Dial corpus and the TI Cellular Corpus.
The algorithm is shown to improve performance significantly in all three tasks, most notably in the more difficult problem of cellular hands-free microphone speech, where the technique decreases the word error for continuous ten-digit recognition from 23.8% to 13.6% and the speaker-dependent voice calling sentence error from 16.5% to 10.6%.
Item Spectro-Temporal Modulation Transfer Functions and Speech Intelligibility (1999) Chi, Taishih; Gao, Yujie; Guyton, Matthew C.; Ru, Powen; Shamma, Shihab; ISR; CAAR
Detection thresholds for spectral and temporal modulations are measured using broadband spectra with sinusoidally rippled profiles that drift up or down the log-frequency axis at constant velocities. Spectro-temporal Modulation Transfer Functions (MTFs) are derived as a function of ripple peak density (cycles/octave) and drifting velocity (Hz). MTFs exhibit a lowpass function with respect to both dimensions, with 50 percent bandwidths of about 16 Hz and 2 cycles/octave. The data replicate (as special cases) previously measured purely temporal MTFs [Viemeister, 1979] and purely spectral MTFs [Green, 1986]. We present a computational auditory model that exhibits spectro-temporal MTFs consistent with the salient trends in the data. The model is used to demonstrate the potential relevance of these MTFs to the assessment of speech intelligibility in noise and reverberant conditions.
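As an illustration of the stimulus described in the abstract above, the following minimal Python sketch evaluates a sinusoidally rippled spectral envelope that drifts along the log-frequency axis. The function names, the modulation depth parameter, and the sign convention for the drift direction are assumptions made for illustration; they are not taken from the report.

```python
import math


def ripple_envelope(x_octaves, t_seconds, density, velocity, depth=0.9, phase=0.0):
    """Amplitude of a moving-ripple spectral envelope at one point.

    x_octaves: position on the log-frequency axis, in octaves above the
               lowest frequency component
    t_seconds: time in seconds
    density:   ripple peak density in cycles/octave
    velocity:  drift velocity in Hz (ripple cycles per second seen at a
               fixed frequency)
    depth:     modulation depth between 0 and 1 (assumed parameter)
    phase:     starting phase in radians (assumed parameter)
    """
    arg = 2.0 * math.pi * (velocity * t_seconds + density * x_octaves) + phase
    return 1.0 + depth * math.sin(arg)


def ripple_profile(n_channels, octaves, t_seconds, density, velocity, depth=0.9):
    """Sample the envelope at n_channels points equally spaced in log frequency."""
    step = octaves / (n_channels - 1)
    return [
        ripple_envelope(k * step, t_seconds, density, velocity, depth)
        for k in range(n_channels)
    ]


if __name__ == "__main__":
    # Profile at t = 0 for a ripple of 2 cycles/octave drifting at 16 Hz,
    # sampled over 5 octaves.
    for value in ripple_profile(11, 5.0, 0.0, density=2.0, velocity=16.0):
        print(round(value, 3))
```

The envelope repeats every 1/velocity seconds at a fixed frequency and every 1/density octaves at a fixed time, which is what makes density and velocity the two natural axes of a spectro-temporal MTF.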
Item Service Integration in Next Generation VSAT Networks (1997) Hadjitheodosiou, Michael H.; ISR; CSHCN
Very Small Aperture Terminal (VSAT) satellite networks have so far been successful in the provision of specific communication services to geographically dispersed users. However, user demands are becoming more complex, and VSAT networks are expected to provide a much wider range of services (voice, data and multimedia). We investigate how this service integration could be achieved and show that performance improvements are possible if efficient multi-access protocols and speech compression with voice activity detection techniques are used. We also discuss the future role VSATs could play in the provision of access to the Integrated Broadband Communications Network to remote users. We discuss the possibility of using VSATs for ATM service provision. The need for careful consideration of the advantages and limitations of using VSAT networks for this type of service is discussed. Finally, we highlight a method for dynamic bandwidth allocation in a broadband satellite network.
Item Application of Auditory Representations on Speaker Identification (1997) Chi, Taishih; Shamma, S. A.; ISR
The noise-robustness of auditory spectrum and cortical representation is examined by applying it to text-independent speaker identification tasks. A Bayes classifier residing on an M-ary hypothesis test is employed to evaluate the robustness of the auditory cepstrum and demonstrate its superior performance to that of the well-studied mel-cepstrum. In addition, the phase feature of the wavelet-transform based multiscale cortical representation is shown to be much more stable than the magnitude feature in characterizing speakers by a correlator technique, which is traditionally used in scene matching applications. This observation is consistent with physiological and psychoacoustic phenomena.
The underlying purpose of this study is to inspect the inherent robustness of auditory representations derived from a human perception-based model. The experimental results indicate that biologically motivated features significantly enhance speaker identification accuracy in noisy environments.
Item Soft-Decision Decoding for DPSK-Modulated Wireless Voice Communications (1996) Chen, Shih-I; ISR
This thesis addresses some techniques that enhance a receiver's performance in a wireless voice communication system where differential phase shift keying (DPSK) is the adopted modulation scheme and soft-decision decoding is used to improve the effectiveness of the channel coding scheme.
First, several fundamental issues regarding the statistical properties of fading channels are provided. We demonstrate the constraints that must be satisfied so that the channel can be regarded as impaired by "flat" (i.e., non-frequency-selective) fading with a constant fading factor over each symbol duration. Throughout this thesis these constraints are assumed to be satisfied.
We next investigate the channel capacity and cutoff rates for fading channels with DPSK-modulated input signals and perfect symbol interleaving. The impact of the channel state information (CSI) on these information-theoretic limits is also discussed. We introduce several symbol metrics for soft-decision decoding, and their performance is investigated by analytical derivation as well as by simulation. Furthermore, we define a bit metric for DQPSK modulation, and compare this bit metric to dibit (symbol) metrics.
We then consider the problem of error concealment for mobile radio communications with a maximum-likelihood soft-decision decoder. A normalized codeword reliability is defined as the decision reliability information when CSI is not available. We employ a given interpolation algorithm on a particular land mobile radio system and design a rule for selecting unreliable codewords. Simulation results show that error concealment can decrease the minimum operational signal-to-noise ratio (SNR) by 3 dB or more.
Finally, we address the problem of exploiting the residual redundancy in the source to enhance the channel decoder's performance, i.e., maximum a posteriori (MAP) decoding. We use two simple source models to demonstrate that MAP decoding can achieve significant gain over maximum-likelihood (ML) decoding. We employ a practical CELP-based land mobile radio system to show that significant residual redundancy does exist in the output of some source encoders. Simulation results show that a 2 to 3 dB gain can be achieved by MAP decoding (over ML) at low SNR.
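The difference between ML and MAP decisions mentioned above can be sketched for the simplest possible case. The Python example below assumes a memoryless binary source with a known prior, BPSK signalling (bit 0 sent as +1, bit 1 as -1) and an AWGN channel; these modelling choices and all function names are illustrative assumptions, not the source models or the DPSK system used in the thesis.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def ml_decide(y):
    """ML decision: pick the bit whose signal point is closest to y."""
    return 0 if y >= 0.0 else 1


def map_decide(y, prior_zero, noise_var):
    """MAP decision: add the log prior ratio to the channel log-likelihood ratio."""
    channel_llr = 2.0 * y / noise_var  # log p(y | +1) - log p(y | -1)
    prior_llr = math.log(prior_zero / (1.0 - prior_zero))
    return 0 if channel_llr + prior_llr >= 0.0 else 1


def ml_error_rate(noise_var):
    """Bit error probability of the ML rule (independent of the prior)."""
    return q_function(1.0 / math.sqrt(noise_var))


def map_error_rate(prior_zero, noise_var):
    """Bit error probability of the MAP rule for the assumed biased source."""
    sigma = math.sqrt(noise_var)
    # MAP decides 0 when y >= threshold; the prior shifts the threshold away from 0.
    threshold = -0.5 * noise_var * math.log(prior_zero / (1.0 - prior_zero))
    error_given_zero = q_function((1.0 - threshold) / sigma)
    error_given_one = q_function((1.0 + threshold) / sigma)
    return prior_zero * error_given_zero + (1.0 - prior_zero) * error_given_one


if __name__ == "__main__":
    print("ML error rate: ", ml_error_rate(noise_var=1.0))
    print("MAP error rate:", map_error_rate(prior_zero=0.9, noise_var=1.0))
```

With a uniform prior the two rules coincide; the more redundant (biased) the source, the further the MAP threshold moves and the larger the gap between the two error rates.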
Item Channel Codes That Exploit the Residual Redundancy in CELP-Encoded Speech (1996) Alajaji, Fady; Phamdo, N.; Fuja, Tom E.; ISR
We consider the problem of reliably transmitting CELP-encoded speech over noisy communication channels. Our objective is to design efficient coding/decoding schemes for the transmission of the CELP line spectral parameters (LSP's) over very noisy channels.
We begin by quantifying the amount of "residual redundancy" inherent in the LSP's of Federal Standard 1016 CELP. This is done by modeling the LSP's as first- and second-order Markov chains. Two models for LSP generation are proposed; the first model characterizes the intra-frame correlation exhibited by the LSP's, while the second model captures both intra-frame and inter-frame correlation. By comparing the entropy rates of the models thus constructed with the CELP rates, it is shown that as many as one-third of the LSP bits in every frame of speech are redundant.
We next consider methods by which this residual redundancy can be exploited by an appropriately designed channel decoder. Before transmission, the LSP's are encoded with a forward error control (FEC) code; we consider both block (Reed-Solomon) codes and convolutional codes. Soft-decision decoders that exploit the residual redundancy in the LSP's are implemented assuming additive white Gaussian noise (AWGN) and independent Rayleigh fading environments. Simulation results employing binary phase-shift keying (BPSK) indicate coding gains of 2 to 5 dB over soft-decision decoders that do not exploit the residual redundancy.
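The idea of measuring residual redundancy by comparing an entropy rate with the transmitted bit rate can be sketched as follows. This Python example computes the entropy rate of a first-order Markov chain from an assumed transition matrix and subtracts it from the number of bits spent per symbol. The toy transition matrix, the power-iteration helper and the function names are hypothetical; the report's actual LSP models and figures are not reproduced here.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution by power iteration.

    Assumes P is a row-stochastic matrix of an ergodic (irreducible,
    aperiodic) chain, so that the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate_bits(P):
    """Entropy rate H = -sum_i pi_i sum_j P_ij log2 P_ij, in bits per symbol."""
    pi = stationary_distribution(P)
    h = 0.0
    for i, row in enumerate(P):
        for p in row:
            if p > 0.0:
                h -= pi[i] * p * math.log2(p)
    return h


def residual_redundancy_bits(P, bits_per_symbol):
    """Bits per symbol spent by the encoder beyond the model's entropy rate."""
    return bits_per_symbol - entropy_rate_bits(P)


if __name__ == "__main__":
    # Toy two-state source that tends to repeat its previous symbol.
    sticky = [[0.9, 0.1],
              [0.1, 0.9]]
    print("entropy rate (bits/symbol):", entropy_rate_bits(sticky))
    print("redundancy at 1 bit/symbol:", residual_redundancy_bits(sticky, 1.0))
```

A source whose symbols are independent and uniform yields zero redundancy under this measure, while any correlation between successive symbols lowers the entropy rate and leaves redundancy that a decoder could, in principle, exploit.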
Item Design and Performance of Trellis Codes for Wireless Channels (1995) Al-Semari, Saud A.; Fuja, T.E.; ISR
Signal fading is one of the primary sources of performance degradation in mobile radio (wireless) systems. This dissertation addresses three different techniques to improve the performance of communication systems over fading channels, namely trellis coded modulation (TCM), space diversity and sequence maximum a posteriori (MAP) decoding.
In the first part, TCM schemes that provide high coding gains over the flat, slow Rayleigh-distributed fading channel are presented. It is shown that the use of two encoders in parallel to specify the in-phase and quadrature components of the transmitted signal results in large performance improvements in bit error rates when compared with conventional TCM schemes in which a single encoder is used. Using this approach, which we label "I-Q TCM", codes with bandwidth efficiencies of 1, 2, and 3 bits/sec/Hz are described for various constraint lengths. The performance of these codes is evaluated using tight upper bounds and simulation.
In the second part, the use of space diversity with three different combining schemes is investigated. Expressions for the cutoff rate parameter are shown for the three combining schemes over the fully interleaved Rayleigh-distributed flat fading channel. Also, tight upper bounds on the pairwise error probability are derived for the three combining schemes. Examples of I-Q TCM schemes with diversity combining are shown. The cutoff rate and a tight upper bound on the pairwise error probability are also derived for maximal ratio combining with correlated branches.
In the last part, the problem of reliably transmitting trellis coded signals over very noisy channels is considered. Sequence maximum a posteriori (MAP) decoding of correlated signals transmitted over very noisy AWGN and Rayleigh channels is presented. A variety of different systems with different sources, modulation schemes, encoder rates and complexities are simulated. Sequence MAP decoding is shown to substantially improve performance under very noisy channel conditions, especially for systems with moderate redundancies and encoder rates. A practical example for coding the CELP line spectral parameters (LSPs) is also considered. Two source models are used. Coding gains of as much as 4 dB are achieved.
Item Analysis of Dynamic Spectra in Ferret Primary Auditory Cortex: II. Prediction of Unit Responses to Arbitrary Dynamic Spectra (1995) Kowalski, Nina; Depireux, Didier A.; Shamma, S.A.; ISR
Responses of single units and unit clusters were recorded in the ferret primary auditory cortex (AI) using broadband complex dynamic spectra. Previous work (Kowalski et al., 1995) demonstrated that simpler spectra consisting of single moving ripples (i.e., sinusoidally modulated spectral profiles that travel at a constant velocity along the logarithmic frequency axis) could be used effectively to characterize the response fields and transfer functions of AI cells. An arbitrary complex dynamic spectral profile can be thought of conceptually as being composed of a weighted sum of moving ripple spectra. Such a decomposition can be computed from a two-dimensional spectro-temporal Fourier transform of the dynamic spectral profile with moving ripples as the basis function. Therefore, if AI units were essentially linear, satisfying the superposition principle, then their responses to arbitrary dynamic spectra could be predicted from the responses to single moving ripples, i.e., from the units' response fields and transfer functions. This conjecture was tested and confirmed with data from 293 combinations of moving ripples, involving complex spectra composed of up to 15 moving ripples of different ripple frequencies and velocities. For each case, response predictions based on the unit transfer functions were compared to measured responses. The correlation between predicted and measured responses was found to be consistently high (84% with rho > 0.6). The distribution of response parameters suggests that AI cells may encode the profile of a dynamic spectrum by performing a multiscale spectro-temporal decomposition of the dynamic spectral profile in a largely linear manner.
Item Analysis of Dynamic Spectra in Ferret Primary Auditory Cortex: I. Characteristics of Single Unit Responses to Moving Ripple Spectra (1995) Kowalski, Nina; Depireux, Didier A.; Shamma, S.A.; ISR
Auditory stimuli referred to as moving ripples are used to characterize the responses of both single and multiple units in the ferret primary auditory cortex (AI). Moving ripples are broadband complex sounds with sinusoidal spectral profiles that drift along the tonotopic axis at a constant velocity. Neuronal responses to moving ripples are locked to the phase of the ripple, i.e., they exhibit the same periodicity as that of the moving ripple profile. Neural responses are characterized as a function of ripple velocity (temporal property) and ripple frequency (spectral property). Transfer functions describing the response to these temporal and spectral modulations are constructed. Temporal transfer functions are inverse Fourier transformed to obtain impulse response functions that reflect the cell's temporal characteristics. Ripple transfer functions are inverse Fourier transformed to obtain the response field, characterizing the cell's response area along the tonotopic axis. These operations assume linearity in the cell's response to moving ripples. Separability of the temporal and ripple transfer functions is established by comparing transfer functions across different test parameters. Response fields measured with either stationary ripples or moving ripples are shown to be similar. Separability implies that the neuron can be modeled as processing spatio-temporal information in two distinct stages. The assumption of linearity implies that each of these stages is a linear operation.
The ripple parameters that characterize cortical cells are distributed somewhat evenly, with the characteristic ripple frequencies ranging from 0.2 to over 2 cycles/octave and the characteristic angular frequency typically ranging from 2 to 20 Hz. Many responses exhibit periodicities not found in the spectral envelope of the stimulus.
These periodicities are of two types. Slow rebounds with a period of about 150 ms appear with various strengths in about 30% of the cells. Fast regular firings, with interspike intervals of the order of 10 ms, are much less common and may reflect the ability of certain cells to follow the fine structure of the stimulus.
Item Channel Codes That Exploit the Residual Redundancy in CELP-Encoded Speech (1995) Alajaji, Fady; Phamdo, N.; Fuja, Tom E.; ISR
We consider the problem of reliably transmitting CELP-encoded speech over noisy communication channels. Our objective is to design efficient coding/decoding schemes for the transmission of the CELP line spectral parameters (LSP's) over very noisy channels.
We begin by quantifying the amount of "residual redundancy" inherent in the LSP's of Federal Standard 1016 CELP. This is done by modeling the LSP's as first- and second-order Markov chains. Two models for LSP generation are proposed; the first model characterizes the intra-frame correlation exhibited by the LSP's, while the second model captures both intra-frame and inter-frame correlation. By comparing the entropy rates of the models thus constructed with the CELP rates, it is shown that as many as one-third of the LSP bits in every frame of speech are redundant.
We next consider methods by which this residual redundancy can be exploited by an appropriately designed channel decoder. Before transmission, the LSP's are encoded with a forward error control (FEC) code; we consider both block (Reed-Solomon) codes and convolutional codes. Soft-decision decoders that exploit the residual redundancy in the LSP's are implemented assuming additive white Gaussian noise (AWGN) and independent Rayleigh fading environments. Simulation results employing binary phase-shift keying (BPSK) indicate coding gains of 2 to 5 dB over soft-decision decoders that do not exploit the residual redundancy.