Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means there may be up to a 4-month delay before a given thesis/dissertation appears in DRUM.
More information is available on the Theses and Dissertations page at the University of Maryland Libraries.
Search Results
6 results
Item: Efficient learning-based sound propagation for virtual and real-world audio processing applications (2024)
Ratnarajah, Anton Jeran; Manocha, Dinesh; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Sound propagation is the process by which sound energy travels through a medium, such as air, to the surrounding environment as sound waves. The room impulse response (RIR) describes this process and is influenced by the positions of the source and listener, the room's geometry, and its materials. Physics-based acoustic simulators have been used for decades to compute accurate RIRs for specific acoustic environments. However, existing acoustic simulators have limitations: for example, they require a 3D representation and detailed material knowledge of the environment. To address these limitations, we propose three novel solutions. First, we introduce a learning-based RIR generator that is two orders of magnitude faster than an interactive ray-tracing simulator. Our approach can be trained directly on both statistical and traditional parameters, and it can generate both monaural and binaural RIRs for reconstructed and synthetic 3D scenes. Our generated RIRs outperform those from interactive ray-tracing simulators in speech-processing applications, including Automatic Speech Recognition (ASR), speech enhancement, and speech separation, by 2.5%, 12%, and 48%, respectively. Second, we propose estimating RIRs from reverberant speech signals and visual cues when no 3D representation of the environment is available. By estimating RIRs from reverberant speech, we can augment training data to match test data, improving the word error rate of the ASR system. Our estimated RIRs achieve a 6.9% improvement over previous learning-based RIR estimators in real-world far-field ASR tasks.
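The data-augmentation step this abstract describes rests on a standard operation: convolving a dry (anechoic) speech signal with an RIR to simulate reverberant speech. A minimal numpy sketch, with a synthetic signal and a toy exponentially decaying RIR standing in for measured or generated ones (none of the values below come from the dissertation):

```python
import numpy as np

def apply_rir(dry_signal: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate reverberant speech by convolving a dry signal with an RIR."""
    return np.convolve(dry_signal, rir)

# Synthetic placeholders: 1 s of noise as the "dry" signal, and a toy
# RIR with ~0.1 s exponential decay (real RIRs come from measurement,
# a physics-based simulator, or a learned generator).
rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)                     # 1 s at 16 kHz
t = np.arange(4000) / 16000.0
rir = np.exp(-t / 0.1) * rng.standard_normal(4000)   # ~0.25 s tail

wet = apply_rir(dry, rir)
print(wet.shape)  # (19999,) = 16000 + 4000 - 1
```

The "wet" output is what would be fed to an ASR front end so that training data matches reverberant test conditions.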
We demonstrate that our audio-visual RIR estimator aids tasks such as visual acoustic matching, novel-view acoustic synthesis, and voice dubbing, validated through perceptual evaluation. Finally, we introduce IR-GAN to augment accurate RIRs using real RIRs. IR-GAN parametrically controls acoustic parameters learned from real RIRs to generate new RIRs that imitate different acoustic environments, outperforming ray-tracing simulators on the Kaldi far-field ASR benchmark by 8.95%.

Item: MULTIMODAL ANALYSIS OF NEURAL SIGNALS RELATED TO SOURCE MEMORY ENCODING IN YOUNG CHILDREN (2024)
Lei, Yuqing; Riggins, Tracy; Psychology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The emergence of source memory is an important milestone in memory development. Decades of research have explored the neural correlates of source memory using electroencephalography (EEG) and functional magnetic resonance imaging (fMRI). However, connections between findings from the two approaches, particularly in children, remain unclear. This dissertation identified fMRI-informed cortical sources of two EEG signals during memory encoding, the P2 and the late slow wave (LSW), that predicted subsequent source memory performance in a sample of children aged 4 to 8 years. Both the P2 and the LSW were source-localized to cortical areas of the medial temporal lobe (MTL), reflecting the MTL's crucial role in both early-stage information processing and late-stage integration of memory, which also validated the LSW's suspected role in memory updating. The P2 effect was localized to all six tested subregions of cortical MTL in both hemispheres, whereas the LSW effect was present only in the parahippocampal cortex and entorhinal cortex. The P2 was additionally localized to multiple areas in the frontoparietal network, a cortical network known as the "attention network", highlighting interactions between memory encoding and other cognitive functions.
These results reflect the importance of considering both spatial and temporal aspects of neural activity when decoding memory mechanisms, demonstrate the potential of combining multimodal measures in children, and pave the way for future developmental research.

Item: Generalizable Depression Detection and Severity Prediction Using Articulatory Representations of Speech (2022)
Seneviratne, Nadee; Espy-Wilson, Carol; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Major Depressive Disorder (MDD) is a mental health disorder that has taken a massive toll on society both socially and financially. Timely diagnosis of MDD is crucial to minimize serious consequences such as suicide. Hence, automated solutions that can reliably detect MDD and predict its severity can play a pivotal role in helping healthcare professionals provide timely treatment. MDD is known to affect speech. Leveraging the changes in speech characteristics that occur with depression, many vocal biomarkers have been developed to detect depression. However, changes in articulatory coordination associated with depression remain under-explored. Speech articulation is a complex activity that requires finely timed coordination across articulators. In a depressed state involving psychomotor slowing, this coordination changes and in turn modifies the perceived speech signal. In this work, we use a direct representation of articulation known as vocal tract variables (TVs) to capture the coordination between articulatory gestures. TVs define the constriction degree and location of the articulators (tongue, jaw, lips, velum, and glottis). Previously, the correlation structure of formants or mel-frequency cepstral coefficients (MFCCs) was used as a proxy for the underlying articulatory coordination.
We compute articulatory coordination features (ACFs), which capture the correlations among time-series data at different time delays and are therefore rich in information about the underlying coordination of speech production. Using the rank-ordered eigenspectra obtained from TV-based ACFs, we show that depressed speech exhibits simpler coordination relative to speech from the same subjects when in remission, in line with previous findings. In a preliminary study using a small subset of speech from subjects who transitioned from severe depression to remission, we show that TV-based ACFs outperform formant-based ACFs in binary depression classification. We show that depressed speech has reduced variability in terms of reduced coarticulation and undershoot. To validate this, we present a comprehensive acoustic analysis and the results of a speech-in-noise perception study comparing the intelligibility of depressed and non-depressed speech. Our results indicate that depressed speech is at least as intelligible as non-depressed speech. The next stage of our work develops deep learning models using TV-based ACFs to detect depression and attempts to overcome the limitations of existing work. We combine two speech depression databases with different characteristics, which helps to increase generalizability, a key objective of this research. Moreover, we segment audio recordings prior to feature extraction to obtain the data volumes required to train deep neural networks. We reduce the dimensionality of conventional stacked ACFs over multiple delay scales by using refined ACFs, carefully curated to remove redundancies, together with the strengths of dilated convolutional neural networks. We show that models trained on TV-based ACFs are more generalizable than their proxy-based counterparts.
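The correlation-structure idea behind ACFs can be sketched in a few lines: stack time-delayed copies of each channel, form the cross-correlation matrix, and rank-order its eigenvalues (a faster-decaying eigenspectrum suggests simpler coordination). A minimal numpy sketch with random signals standing in for TV trajectories; the function names and delay values are illustrative, not taken from the dissertation:

```python
import numpy as np

def delay_embedded_corr(signals: np.ndarray, delays) -> np.ndarray:
    """Correlation matrix over channels at several time delays.

    signals: (channels, samples) array, e.g. vocal tract variable tracks.
    """
    chans, n = signals.shape
    max_d = max(delays)
    # Stack delayed copies of every channel into one feature matrix,
    # trimming all rows to a common length.
    rows = [signals[c, d:n - max_d + d] for c in range(chans) for d in delays]
    return np.corrcoef(np.vstack(rows))

def rank_ordered_eigenspectrum(corr: np.ndarray) -> np.ndarray:
    """Eigenvalues of the (symmetric) correlation matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

rng = np.random.default_rng(1)
tvs = rng.standard_normal((4, 500))            # 4 synthetic "TV" channels
corr = delay_embedded_corr(tvs, delays=(0, 5, 10))
eigs = rank_ordered_eigenspectrum(corr)
print(corr.shape)  # (12, 12): 4 channels x 3 delays
```

On real data, the shape of `eigs` (how quickly the ranked eigenvalues fall off) is the kind of summary compared between depressed and remitted speech.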
We then develop a multi-stage convolutional recurrent neural network that performs classification at the session level, and derive the constraints under which this segment-to-session-level approach can boost classification performance. We extend our models to perform depression severity level classification; TV-based ACFs outperform other feature sets in this task as well. Language patterns and semantics can reveal vital information about a person's mental state, so we develop a multimodal depression classifier that incorporates TV-based ACFs and hierarchical attention-based text embeddings. The fusion strategy of the proposed architecture allows data from different modalities to be segmented independently (overlapping segments for audio, sentences for text), in the way best suited to each modality, when performing segment-to-session-level classification. The multimodal classifier clearly outperforms the unimodal classifiers. Finally, we develop a multimodal system to predict the depression severity score, a more challenging regression problem due to the quasi-numerical nature of the scores. The multimodal regressor achieves the lowest root mean squared error, showing the synergy of combining modalities such as audio and text. We perform an exhaustive error analysis that reveals potential improvements for future work.
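The segment-to-session step can be illustrated with the simplest possible aggregation rule: average the segment-level posteriors and threshold. This is only a sketch of the general idea; the dissertation derives its own constraints and uses a learned multi-stage network, not this rule, and the probabilities below are made up:

```python
import numpy as np

def session_prediction(segment_probs, threshold=0.5) -> bool:
    """Aggregate segment-level depression probabilities into one
    session-level label by mean pooling (one simple aggregation rule)."""
    return bool(np.mean(segment_probs) >= threshold)

# Hypothetical posteriors from a segment-level classifier for one session.
probs = [0.8, 0.6, 0.7, 0.4, 0.9]
print(session_prediction(probs))  # True (mean 0.68 >= 0.5)
```

Averaging lets many short, noisy segment decisions (which provide enough training examples for deep networks) combine into one more reliable session decision.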
The work in this dissertation takes a step toward the betterment of humanity by developing technologies that improve speech-based depression assessment, utilizing the strengths of ACFs derived from direct articulatory representations.

Item: USE OF MULTIMODAL COMMUNICATION IN PLAY INTERACTIONS WITH CHILDREN WITH AUTISM (2020)
Rain, Avery; Bernstein Ratner, Nan; Hearing and Speech Sciences; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In typical adult-child interaction, adults tend to coordinate gesture and other nonverbal modes of communication with their verbalizations (multimodal communication). This study explored the effectiveness of multimodal communication with young children with autism spectrum disorders (ASD) in encouraging child responses. Maternal use of verbal, nonverbal, and multimodal initiations, and the subsequent response or lack of response of the child, was examined in fifty video-recorded mother-child play interactions. Results indicated that mothers initiated multimodally at similar rates with children at lower and higher expressive language levels. Child response rates to multimodal initiations were higher than response rates to verbal-only or nonverbal-only initiations; this finding was consistent across low and high expressive language groups. Additionally, a significant positive correlation was found between maternal wait time after initiation and overall child response rate.
These findings have important ramifications for clinical practice and parent training.

Item: MULTIMODAL TRAVEL BEHAVIOR ANALYSIS AND MONITORING AT METROPOLITAN LEVEL USING PUBLIC DOMAIN DATA (2019)
Peng, Bo; Zhang, Lei; Civil Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Travel behavior data enable an understanding of why, how, and when people travel, and play a critical role in travel trend monitoring, transportation planning, and policy decision support. Conventional travel behavior data collection methods such as the National Household Travel Survey (NHTS) have been the primary source of travel behavior information for transportation agencies. However, the relatively high cost of traditional travel surveys often prohibits frequent survey cycles (currently once every 5-10 years). With decision makers increasingly requesting recent and up-to-date information on multimodal travel trends, establishing a sustainable and timely travel monitoring program based on available public domain data sources is in order. This dissertation developed advanced data processing, expansion, fusion, and analysis methods and integrated them with existing public domain data into a comprehensive model that allows transportation agencies to track monthly multimodal travel behavior trends, e.g., mode share, number of trips, and trip frequency, at the metropolitan level. Advanced analytical methods are developed to overcome significant challenges in tracking monthly travel behavior trends across modes. The proposed methods are tailored to the distinct challenges of each mode and are flexible enough to accommodate the heterogeneous spatial and temporal resolutions and updating schedules of different data sources.
For the driving mode, this dissertation developed reliable methods to estimate local road VMT, various temporal adjustment factors, truck percentage factors, average vehicle occupancy, and average trip length, based on additional data from the Travel Monitoring Analysis System and the most recent regional household travel survey, in order to translate HPMS data into the monthly number of vehicular and person driving trips for a metropolitan area. For the transit mode, this dissertation exhaustively collected detailed transit network geo-data to complement NTD and developed advanced geo-analysis and statistical methods, tailored to the service networks of different types of operators, to accurately and reliably allocate ridership data to the metropolitan area of interest and to allocate annual ridership to each month. Data for non-motorized modes are even sparser, although local governments have growing interest in collecting such data. A two-step statistical model is developed to derive trends for non-motorized modes and then integrate those trends with the base-year number of trips from the most recent household travel survey conducted in the metropolitan area of interest. Based on the number of trips by mode estimated with the proposed methods, the monthly trend in mode share can be estimated in a timely manner and continuously monitored over time, for the first time in the literature, using public domain data only. The dissertation demonstrates that it is feasible to build a comprehensive model for multimodal travel trend monitoring and analysis by integrating a wide range of traffic and travel behavior data sets spanning multiple travel modes.
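At its core, the driving-mode translation described above is factor arithmetic: remove truck traffic from VMT, divide by average trip length to get vehicle trips, and multiply by occupancy to get person trips. A sketch with made-up factor values; in practice every factor comes from TMAS, HPMS, or the regional household travel survey:

```python
def monthly_driving_trips(vmt, truck_share, avg_trip_len_mi, avg_occupancy):
    """Translate monthly VMT into vehicular and person driving trips.

    All inputs are illustrative placeholders; real values must be
    estimated from local traffic-monitoring and survey data.
    """
    passenger_vmt = vmt * (1.0 - truck_share)     # strip truck traffic
    vehicle_trips = passenger_vmt / avg_trip_len_mi
    person_trips = vehicle_trips * avg_occupancy  # vehicles -> persons
    return vehicle_trips, person_trips

veh, per = monthly_driving_trips(
    vmt=90_000_000,        # monthly VMT for the metro area (hypothetical)
    truck_share=0.10,      # truck percentage factor
    avg_trip_len_mi=9.0,   # average trip length, miles
    avg_occupancy=1.5)     # average vehicle occupancy
print(veh, per)  # roughly 9.0e6 vehicle trips, 1.35e7 person trips
```

The same pattern of chaining adjustment factors applies to the temporal factors (month-of-year, day-of-week) the dissertation estimates.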
Based on these findings, it can be concluded that the proposed public-domain databases and the data processing, expansion, fusion, and analysis methods provide a reliable way to monitor month-to-month multimodal travel demand at the metropolitan level across the U.S.

Item: Data and Methods for Reference Resolution in Different Modalities (2017)
Guha, Anupam; Aloimonos, Yiannis; Boyd-Graber, Jordan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

One foundational goal of artificial intelligence is to build intelligent agents that interact with humans; to do so, they must be able to infer from human communication which concept a span of symbols refers to. Like humans, they should be able to map these representations to perceptual inputs, visual or otherwise. In NLP, the problem of discovering which spans of text refer to the same real-world entity is called coreference resolution. This dissertation expands this problem beyond text, mapping concepts referred to by text spans to concepts represented in images. It also investigates the complex and difficult nature of real-world coreference resolution. Lastly, it expands the definition of reference to include abstractions referred to by non-contiguous text distributions. A central theme throughout this thesis is the paucity of data for hard reference problems, which it addresses by designing several datasets. To investigate hard text coreference, this dissertation analyzes a domain of coreference-heavy text, namely questions from the trivia game quiz bowl, and creates a novel dataset. Solving quiz bowl questions requires robust coreference resolution and world knowledge, something humans possess but current models do not. This work uses distributional semantics for world knowledge. It also addresses sub-problems of coreference such as mention detection.
Next, to investigate complex visual representations of concepts, this dissertation uses the domain of paintings. Mapping spans of text in descriptions of paintings to the regions of the paintings being described is a non-trivial problem because paintings are substantially harder than natural images. Distributional semantics are again used here. Finally, to discover prototypical concepts present in distributed rather than contiguous spans of text, this dissertation investigates a source rich in prototypical concepts, namely movie scripts. Movie narratives, character arcs, and character relationships are distilled into sequences of interconnected prototypical concepts, discovered using unsupervised deep learning models, again with distributional semantics. I conclude this dissertation by discussing potential future research in downstream tasks that can be aided by the discovery of referring multimodal concepts.