Efficient learning-based sound propagation for virtual and real-world audio processing applications

dc.contributor.advisor: Manocha, Dinesh
dc.contributor.author: Ratnarajah, Anton Jeran
dc.contributor.department: Electrical Engineering
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2025-01-25T06:43:25Z
dc.date.available: 2025-01-25T06:43:25Z
dc.date.issued: 2024
dc.description.abstract: Sound propagation is the process by which sound energy travels through a medium, such as air, into the surrounding environment as sound waves. The room impulse response (RIR) describes this process and is influenced by the positions of the source and listener, the room's geometry, and its materials. Physics-based acoustic simulators have been used for decades to compute accurate RIRs for specific acoustic environments. However, existing acoustic simulators have limitations: for example, they require a 3D representation of the environment and detailed knowledge of its materials. To address these limitations, we propose three novel solutions. First, we introduce a learning-based RIR generator that is two orders of magnitude faster than an interactive ray-tracing simulator. Our approach can be trained to take both statistical and traditional acoustic parameters directly as input, and it can generate both monaural and binaural RIRs for both reconstructed and synthetic 3D scenes. Our generated RIRs outperform those of interactive ray-tracing simulators in speech-processing applications, including Automatic Speech Recognition (ASR), Speech Enhancement, and Speech Separation, by 2.5%, 12%, and 48%, respectively. Second, we propose estimating RIRs from reverberant speech signals and visual cues when no 3D representation of the environment is available. By estimating RIRs from reverberant speech, we can augment training data to match test data, improving the word error rate of the ASR system. Our estimated RIRs achieve a 6.9% improvement over previous learning-based RIR estimators in real-world far-field ASR tasks. We demonstrate that our audio-visual RIR estimator benefits tasks such as visual acoustic matching, novel-view acoustic synthesis, and voice dubbing, as validated through perceptual evaluation. Finally, we introduce IR-GAN, which generates accurate RIRs from real RIRs for data augmentation. IR-GAN parametrically controls acoustic parameters learned from real RIRs to generate new RIRs that imitate different acoustic environments, outperforming ray-tracing simulators on the Kaldi far-field ASR benchmark by 8.95%.
dc.identifier: https://doi.org/10.13016/lolv-engl
dc.identifier.uri: http://hdl.handle.net/1903/33611
dc.language.iso: en
dc.subject.pqcontrolled: Acoustics
dc.subject.pqcontrolled: Computer science
dc.subject.pqcontrolled: Artificial intelligence
dc.subject.pquncontrolled: Generative Adversarial Network
dc.subject.pquncontrolled: Multimodal
dc.subject.pquncontrolled: Room Acoustics
dc.subject.pquncontrolled: Room Impulse Response
dc.subject.pquncontrolled: Sound Propagation
dc.subject.pquncontrolled: Speech Processing
dc.title: Efficient learning-based sound propagation for virtual and real-world audio processing applications
dc.type: Dissertation
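
The abstract above repeatedly relies on one core operation: convolving speech with an RIR (simulated, generated, or estimated) to synthesize reverberant audio for ASR data augmentation. The following is a minimal Python sketch of that step only, not the dissertation's actual pipeline; the helper name make_reverberant and the synthetic signals are illustrative assumptions.

import numpy as np
from scipy.signal import fftconvolve

def make_reverberant(clean_speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve anechoic speech with a room impulse response (RIR).

    The result approximates how the utterance would sound in the room
    the RIR characterizes; this is how RIRs are used to make ASR
    training data match reverberant test conditions.
    """
    # Hypothetical helper, not from the dissertation.
    reverberant = fftconvolve(clean_speech, rir, mode="full")[: len(clean_speech)]
    # Peak-normalize to avoid clipping after convolution.
    peak = np.max(np.abs(reverberant))
    return reverberant / peak if peak > 0 else reverberant

# Toy example: 1 s of noise as "speech" and an exponentially decaying RIR.
sr = 16000
speech = np.random.randn(sr).astype(np.float32)
decay = np.exp(-np.linspace(0.0, 8.0, sr // 4))
rir = (np.random.randn(sr // 4) * decay).astype(np.float32)
augmented = make_reverberant(speech, rir)

fftconvolve is used rather than direct convolution because RIRs are typically thousands of samples long, and the output is trimmed to the input length so augmented and clean utterances stay aligned for training.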

Files

Original bundle

Name: Ratnarajah_umd_0117E_24716.pdf
Size: 42.95 MB
Format: Adobe Portable Document Format