Sound Sensing, Enhancement, and Separation with Millimeter Wave Radio


Ozturk_umd_0117E_22905.pdf (8.7 MB)

Sound, as one of the most natural means of human communication, has become a ubiquitous modality for human-machine-environment interactions. Although microphones enable many environmental sensing capabilities, sound sensing systems have limitations, such as weak source separation when multiple speakers are present, vulnerability to replay attacks, and reduced performance under interference and noise. Meanwhile, thanks to the availability of next-generation communication systems and miniaturized radars, mmWave has become an emerging sensing modality in recent years. Mobile phones and smart hubs now include mmWave radars for environment sensing. To extend the sensing capabilities of these devices and overcome the limitations of microphones, we explore sound sensing and its applications with mmWave radars.

In this dissertation, we first explore how, and to what extent, ambient sound and sound-induced vibration can be sensed with mmWave radios. We establish the fundamentals of sensing sound from ambient objects, such as a piece of aluminum foil, and from active speaker surfaces. We show that, unlike microphones, which sense sound at the sensor location, radars can sense sound remotely (e.g. from the environment) and robustly. We conduct a variety of experiments to understand the limitations of sound sensing from passive objects. Building on these fundamentals, we propose RadioMic, a system that robustly detects and localizes the source of a sound and enhances the noisy radar signals via deep learning methods. Extensive experiments show that RadioMic outperforms existing work and enables sound sensing in challenging conditions, such as through walls and through soundproof objects. Furthermore, RadioMic can extract individual sound streams when multiple sources are present. Last, we illustrate how RadioMic can detect whether a sound source is live or inanimate, mitigating the vulnerability of microphones to replay attacks.
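
As a rough illustration of why a radar can pick up sound-induced vibration at all, the sketch below converts the phase of complex radar samples from a single range bin into surface displacement, using the standard round-trip relation in which a displacement d changes the phase by 4πd/λ. It is not taken from the dissertation: the 77 GHz carrier, the 8 kHz sample rate, the 440 Hz tone, the sign convention, and the function name `displacement_from_phase` are all assumptions chosen for the example.

```python
import numpy as np

C = 3e8  # speed of light in m/s


def displacement_from_phase(iq, carrier_hz):
    """Map complex samples from one range bin to zero-mean displacement in meters."""
    wavelength = C / carrier_hz
    # Unwrap so that vibrations larger than a fraction of a wavelength stay continuous.
    phase = np.unwrap(np.angle(iq))
    # Round trip: a displacement d shifts the phase by 4*pi*d / wavelength.
    disp = phase * wavelength / (4 * np.pi)
    return disp - disp.mean()


# Synthetic check with assumed parameters (not from the source).
fs = 8000.0                      # slow-time sample rate, Hz (assumption)
carrier_hz = 77e9                # carrier frequency, Hz (assumption)
t = np.arange(800) / fs
true_disp = 5e-6 * np.sin(2 * np.pi * 440 * t)   # 5 micrometer vibration at 440 Hz
iq = np.exp(1j * 4 * np.pi * true_disp / (C / carrier_hz))

est = displacement_from_phase(iq, carrier_hz)
max_err = np.max(np.abs(est - (true_disp - true_disp.mean())))
```

In this noiseless setting the recovered waveform matches the simulated vibration to numerical precision; real measurements would also contain noise, clutter, and multipath, which is where the enhancement stage described above comes in.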

Next, we investigate another limitation of microphone-based sensing: its susceptibility to interference and noise. Microphones usually have weak source separation capabilities, and recent deep-learning-based approaches do not perform well under challenging conditions. Furthermore, monaural speech separation has additional limitations, such as the source association problem and the need to estimate the number of speakers. We build RadioSES, a system that uses the complementary radio modality to mitigate these fundamental drawbacks of microphones in speech enhancement and separation. Our extensive experiments indicate that RadioSES solves the source association and tracking problems robustly and improves speech enhancement and separation performance by 3 to 6 dB SiSDR (scale-invariant signal-to-distortion ratio) compared to the audio-only baseline. Furthermore, RadioSES works in the dark and through occlusions, and is preferable to the video modality because it raises fewer privacy concerns and is computationally more efficient.
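
For readers unfamiliar with the SiSDR figure quoted above, the following is a minimal sketch of the commonly used scale-invariant signal-to-distortion ratio. It follows the usual textbook definition rather than any evaluation code from the dissertation; the function name `si_sdr`, the `eps` guard, and the synthetic signals are assumptions for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    return 10 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    )


# Synthetic example (not data from the source): the same signal at two noise levels.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = reference + 0.5 * noise
cleaner = reference + 0.1 * noise

score_noisy = si_sdr(noisy, reference)
score_cleaner = si_sdr(cleaner, reference)
score_rescaled = si_sdr(2.0 * noisy, reference)
```

Because the target is rescaled before the ratio is taken, multiplying the estimate by a constant leaves the score unchanged, so an improvement reported in dB SiSDR reflects reduced distortion rather than a change in output gain.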

Last, we study the voice activity detection (VAD) problem using the radio modality. VAD is an integral part of smart speakers and voice transmission systems. A high-performance, automated VAD is of utmost importance, especially when user intervention is limited, such as while driving a car. When the application requires focusing on a particular user, existing audio-based methods perform poorly because interfering speakers or severe background noise create false alarms. We present RadioVAD, a radio-based VAD system that is robust against interference and noise. Our evaluation indicates that RadioVAD can match the performance of audio-based VAD at a much lower computational complexity, and can outperform existing approaches. Furthermore, we present case studies to better understand the tradeoff between audio and radio SNRs, and we examine the false alarm rate, precision, recall, and detection delay.
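
The evaluation quantities named above can be made concrete with a small frame-level sketch. The definitions below are the conventional ones and may differ from those used in the dissertation; the helper names `vad_metrics` and `onset_delay_frames` and the toy label sequences are hypothetical.

```python
import numpy as np


def vad_metrics(pred, truth):
    """Frame-level precision, recall, and false alarm rate for binary VAD decisions."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)     # speech frames detected as speech
    fp = np.sum(pred & ~truth)    # non-speech frames detected as speech
    fn = np.sum(~pred & truth)    # speech frames that were missed
    tn = np.sum(~pred & ~truth)   # non-speech frames correctly rejected
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def onset_delay_frames(pred, truth):
    """Frames between the first true speech frame and the first detection at or after it."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if not truth.any():
        return None
    onset = int(np.argmax(truth))
    hits = np.flatnonzero(pred[onset:])
    return int(hits[0]) if hits.size else None


# Toy labels (hypothetical): one early false alarm and one late onset.
truth = [0, 0, 1, 1, 1, 1, 0, 0]
pred = [0, 1, 0, 1, 1, 1, 0, 0]

metrics = vad_metrics(pred, truth)
delay = onset_delay_frames(pred, truth)
```

Reporting the false alarm rate and the onset delay alongside precision and recall matters for the interference scenario described above, since a detector can score well on recall while still triggering on other speakers or reacting late.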