DEEP LEARNING FOR SCENE PERCEPTION AND UNDERSTANDING
The ability to accurately perceive objects and capture motion information from the environment is crucial in many real-world applications, including autonomous driving, augmented reality, and robotics. This dissertation addresses fundamental challenges in scene perception, scene understanding, and learning-based autonomous systems.
We first address the problem of developing a good representation of 3D sensor data for scene perception tasks. Robotic agents must develop a thorough understanding of their environment, including accurately perceiving objects and understanding the motion of dynamic objects. We investigate this problem through several computer vision tasks over a variety of input data. Compared with images, 3D point clouds provide reliable depth and precise geometric information; however, they are generally sparse, with varying densities. To handle these challenges, we present several methods for efficient object detection and motion learning on large-scale LiDAR point cloud data.

In the first part, we consider 3D point cloud density, a characteristic not well explored for the task of 3D object detection. Our proposed InfoFocus method improves detection by adaptively refining features, guided by point cloud density, in an end-to-end manner. In the second part, inspired by the success of transformer-based architectures across computer vision tasks, we present M3DETR, which unifies multiple point cloud representations and feature scales while simultaneously modeling mutual relationships among point clouds with transformers for 3D object detection. We also consider the problem of understanding dynamic 3D environments and identifying object motion, which is critical for 3D perception. In the third part, we focus on a temporal sequence of 3D point clouds to extract point-wise motion information. Specifically, we propose PointMotionNet, a point-based spatiotemporal pyramid architecture that handles multiple frames and large-scale scenes, avoids discretization, and learns explicitly from the temporal ordering.
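To make the density-guided idea concrete, the following is a minimal toy sketch (not the actual InfoFocus implementation; the function names and the simple reweighting scheme are illustrative assumptions): per-point features are reweighted by a normalized estimate of local point density, so that downstream processing can account for the varying densities of LiDAR point clouds.

```python
import numpy as np

def local_density(points, radius=1.0):
    """Brute-force neighbor count within `radius` for each point (N, 3)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return (dists < radius).sum(axis=1) - 1  # exclude the point itself

def density_weighted_features(points, features, radius=1.0):
    """Toy density-guided refinement: scale each point's feature vector
    by its normalized local density. Illustrative only; InfoFocus learns
    this refinement end-to-end rather than using a fixed rule."""
    d = local_density(points, radius).astype(float)
    w = d / (d.max() + 1e-8)          # normalize to [0, 1]
    return features * w[:, None]      # broadcast weight over feature dims
```

A real detector would replace the fixed weighting rule with learned, density-conditioned attention, but the sketch shows where density information enters the feature pipeline.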
We note that a deeper, holistic understanding of the environment is essential for safely navigating complex traffic scenarios. Beyond accurately classifying and locating objects and predicting their behaviors, an autonomous system must understand the rules of the road, such as spotting traffic signals or temporary road signs. The long-term goal is to build a perception system that can reason about the environment and adaptively make plans under uncertainty in real time. To reason and make real-time adjustments, the system needs to be able to develop a good understanding of road-sign information. Here we address the task of Text-VQA, which aims to answer questions that require understanding the textual cues in an image. In the fourth part of the thesis, we develop a method to generate high-quality, rich question-answer (QA) pairs by explicitly utilizing the rich text already available in the scene context of the input image. The proposed architecture, TAG, exploits underexplored scene text information and enhances the scene understanding of Text-VQA models by producing meaningful and accurate QA samples with a multimodal transformer. This method can potentially be applied to identify challenging traffic situations that autonomous vehicles encounter on the road, such as traffic signs (stop/speed limit), one-way streets, or evolving conditions including road closures and construction zones.
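To illustrate the shape of scene-text-aware QA generation, here is a deliberately simple template-based sketch. It is a hypothetical stand-in, not the TAG model: TAG generates QA pairs with a multimodal transformer over visual and scene-text features, whereas this toy version only shows the input/output contract (OCR tokens in, QA pairs out). The token dictionary keys are assumptions for the example.

```python
def generate_qa_pairs(ocr_tokens):
    """Toy QA generation from OCR'd scene text.

    Each token is assumed to look like {"text": "45", "object": "sign"}.
    Real Text-VQA generation conditions on the full image, not templates.
    """
    qa_pairs = []
    for tok in ocr_tokens:
        text = tok["text"]
        if text.isdigit():
            # Numeric scene text, e.g. a speed limit value.
            qa_pairs.append((f"What number appears on the {tok['object']}?", text))
        else:
            qa_pairs.append((f"What does the {tok['object']} say?", text))
    return qa_pairs
```

For example, OCR tokens for a stop sign and a speed-limit sign would yield the pairs ("What does the sign say?", "STOP") and ("What number appears on the sign?", "45").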