A. James Clark School of Engineering
Permanent URI for this community: http://hdl.handle.net/1903/1654
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
Search Results (7 items)
Item: TOWARDS EFFICIENT OCEANIC ROBOT LEARNING WITH SIMULATION (2024)
LIN, Xiaomin; Aloimonos, Yiannis; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In this dissertation, I explore the intersection of machine learning, perception, and simulation-based techniques to enhance the efficiency of underwater robotics, with a focus on oceanic tasks. My research begins with marine object detection using aerial imagery. From there, I address oyster detection using Oysternet, which leverages simulated data and Generative Adversarial Networks for sim-to-real transfer, significantly improving detection accuracy. Next, I present an oyster detection system that integrates diffusion-enhanced synthetic data with the Aqua2 biomimetic hexapedal robot, enabling real-time, on-edge detection in underwater environments. With detection models deployed locally, this system facilitates autonomous exploration. To enhance this capability, I introduce an underwater navigation framework that employs imitation learning, enabling the robot to efficiently navigate over objects of interest, such as rock and oyster reefs, without relying on localization. This approach improves information gathering while ensuring obstacle avoidance. Given that oyster habitats are often in shallow waters, I incorporate a deep learning model for real/virtual image segmentation, allowing the robot to differentiate between actual objects and water-surface reflections, ensuring safe navigation. I expand on broader applications of these techniques, including olive detection for yield estimation and industrial object counting for warehouse management, using simulated imagery. In the final chapters, I address unresolved challenges, such as RGB/sonar data integration, and propose directions for future research to further enhance underwater robotic learning through digital simulation. Through these studies, I demonstrate how machine learning models and digital simulations can be used synergistically to address key challenges in underwater robotic tasks. Ultimately, this work advances the capabilities of autonomous systems to monitor and preserve marine ecosystems through efficient and robust digital simulation-based learning.

Item: INDOOR TARGET SEARCH, DETECTION, AND INSPECTION WITH AN AUTONOMOUS DRONE (2024)
Ashry, Ahmed; Paley, Derek; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis investigates the deployment of unmanned aerial vehicles (UAVs) in indoor search and rescue (SAR) operations, focusing on enhancing autonomy through the development and integration of advanced technological solutions. The research addresses challenges related to autonomous navigation and target inspection in indoor environments. A key contribution is the development of an autonomous inspection routine that allows UAVs to navigate to and meticulously inspect targets identified by fiducial markers, replacing manual piloted inspection. To enhance the system's target recognition, a custom-trained object detection model identifies critical markers on targets, operating in real time on the UAV's onboard computer. Additionally, the thesis introduces a comprehensive mission framework that manages transitions between coverage and inspection phases, experimentally validated using a quadrotor equipped with onboard sensing and computing across various scenarios.
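The thesis does not name the marker family or detection library; as a purely illustrative sketch, the snippet below detects ArUco fiducial markers with OpenCV (opencv-contrib-python, 4.7+ API), the kind of building block an inspection routine like the one described could trigger on.

```python
# Hypothetical sketch: fiducial-marker detection with OpenCV's ArUco module.
# The marker family (DICT_4X4_50) and the library choice are assumptions made
# for illustration; the thesis's actual detector is custom-trained.
import cv2

def detect_fiducials(frame_bgr):
    """Return (corners, ids) for any ArUco markers visible in the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
    corners, ids, _rejected = detector.detectMarkers(gray)
    return corners, ids

# Example use inside a mission loop: switch from coverage to inspection
# whenever a marker comes into view.
# ok, frame = camera.read()                  # hypothetical camera handle
# corners, ids = detect_fiducials(frame)
# if ids is not None:
#     begin_inspection(ids.ravel().tolist()) # hypothetical mission hook
```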
The research also explores the integration and critical analysis of state-of-the-art path-planning algorithms, enhancing UAV autonomy in cluttered settings. This is supported by evaluations conducted through software-in-the-loop simulations, setting the stage for future real-world applications.

Item: DEEP LEARNING ENSEMBLES FOR LIGHTWEIGHT OBJECT DETECTION (2023)
Mattingly, Alexander Singfei; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Object detection, the task of identifying and localizing important objects within an image frame, is a critical task in automation, surveillance, and safety applications. Further, developments in lightweight sensor technologies, improved small-scale computing, and the widespread accessibility of well-labeled data have enabled numerous applications for object detection on inexpensive or low-power hardware. Many applications, such as self-driving and unmanned aerial vehicles, must process sensor data as it arrives (in real time) using onboard hardware (at the edge) in order to continually inform systems such as navigation. Additionally, detection must often be achieved on platforms with limited Size, Weight, and Power (SWaP), since advanced computer hardware may not be possible to place near the sensor. This presents a unique challenge: how can we best provide accurate real-time object detection on limited-SWaP systems while maintaining low power and computational cost?

A widespread approach for detection is deep learning. An object detection network is trained on a labeled dataset of images containing known objects and their locations. After training, the network may be used to infer on new data, providing both bounding boxes and class identifiers for each box. Popular single-shot detectors have been demonstrated to achieve real-time performance on some systems while having acceptable detection accuracy.

An ensemble is a system composed of several detectors. In theory, detectors with architectural differences, ones trained on different data, or detectors given different augmented data at inference time will discover and detect different features of an image. Unifying the results of several different detectors has been demonstrated to improve the detection performance of the ensemble compared to the performance of any component network, at the expense of additional computational cost. Further, systems using an ensemble of detectors have been shown to be good solutions to object detection problems in limited-SWaP applications such as surveillance and search-and-rescue.

Unlike tasks such as classification, where the output of a network describes the entire input, object detection is concerned with both localization and classification of one or multiple objects in an image. Two different bounding boxes for partially occluded objects may overlap, or highly similar bounding boxes may describe the same object. As a result, unifying the results of object detector networks is far more difficult than unifying classifier networks. Current works typically accomplish this by applying strategies that iteratively combine bounding boxes by overlap. However, little comparative study has been done to determine the effectiveness of these approaches. This thesis builds on current methods of ensembling object detector networks using novel approaches to combine bounding boxes.
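A minimal sketch of the overlap-based combination idea the thesis surveys, assuming score-weighted averaging of IoU-overlapping boxes pooled from all ensemble members (a simplified cousin of NMS and weighted box fusion, not the thesis's novel method):

```python
# Toy IoU-based fusion of ensemble detections. Boxes are (x1, y1, x2, y2);
# detections are (box, score) pairs pooled from every detector in the ensemble.
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def fuse_detections(detections, iou_thresh=0.5):
    """Greedily cluster overlapping boxes and average each cluster."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    fused = []
    while detections:
        seed = detections.pop(0)           # highest-scoring remaining box
        cluster, rest = [seed], []
        for d in detections:
            (cluster if iou(d[0], seed[0]) >= iou_thresh else rest).append(d)
        detections = rest
        total = sum(s for _, s in cluster)
        box = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
        fused.append((box, max(s for _, s in cluster)))
    return fused

# Two detectors reporting the same object slightly differently:
print(fuse_detections([((10, 10, 50, 50), 0.9), ((12, 11, 52, 49), 0.7)]))
```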
We first introduce current methods for ensembling and a dataflow-based framework for efficient, scalable computation of ensembles of detectors. We then contribute a novel method for ensembling and implement a practical system for scalable detection using an elastic neural network.

Item: Towards in-the-wild visual understanding (2022)
Rambhatla, Sai Saketh; Chellappa, Rama; Shrivastava, Abhinav; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Computer vision research has seen tremendous success in recent times. This success can be attributed to recent breakthroughs in deep learning technology, and such systems have been shown to achieve superhuman performance on several academic datasets. Driven by this success, these systems are actively being deployed in several household and industrial applications, such as robotics. However, current systems perform poorly when deployed in the real world, a.k.a. in-the-wild, as most of the assumptions made during the modeling stage are violated. For example, consider object detectors: they require clean data for training, and they are not effective at detecting or rejecting novel categories not seen in the data. In this thesis, we systematically identify problems that arise in a typical learning setup (the input, the model, and the output) and propose effective solutions to mitigate them.

Item: Sparse and Deep Representations for Face Recognition and Object Detection (2019)
Xu, Hongyu; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Face recognition and object detection are two fundamental visual recognition applications in computer vision. How to learn "good" feature representations using machine learning has become the cornerstone of perception-based systems. A good feature representation is often one that is robust and discriminative across multiple instances of the same category. Starting from features such as intensity and histograms in the image, followed by hand-crafted features, to the most recent sophisticated deep feature representations, we have witnessed remarkable improvement in the ability of feature learning algorithms to perform pattern recognition tasks such as face recognition and object detection. One of the conventional feature learning methods, dictionary learning, has been proposed to learn discriminative and sparse representations for visual recognition. These dictionary learning methods can learn both representative and discriminative dictionaries, and the associated sparse representations are effective for vision tasks such as face recognition. More recently, deep features have been widely adopted by the computer vision community owing to powerful deep neural networks, which are capable of distilling information from high-dimensional input spaces to a low-dimensional semantic space. The research problems that comprise this dissertation lie at the intersection of conventional and deep feature learning approaches. Thus, in this dissertation, we study both sparse and deep representations for face recognition and object detection.

First, we begin by studying the topic of sparse representations. We present a simple thresholded feature learning algorithm under sparse support recovery. We show that, under certain conditions, the thresholded feature exactly recovers the nonzero support of the sparse code.
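The support-recovery claim is easy to visualize numerically. Below is a toy numpy sketch, assuming a random unit-norm dictionary (low mutual coherence with high probability) and well-separated nonzero amplitudes; the dissertation's actual conditions are more precise than this.

```python
# Toy demonstration: for an incoherent dictionary D, the top-k entries of
# |D^T x| (the "thresholded feature") typically coincide with the support of
# the sparse code z that generated x = D z.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 128, 3                        # signal dim, number of atoms, sparsity
D = rng.standard_normal((n, m))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms

support = rng.choice(m, size=k, replace=False)
z = np.zeros(m)
z[support] = rng.uniform(1.0, 2.0, size=k)  # well-separated nonzero amplitudes
x = D @ z

corr = D.T @ x                              # correlate x with every atom
recovered = np.argsort(np.abs(corr))[-k:]   # keep the k largest responses
print(sorted(support.tolist()), sorted(recovered.tolist()))  # typically equal
```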
Secondly, based on these theoretical guarantees, we derive a model and algorithm named Dictionary Learning for Thresholded Features (DLTF) to learn dictionaries that are optimized for the thresholded feature. DLTF dictionaries are specifically designed for using the thresholded feature at inference, prioritizing simplicity, efficiency, general usability, and theoretical guarantees. Both synthetic simulations and real-data experiments (i.e., image clustering and unsupervised hashing) verify the competitive quantitative results and remarkable efficiency of applying thresholded features with DLTF dictionaries.

Continuing our focus on sparse representations and their application to computer vision tasks, we address sparse representations for the unconstrained face verification/recognition problem. In the first part, we address video-based face recognition, which brings additional challenges because videos are often acquired under significant variations in pose, expression, lighting conditions, and background. In order to extract representations that are robust to these variations, we propose a structured dictionary learning framework. Specifically, we employ dictionary learning and low-rank approximation methods to preserve the invariant structure of face images in videos. The learned structured dictionary is both discriminative and reconstructive. We demonstrate the effectiveness of our approach through extensive experiments on three video-based face recognition datasets.

Recently, template-based face verification has gained popularity. Unlike traditional verification tasks, which evaluate on image-to-image or video-to-video pairs, template-based face verification/recognition methods can exploit training and/or gallery data containing a mixture of both images and videos of the person of interest. In the second part, we propose a regularized sparse coding approach for template-based face verification. First, we construct a reference dictionary that represents the training set. Then we learn discriminative sparse codes of the templates for verification through the proposed template-regularized sparse coding approach. Finally, we measure the similarity between templates.

However, in real-world scenarios, training and test data are sampled from different distributions. Therefore, we also extend dictionary learning techniques to tackle the domain adaptation problem, where the data from the training set (source domain) and test set (target domain) have different underlying distributions (domain shift). We propose a domain-adaptive dictionary learning framework that models the domain shift by generating a set of intermediate domains. These intermediate domains bridge the gap between the source and target domains. Specifically, we not only learn a common dictionary to encode domain-shared features but also learn a set of domain-specific dictionaries to model the domain shift. This separation enables us to learn more compact and reconstructive dictionaries for domain adaptation. The domain-adaptive features for recognition are finally derived by aligning all the recovered feature representations of both source and target along the domain path. We evaluate our approach on both cross-domain face recognition and object classification tasks.
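As a rough sketch of the template-based pipeline described above, assuming scikit-learn's SparseCoder with OMP over a random stand-in reference dictionary and plain cosine similarity between averaged codes (the dissertation's regularization and similarity measure are more involved):

```python
# Hedged sketch: compare two face templates by sparse-coding their features
# over a reference dictionary and measuring cosine similarity of the codes.
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(1)
D = rng.standard_normal((256, 512))           # 256 atoms over 512-dim features
D /= np.linalg.norm(D, axis=1, keepdims=True)
coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                    transform_n_nonzero_coefs=10)

def template_code(features):
    """Pool a template (several images/frames) into one sparse code."""
    return coder.transform(features).mean(axis=0)

a = template_code(rng.standard_normal((5, 512)))   # template A: 5 media items
b = template_code(rng.standard_normal((8, 512)))   # template B: 8 media items
sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
print(f"template similarity: {sim:.3f}")
```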
Finally, we study another fundamental problem in computer vision: generic object detection. Object detection has become one of the most valuable pattern recognition tasks, with great benefits for scene understanding, face recognition, action recognition, robotics, self-driving vehicles, and more. We propose a novel object detector named "Deep Regionlets" that blends deep learning with the traditional regionlet method. The proposed framework addresses the limitations of traditional regionlet methods, leading to significant precision improvements by exploiting the power of deep convolutional neural networks. Furthermore, we conduct a detailed analysis of our approach to understand its merits and properties. Extensive experiments on two detection benchmark datasets show that the proposed deep regionlet approach outperforms several state-of-the-art competitors.

Item: FAST-AT: FAST AUTOMATIC THUMBNAIL GENERATION USING DEEP NEURAL NETWORKS (2017)
Esmaeili, Seyed Abdulaziz; Davis, Larry S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Fast-AT is an automatic thumbnail generation system based on deep neural networks. It is a fully convolutional CNN that learns specific filters for thumbnails of different sizes and aspect ratios. During inference, the appropriate filter is selected depending on the dimensions of the target thumbnail. Unlike most previous work, Fast-AT does not utilize saliency but addresses the problem directly. In addition, it eliminates the need to conduct a region search on the saliency map. The model generalizes to thumbnails of different sizes, including those with extreme aspect ratios, and can generate thumbnails in real time. A dataset of more than 70,000 thumbnail annotations was collected to train Fast-AT. We show competitive results in comparison to existing techniques.

Item: Recognizing Objects And Reasoning About Their Interactions (2010)
Kembhavi, Aniruddha; Davis, Larry S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The task of scene understanding involves recognizing the different objects present in the scene, segmenting the scene into meaningful regions, and obtaining a holistic understanding of the activities taking place in the scene. Each of these problems has received considerable interest within the computer vision community. We present contributions to two aspects of visual scene understanding.

First, we explore multiple methods of feature selection for the problem of object detection. We demonstrate the use of Principal Component Analysis to detect avifauna in field observation videos. We improve on existing approaches by making robust decisions based on regional features and by a feature selection strategy that chooses different features in different parts of the image. We then demonstrate the use of Partial Least Squares to detect vehicles in aerial and satellite imagery. We propose two new feature sets: Color Probability Maps capture the color statistics of vehicles and their surroundings, and Pairs of Pixels capture the structural characteristics of objects. A powerful feature selection analysis based on Partial Least Squares is employed to deal with the resulting high-dimensional feature space (almost 70,000 dimensions).
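To make the Partial Least Squares step concrete, here is a minimal scikit-learn sketch with fabricated data standing in for the high-dimensional window features; the component count and decision threshold are illustrative choices, not the thesis's settings.

```python
# Sketch: Partial Least Squares as supervised dimensionality reduction for a
# vehicle/background classifier over very high-dimensional window features.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5000))    # 200 candidate windows, 5000-dim features
y = rng.integers(0, 2, size=200)        # 1 = vehicle, 0 = background

pls = PLSRegression(n_components=10)    # project onto 10 latent components
pls.fit(X, y.astype(float))
scores = pls.predict(X).ravel()         # one regression score per window
pred = (scores > 0.5).astype(int)       # simple threshold for a decision
print(f"training accuracy: {(pred == y).mean():.2f}")
```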
We also propose an Incremental Multiple Kernel Learning (IMKL) scheme to detect vehicles in a traffic surveillance scenario. Obtaining task- and scene-specific datasets of visual categories is far more tedious than obtaining a generic dataset of the same classes. Our IMKL approach initializes on a generic training database and then tunes itself to the classification task at hand.

Second, we develop a video understanding system for scene elements, such as bus stops, crosswalks, and intersections, that are characterized more by qualitative activities and geometry than by intrinsic appearance. The domain models for scene elements are not learned from a corpus of video but are instead naturally elicited from humans and represented as probabilistic logic rules within a Markov Logic Network framework. Human-elicited models, however, represent object interactions as they occur in the 3D world rather than describing their appearance as projected onto some specific 2D image plane. We bridge this gap by recovering qualitative scene geometry to analyze object interactions in the 3D world and then reasoning about scene geometry, occlusions, and common-sense domain knowledge using a set of meta-rules.
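For readers unfamiliar with Markov Logic Networks, the core mechanic is that weighted logical rules score possible worlds: the log-probability of an interpretation is, up to normalization, the total weight of its satisfied rule groundings. Below is a toy single-rule illustration in plain Python, with an invented rule and weight rather than the human-elicited domain models the thesis uses.

```python
# Toy MLN-style inference with one weighted rule:
#   stops_near AND people_board  =>  is_bus_stop   (weight 2.0)
# We score both labelings of the scene element and normalize.
import math

RULE_WEIGHT = 2.0  # invented weight for illustration

def rule_satisfied(world):
    """Material implication: premise false, or conclusion true."""
    premise = world["stops_near"] and world["people_board"]
    return (not premise) or world["is_bus_stop"]

def log_score(world):
    # Unnormalized log-probability: sum of weights of satisfied rules.
    return RULE_WEIGHT if rule_satisfied(world) else 0.0

evidence = {"stops_near": True, "people_board": True}
worlds = [dict(evidence, is_bus_stop=v) for v in (True, False)]
scores = [math.exp(log_score(w)) for w in worlds]
print(f"P(is_bus_stop | evidence) = {scores[0] / sum(scores):.2f}")  # ~0.88
```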