Electrical & Computer Engineering
Permanent URI for this community: http://hdl.handle.net/1903/2234
Search Results (4 results)
Item: Deep-Learning Based Image Analysis on Resource-Constrained Systems (2021)
Lee, Eung Joo; Bhattacharyya, Shuvra S; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In recent years, deep learning has led to high-end performance on a very wide variety of computer vision tasks. Among different types of deep neural networks, convolutional neural networks (CNNs) are extensively studied and utilized for image analysis, as CNNs can effectively capture spatial and temporal dependencies in images. The growth in the amount of annotated image data and improvements in graphics processing units have contributed to the rapid rise in popularity of CNN-based image analysis systems. This growth in turn motivates the application of CNN-based deep learning to increasingly complex tasks, including a growing variety of applications at the network edge. Applying deep CNNs to novel edge applications involves two major challenges. First, in many of the emerging edge-based application areas, there is a lack of sufficient training data or an uneven class balance within the datasets. Second, stringent implementation constraints, including constraints on real-time performance, memory requirements, and energy consumption, must be satisfied to enable practical deployment.

In this thesis, we address these challenges in developing deep-CNN-based image analysis systems for deployment on resource-constrained devices at the network edge. To tackle the challenges for medical image analysis, we first propose a methodology and tool for semi-automated training dataset generation in support of robust segmentation. The framework is developed to provide robust segmentation of surgical instruments using deep learning. We then address the problem of training dataset generation for real-time object tracking using a weakly supervised learning method. In particular, we present a weakly supervised method for surgical tool tracking based on a class of hybrid sensor systems that combines electromagnetic (EM) and vision-based modalities. Furthermore, we present a new framework for assessing the quality of nonrigid multimodality image registration in real time. With the augmented dataset, we construct a solution in which various registration quality metrics are integrated to form a single binary assessment of image registration effectiveness as either high quality or low quality.

To address challenges in practical deployment, we present a deep-learning-based hyperspectral image (HSI) classification method designed for deployment on resource-constrained devices at the network edge. Due to the large volumes of data produced by HSI sensors and the complexity of deep neural network (DNN) architectures, developing DNN solutions for HSI classification on resource-constrained platforms is a challenging problem. In this part of the thesis, we introduce a novel approach that integrates DNN-based image analysis with discrete cosine transform (DCT) analysis for HSI classification.
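The abstract describes integrating DCT analysis with a DNN for HSI classification but gives no implementation details; the following is only a minimal sketch of that general idea, in which the spectral dimensions, the network size, and the number of retained DCT coefficients are illustrative assumptions rather than the author's design.

```python
# Illustrative sketch only: combine a DCT spectral transform with a small
# classifier for per-pixel hyperspectral image (HSI) classification.
# Shapes, layer sizes, and the number of retained coefficients are assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct

def dct_spectral_features(cube, n_coeffs=32):
    """cube: (H, W, B) hyperspectral cube -> (H*W, n_coeffs) DCT features."""
    H, W, B = cube.shape
    spectra = cube.reshape(-1, B)                        # one spectrum per pixel
    coeffs = dct(spectra, type=2, norm="ortho", axis=1)  # DCT along the spectral axis
    return coeffs[:, :n_coeffs]                          # keep low-frequency coefficients

class TinyHSIClassifier(nn.Module):
    """A deliberately small network, in the spirit of resource-constrained targets."""
    def __init__(self, n_coeffs=32, n_classes=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_coeffs, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )
    def forward(self, x):
        return self.net(x)

# Toy usage with random data standing in for a real HSI cube.
cube = np.random.rand(8, 8, 200).astype(np.float32)     # 200 spectral bands
feats = torch.from_numpy(dct_spectral_features(cube).astype(np.float32))
model = TinyHSIClassifier()
logits = model(feats)                                    # (64, 16) class scores per pixel
```

Truncating the DCT coefficients along the spectral axis is one plausible way to shrink the classifier input, which is the kind of saving that matters on edge devices; the thesis may use a different combination of DCT and DNN processing.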
In addition to medical image processing and HSI classification, a third application area that we investigate in this thesis is on-board object detection from Unmanned Aerial Vehicles (UAVs), which represents another important domain of interest for the edge-based deployment of CNN methods. In this part of the thesis, we present a novel framework for object detection using images captured from UAVs. The framework is optimized using synthetic datasets that are generated from a game engine to capture imaging scenarios specific to the UAV-based operating environment. Using the generated synthetic dataset, we develop new insight into the impact of different UAV-based imaging conditions on object detection performance.

Item: DEEP LEARNING FOR FASHION AND FORENSICS (2018)
Han, Xintong; Davis, Larry S; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Deep learning is the new electricity, and it has dramatically reshaped people's everyday lives. In this thesis, we focus on two emerging applications of deep learning: fashion and forensics. The ubiquity of online fashion shopping demands effective search and recommendation services for customers. To this end, we first propose an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites. We first fine-tune GoogLeNet by jointly modeling clothing images and their corresponding descriptions in a visual-semantic embedding space. Then, for each attribute (word), we generate its spatially-aware representation by combining its semantic word vector representation with its spatial representation derived from the convolutional maps of the fine-tuned network. The resulting spatially-aware representations are further used to cluster attributes into multiple groups to form spatially-aware concepts (e.g., the neckline concept might consist of attributes such as v-neck, round-neck, etc.). Finally, we decompose the visual-semantic embedding space into multiple concept-specific subspaces, which facilitates structured browsing and attribute-feedback product retrieval by exploiting multimodal linguistic regularities. We conducted extensive experiments on our newly collected Fashion200K dataset, and results on clustering quality evaluation and the attribute-feedback product retrieval task demonstrate the effectiveness of our automatically discovered spatially-aware concepts.

For fashion recommendation, we study two tasks: (i) suggesting an item that matches existing components in a set to form a stylish outfit (a collection of fashion items), and (ii) generating an outfit from multimodal (image/text) specifications provided by a user. To this end, we propose to jointly learn a visual-semantic embedding and the compatibility relationships among fashion items in an end-to-end fashion. More specifically, we consider a fashion outfit to be a sequence (usually from top to bottom and then accessories) and each item in the outfit a time step. Given the fashion items in an outfit, we train a bidirectional LSTM (Bi-LSTM) model to sequentially predict the next item conditioned on previous ones, thereby learning their compatibility relationships. Further, we learn a visual-semantic space by regressing image features to their semantic representations, aiming to inject attribute and category information as a regularization for training the LSTM. The trained network can not only perform the aforementioned recommendations effectively but also predict the compatibility of a given outfit. We conduct extensive experiments on our newly collected Polyvore dataset, and the results provide strong qualitative and quantitative evidence that our framework outperforms alternative methods.
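As a rough illustration of the sequence view of an outfit described above, here is a minimal Bi-LSTM sketch; the feature dimension, hidden size, and the cosine-similarity compatibility score are assumptions made for illustration, not the dissertation's actual model or training objective.

```python
# Minimal sketch (not the dissertation's code) of treating an outfit as a
# sequence of item embeddings and using a bidirectional LSTM to predict each
# next item from the previous ones; dimensions are assumptions.
import torch
import torch.nn as nn

class OutfitBiLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.next_item = nn.Linear(2 * hidden, feat_dim)  # predict the next item's features

    def forward(self, items):                 # items: (batch, seq_len, feat_dim)
        hidden_states, _ = self.lstm(items)
        return self.next_item(hidden_states)  # one prediction per time step

def compatibility_score(model, items):
    """Higher = more compatible: agreement between predicted and actual next items."""
    with torch.no_grad():
        preds = model(items)[:, :-1, :]       # predictions for items 2..N
        targets = items[:, 1:, :]
        return nn.functional.cosine_similarity(preds, targets, dim=-1).mean(dim=1)

# Toy usage: 2 outfits, each with 5 items described by 512-d CNN features.
outfits = torch.randn(2, 5, 512)
model = OutfitBiLSTM()
print(compatibility_score(model, outfits))    # one score per outfit
```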
In addition to search and recommendation, customers also would like to virtually try on fashion items. We present an image-based Virtual Try-On Network (VITON) that does not use 3D information in any form and seamlessly transfers a desired clothing item onto the corresponding region of a person using a coarse-to-fine strategy. Conditioned upon a new clothing-agnostic yet descriptive person representation, our framework first generates a coarse synthesized image with the target clothing item overlaid on that person in the same pose. We further enhance the initially blurry clothing area with a refinement network. The network is trained to learn how much detail to utilize from the target clothing item, and where to apply it to the person, in order to synthesize a photo-realistic image in which the target item deforms naturally with clear visual patterns. Experiments on our newly collected dataset demonstrate its promise on the image-based virtual try-on task over state-of-the-art generative models.

Interestingly, VITON can be modified to swap faces instead of clothing items. Conditioned on the landmarks of a face, generative adversarial networks can synthesize a target identity onto the original face while keeping the original facial expression. We achieve this by introducing an identity-preserving loss together with a perceptually-aware discriminator. The identity-preserving loss keeps the synthesized face consistent with the identity of the target, while the perceptually-aware discriminator ensures that the generated face looks realistic. It is worth noting that these face-swap techniques can easily be used to manipulate people's faces and might cause serious social and political consequences. Researchers have developed powerful tools to detect these manipulations.

In this dissertation, we utilize convolutional neural networks to boost the detection accuracy of tampered faces or people in images. First, a two-stream network is proposed to determine if a face has been tampered with. We train a GoogLeNet to detect tampering artifacts in a face classification stream, and train a patch-based triplet network that leverages features capturing local noise residuals and camera characteristics as a second stream. In addition, we use two different online face-swapping applications to create a new dataset that consists of 2010 tampered images, each of which contains a tampered face. We evaluate the proposed two-stream network on our newly collected dataset, and experimental results demonstrate the effectiveness of our method. Further, spliced people are also very common in image manipulation. We describe a tampering detection system containing multiple modules, each modeling a different aspect of tampering traces. The system first detects faces in an image. Then, for each detected face, it enlarges the bounding box to include a portrait image of that person. Three models are fused to detect whether this person (portrait) has been tampered with: (i) PortraitNet, a binary classifier fine-tuned from an ImageNet pre-trained GoogLeNet; (ii) SegNet, a U-Net that predicts tampered masks and boundaries, followed by a LeNet that classifies whether the predicted masks and boundaries indicate tampering; and (iii) EdgeNet, a U-Net that predicts the edge mask of each portrait, with the extracted portrait edges fed into a GoogLeNet for tampering classification. Experiments show that these three models are complementary and can be fused to effectively detect a spliced portrait image.
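The abstract names the three detectors (PortraitNet, SegNet, EdgeNet) but not the fusion rule; the sketch below simply averages per-model tampering probabilities as one hedged stand-in for such a late-fusion step.

```python
# Illustrative late-fusion sketch for three portrait tampering detectors.
# The dissertation does not specify this exact fusion rule; a weighted average
# of model probabilities is an assumption made here for illustration.
import numpy as np

def fuse_tampering_scores(p_portrait, p_seg, p_edge,
                          weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Each p_* is one model's probability that the portrait is tampered, in [0, 1]."""
    w = np.asarray(weights, dtype=float)
    scores = np.asarray([p_portrait, p_seg, p_edge], dtype=float)
    fused = float(np.dot(w, scores) / w.sum())   # weighted average of model probabilities
    return fused, fused >= threshold             # (fused score, tampered decision)

score, is_tampered = fuse_tampering_scores(0.82, 0.64, 0.41)
print(score, is_tampered)
```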
Item: COMPUTER VISION AND DEEP LEARNING WITH APPLICATIONS TO OBJECT DETECTION, SEGMENTATION, AND DOCUMENT ANALYSIS (2017)
Du, Xianzhi; Davis, Larry; Doermann, David; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This dissertation presents three works on signature matching for document analysis. In the first work, we propose a large-scale signature matching method based on locality sensitive hashing (LSH). Shape Context features are used to describe the structure of signatures, and two stages of hashing are performed to find the nearest neighbors for query signatures. We show that our algorithm can achieve high accuracy even when few signatures are collected from the same person, and that it performs fast matching on large datasets. In the second work, we present a novel signature matching method based on supervised topic models. Shape Context features are extracted from signature shape contours, capturing the local variations in signature properties. We then use the concept of topic models to learn the shape context features that correspond to individual authors, and we demonstrate considerable improvement over state-of-the-art methods. In the third work, we present a partial signature matching method using graphical models. Extending the second work, modified shape context features are extracted from the contour of signatures to describe both full and partial signatures. Hierarchical Dirichlet processes are used to infer the number of salient regions needed. The results show the effectiveness of the approach for both partial and full signature matching.

The dissertation also presents three works on deep learning for object detection and segmentation. In the first work, we propose a deep neural network fusion architecture for fast and robust pedestrian detection. The proposed network fusion architecture allows for parallel processing of multiple networks for speed. A single-shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusions. Next, multiple deep neural networks are used in parallel to further refine these pedestrian candidates. We introduce a soft-rejection based network fusion method to fuse the soft metrics from all networks into final confidence scores. Our method performs better than existing state-of-the-art approaches, especially when detecting small and occluded pedestrians. Furthermore, we propose a method for integrating a pixel-wise semantic segmentation network into the network fusion architecture as a reinforcement to the pedestrian detector. In the second work, building on the first, a fusion network is trained to fuse the multiple classification networks. Furthermore, a novel soft-label method is devised to assign floating-point labels to the pedestrian candidates; this metric for each candidate detection is derived from the percentage of overlap of its bounding box with those of the ground truth classes.
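The soft-label idea above, floating-point labels derived from bounding-box overlap, can be sketched as follows; the overlap measure (IoU) and the per-class dictionary layout are assumptions made for illustration, not the dissertation's exact definition.

```python
# Sketch (assumptions, not the thesis implementation) of assigning a
# floating-point "soft label" to a candidate detection based on how much its
# bounding box overlaps ground-truth boxes of each class.

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def soft_label(candidate, gt_boxes_by_class):
    """Per-class label: the best overlap of the candidate with that class's boxes."""
    return {cls: max((iou(candidate, gt) for gt in boxes), default=0.0)
            for cls, boxes in gt_boxes_by_class.items()}

# Toy usage: one candidate box against one pedestrian ground-truth box.
gt = {"pedestrian": [(10, 10, 60, 120)], "background": []}
print(soft_label((15, 20, 65, 118), gt))
```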
In the third work, we propose a boundary-sensitive deep neural network architecture for portrait segmentation. A framework based on a residual network and atrous convolution is trained as the base portrait segmentation network. To better handle boundary segmentation, three techniques are introduced. First, an individual boundary-sensitive kernel is introduced by labeling the boundary pixels as a separate class and using the soft-label strategy to assign floating-point label vectors to pixels in the boundary class; each pixel then contributes to multiple classes in the loss, based on its position relative to the contour. Second, a global boundary-sensitive kernel is used in the loss function to assign different weights to pixel locations within an image, constraining the global shape of the resulting segmentation map. Third, we add multiple binary classifiers that classify boundary-sensitive portrait attributes, so as to refine the learning process of our model.

Item: GEOMETRIC REPRESENTATIONS AND DEEP GAUSSIAN CONDITIONAL RANDOM FIELD NETWORKS FOR COMPUTER VISION (2016)
Vemulapalli, Raviteja; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Representation and context modeling are two factors critical to the design of computer vision algorithms. For example, in applications such as skeleton-based human action recognition, representations that capture the 3D skeletal geometry are crucial for achieving good action recognition accuracy. However, most existing approaches focus mainly on the temporal modeling and classification steps of the action recognition pipeline rather than on representations. Similarly, in applications such as image enhancement and semantic image segmentation, modeling the spatial context is important for achieving good performance, yet the standard deep network architectures used for these applications do not explicitly model it. In this dissertation, we focus on representation and context modeling for several computer vision problems and make novel contributions by proposing new 3D geometry-based representations for recognizing human actions from skeletal sequences, and by introducing Gaussian conditional random field (CRF) model-based deep network architectures that explicitly model the spatial context by considering the interactions among the output variables. In addition, we propose a kernel learning-based framework for the classification of manifold features, such as linear subspaces and covariance matrices, which are widely used for image set-based recognition tasks.

This dissertation is divided into five parts. In the first part, we introduce various 3D geometry-based representations for skeleton-based human action recognition. The proposed representations, referred to as R3DG features, capture the relative 3D geometry between various body parts using 3D rigid body transformations. We model human actions as curves in these R3DG feature spaces and perform action recognition using a combination of dynamic time warping, the Fourier temporal pyramid representation, and support vector machines. Experiments on several action recognition datasets show that the proposed representations perform better than many existing skeletal representations.
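To make the idea of relative 3D geometry concrete, the sketch below computes a relative rigid-body transform between two body-part frames; the specific frames, the toy values, and the use of plain 4x4 homogeneous matrices are illustrative assumptions rather than the dissertation's exact R3DG formulation.

```python
# Rough sketch of a relative 3D geometry feature between two body parts:
# each part gets a rigid-body transform (rotation + translation), and the
# feature is the transform of one part expressed in the frame of the other.
import numpy as np

def rigid_transform(rotation_3x3, translation_3):
    """Build a 4x4 homogeneous rigid-body transform."""
    T = np.eye(4)
    T[:3, :3] = rotation_3x3
    T[:3, 3] = translation_3
    return T

def relative_transform(T_i, T_j):
    """Rigid-body transform of part j expressed in the frame of part i."""
    return np.linalg.inv(T_i) @ T_j

# Toy example: two body-part frames with identity rotations and different positions.
T_upper_arm = rigid_transform(np.eye(3), [0.0, 1.4, 0.0])
T_forearm = rigid_transform(np.eye(3), [0.0, 1.1, 0.1])
print(relative_transform(T_upper_arm, T_forearm))  # 4x4 relative pose
```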
In the second part, we represent 3D skeletons using only the relative 3D rotations between various body parts instead of full 3D rigid body transformations. This skeletal representation is scale-invariant and belongs to a Lie group based on the special orthogonal group. We model human actions as curves in this Lie group and map these curves to the corresponding Lie algebra by combining the logarithm map with rolling maps. Using rolling maps reduces the distortions introduced in the action curves while mapping to the Lie algebra. Finally, we perform action recognition by classifying the Lie algebra curves using the Fourier temporal pyramid representation and a support vector machine classifier. Experimental results show that combining the logarithm map with rolling maps yields improved performance compared to using the logarithm map alone.

In the third part, we focus on the classification of manifold features such as linear subspaces and covariance matrices. We present a kernel-based extrinsic framework for the classification of manifold features and address the issue of kernel selection using multiple kernel learning. We introduce two criteria for jointly learning the kernel and the classifier by solving a single optimization problem. In the case of the support vector machine classifier, we formulate the problem of learning a good kernel-classifier combination as a convex optimization problem. The proposed approach performs better than many existing methods for the classification of manifold features when applied to image set-based classification tasks.

In the fourth part, we propose a novel end-to-end trainable deep network architecture for image denoising based on a Gaussian CRF model. In contrast to existing discriminative denoising approaches, the proposed network explicitly models the input noise variance and is hence capable of handling a range of noise levels. This network consists of two sub-networks: (i) a parameter generation network that generates the Gaussian CRF pairwise potential parameters based on the input image, and (ii) an inference network whose layers perform the computations involved in an iterative Gaussian CRF inference procedure. Experiments on several images show that the proposed approach produces results on par with the state of the art without training a separate network for each noise level.

In the final part of this dissertation, we propose a Gaussian CRF model-based deep network architecture for semantic image segmentation. This network explicitly models the interactions between output variables, which is important for structured prediction tasks such as semantic segmentation. The proposed network is composed of three sub-networks: (i) a Convolutional Neural Network (CNN) based unary network for generating the unary potentials, (ii) a CNN-based pairwise network for generating the pairwise potentials, and (iii) a Gaussian mean field inference network for performing Gaussian CRF inference. When trained end-to-end in a discriminative fashion, the proposed network outperforms various CNN-based semantic segmentation approaches.
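As a hedged illustration of what an iterative Gaussian CRF inference procedure computes, the sketch below runs a Jacobi-style iteration toward the minimizer of a quadratic energy E(x) = 0.5 x^T A x - b^T x; the matrices, the update rule, and the iteration count are assumptions, not the actual layers of the proposed inference networks.

```python
# Sketch only: Gaussian CRF inference amounts to minimizing a quadratic energy
# E(x) = 0.5 * x^T A x - b^T x, i.e. solving A x = b. A Jacobi-style iteration
# is one simple stand-in for such an iterative inference procedure.
import numpy as np

def gaussian_crf_inference(A, b, n_iters=50):
    """Iteratively approach the solution of A x = b (the energy minimizer)."""
    x = np.zeros_like(b)
    D = np.diag(A)                    # diagonal (unary-dominant) terms
    R = A - np.diagflat(D)            # off-diagonal (pairwise) interactions
    for _ in range(n_iters):
        x = (b - R @ x) / D           # one Jacobi-style update step
    return x

# Toy example with a diagonally dominant matrix (guarantees convergence).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])
print(gaussian_crf_inference(A, b))   # close to np.linalg.solve(A, b)
```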