Electrical & Computer Engineering Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2765
Search Results (5 items)
Item: DEEP LEARNING FOR FORENSICS (2020)
Zhou, Peng; Davis, Larry; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The advent of media sharing platforms and the easy availability of advanced photo and video editing software have resulted in a large quantity of manipulated images and videos being shared on the internet. While the intent behind such manipulations varies widely, concern about the spread of fake news and misinformation is growing. Detecting manipulation has therefore become a pressing necessity. Unlike traditional classification, semantic object detection, or segmentation, manipulation detection and classification pays more attention to low-level tampering artifacts than to semantic content. The main challenges in this problem are (a) investigating features that reveal tampering artifacts, (b) developing generic models that are robust to a wide range of post-processing methods, (c) scaling algorithms to high-resolution imagery in real scenarios, and (d) handling newly emerging manipulation techniques. In this dissertation, we propose approaches to tackling these challenges.

Manipulation detection draws on both low-level tampering artifacts and semantic content, suggesting that richer features need to be harnessed to reveal more evidence. To learn rich features, we propose a two-stream Faster R-CNN network and train it end-to-end to detect the tampered regions in a manipulated image. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream and achieves state-of-the-art performance compared to alternative methods, with robustness to resizing and compression.

To extend manipulation detection from images to video, we introduce VIDNet, a Video Inpainting Detection Network, which combines an encoder-decoder architecture with a quad-directional local attention module. To reveal artifacts encoded in compression, VIDNet additionally takes in Error Level Analysis (ELA) frames to augment the RGB frames, producing multimodal features at different levels with the encoder.

Moreover, to improve the generalization of manipulation detection models, we introduce a manipulated image generation process that creates true positives using currently available datasets. Drawing from traditional work on image blending, we propose a novel generator for creating such examples. We also propose to further create examples that force the algorithm to focus on boundary artifacts during training. Extensive experimental results validate our proposal.

Furthermore, to apply deep learning models efficiently to high-resolution scenarios, we treat the problem as mask refinement given a coarse low-resolution prediction. We propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. Extensive experiments on both public datasets and a newly created high-resolution dataset strongly validate our approach.

Finally, to handle newly emerging manipulation techniques while preserving performance on previously learned manipulations, we investigate incremental learning. We propose a multi-model and multi-level knowledge distillation strategy that preserves performance on old categories while training on new ones. Experiments on standard incremental learning benchmarks show that our method improves overall performance over standard distillation techniques.
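To make the Error Level Analysis cue used by VIDNet concrete, the following Python fragment is a minimal, illustrative sketch: an image is recompressed at a fixed JPEG quality and the amplified pixel-wise difference is returned. The quality factor, the rescaling step, and the function name ela_frame are assumptions for exposition, not the dissertation's exact settings.

    import io
    from PIL import Image, ImageChops

    def ela_frame(path, quality=90):
        """Recompress a frame at a fixed JPEG quality and return the
        amplified pixel-wise difference; manipulated regions tend to
        exhibit error levels that differ from the untouched background."""
        original = Image.open(path).convert("RGB")
        # Round-trip through an in-memory JPEG encode/decode.
        buffer = io.BytesIO()
        original.save(buffer, format="JPEG", quality=quality)
        buffer.seek(0)
        recompressed = Image.open(buffer)
        # Per-pixel absolute difference between original and recompressed.
        diff = ImageChops.difference(original, recompressed)
        # Rescale so the strongest error maps to full intensity.
        max_err = max(hi for _, hi in diff.getextrema()) or 1
        return diff.point(lambda v: min(255, v * 255 // max_err))

    # ela = ela_frame("frame_0001.jpg")  # fed to the encoder alongside RGB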
Item: COMPUTER VISION AND DEEP LEARNING WITH APPLICATIONS TO OBJECT DETECTION, SEGMENTATION, AND DOCUMENT ANALYSIS (2017)
Du, Xianzhi; Davis, Larry; Doermann, David; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This dissertation presents three works on signature matching for document analysis. In the first work, we propose a large-scale signature matching method based on locality sensitive hashing (LSH). Shape Context features are used to describe the structure of signatures, and two stages of hashing are performed to find the nearest neighbors for query signatures. We show that our algorithm achieves high accuracy even when only a few signatures are collected from the same person, and that it performs fast matching on large datasets. In the second work, we present a novel signature matching method based on supervised topic models. Shape Context features are extracted from signature shape contours to capture the local variations in signature properties, and topic models are used to learn the shape context features that correspond to individual authors. We demonstrate considerable improvement over state-of-the-art methods. In the third work, we present a partial signature matching method using graphical models. Extending the second work, modified shape context features are extracted from the contours of signatures to describe both full and partial signatures, and hierarchical Dirichlet processes are used to infer the number of salient regions needed. The results show the effectiveness of the approach for both partial and full signature matching.

The dissertation also presents three works on deep learning for object detection and segmentation. In the first work, we propose a deep neural network fusion architecture for fast and robust pedestrian detection. The architecture allows multiple networks to be processed in parallel for speed. A single-shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusions; multiple deep neural networks are then used in parallel to refine these candidates. We introduce a soft-rejection based network fusion method that fuses the soft metrics from all networks into final confidence scores. Our method outperforms existing state-of-the-art methods, especially when detecting small and occluded pedestrians. Furthermore, we propose a method for integrating a pixel-wise semantic segmentation network into the fusion architecture as a reinforcement to the pedestrian detector. The second work extends the first: a fusion network is trained to fuse the multiple classification networks, and a novel soft-label method is devised to assign floating-point labels to the pedestrian candidates. This metric for each candidate detection is derived from the percentage of overlap of its bounding box with those of other ground truth classes. In the third work, we propose a boundary-sensitive deep neural network architecture for portrait segmentation. A residual network and atrous convolution based framework is trained as the base portrait segmentation network. To better handle boundary segmentation, three techniques are introduced. First, an individual boundary-sensitive kernel is introduced by labeling the boundary pixels as a separate class and using the soft-label strategy to assign floating-point label vectors to pixels in the boundary class; each pixel contributes to multiple classes in the loss according to its position relative to the contour. Second, a global boundary-sensitive kernel assigns different loss weights to pixel locations across the image to constrain the global shape of the resulting segmentation map. Third, we add multiple binary classifiers that predict boundary-sensitive portrait attributes, so as to refine the learning process of the model.
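The soft-label idea from the second detection work can be sketched in a few lines of Python. The IoU-based score below, computed against ground-truth boxes of a single class for simplicity, and the function names are illustrative assumptions rather than the exact training recipe.

    def iou(box_a, box_b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1) - inter)
        return inter / union if union > 0 else 0.0

    def soft_label(candidate, ground_truth_boxes):
        """Floating-point training target for a pedestrian candidate:
        its best overlap with any ground-truth box, in place of a
        hard 0/1 label."""
        return max((iou(candidate, gt) for gt in ground_truth_boxes),
                   default=0.0)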
Item: GEOMETRIC REPRESENTATIONS AND DEEP GAUSSIAN CONDITIONAL RANDOM FIELD NETWORKS FOR COMPUTER VISION (2016)
Vemulapalli, Raviteja; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Representation and context modeling are two factors that are critical in the design of computer vision algorithms. For example, in applications such as skeleton-based human action recognition, representations that capture the 3D skeletal geometry are crucial for achieving good recognition accuracy, yet most existing approaches focus mainly on the temporal modeling and classification steps of the pipeline rather than on representations. Similarly, in applications such as image enhancement and semantic image segmentation, modeling the spatial context is important for achieving good performance, yet the standard deep network architectures used for these applications do not model it explicitly. In this dissertation, we focus on representation and context modeling for several computer vision problems. We propose new 3D geometry-based representations for recognizing human actions from skeletal sequences, and we introduce Gaussian conditional random field model-based deep network architectures that explicitly model the spatial context by considering the interactions among the output variables. In addition, we propose a kernel learning-based framework for the classification of manifold features, such as linear subspaces and covariance matrices, which are widely used in image set-based recognition tasks.

This dissertation is divided into five parts. In the first part, we introduce various 3D geometry-based representations for skeleton-based human action recognition. The proposed representations, referred to as R3DG features, capture the relative 3D geometry between various body parts using 3D rigid body transformations. We model human actions as curves in these R3DG feature spaces and perform action recognition using a combination of dynamic time warping, the Fourier temporal pyramid representation, and support vector machines. Experiments on several action recognition datasets show that the proposed representations outperform many existing skeletal representations.

In the second part, we represent 3D skeletons using only the relative 3D rotations between body parts instead of full 3D rigid body transformations. This skeletal representation is scale-invariant and belongs to a Lie group based on the special orthogonal group. We model human actions as curves in this Lie group and map these curves to the corresponding Lie algebra by combining the logarithm map with rolling maps; using rolling maps reduces the distortions introduced in the action curves during the mapping. Finally, we perform action recognition by classifying the Lie algebra curves using the Fourier temporal pyramid representation and a support vector machine classifier. Experimental results show that combining the logarithm map with rolling maps yields improved performance compared to using the logarithm map alone.
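The first step of this flattening, the logarithm map from SO(3) to its Lie algebra, can be sketched with standard tooling; the rolling-map correction is omitted here for brevity, and the toy two-frame curve is purely illustrative.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def log_map(relative_rotation):
        """Logarithm map SO(3) -> so(3), returned as a 3-vector
        (axis-angle), which scipy exposes as a rotation vector."""
        return Rotation.from_matrix(relative_rotation).as_rotvec()

    # A skeletal action becomes a curve of such vectors over time: each
    # frame contributes one log-mapped relative rotation per body-part
    # pair, and the per-frame vectors are concatenated.
    frames = [np.eye(3),
              Rotation.from_euler("z", 30, degrees=True).as_matrix()]
    curve = np.stack([log_map(R) for R in frames])  # shape (T, 3)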
In the third part, we focus on the classification of manifold features such as linear subspaces and covariance matrices. We present a kernel-based extrinsic framework for the classification of manifold features and address the issue of kernel selection using multiple kernel learning. We introduce two criteria for jointly learning the kernel and the classifier by solving a single optimization problem. For the support vector machine classifier, we formulate the problem of learning a good kernel-classifier combination as a convex optimization problem. The proposed approach outperforms many existing methods for the classification of manifold features when applied to image set-based classification tasks.

In the fourth part, we propose a novel end-to-end trainable deep network architecture for image denoising based on a Gaussian Conditional Random Field (CRF) model. In contrast to existing discriminative denoising approaches, the proposed network explicitly models the input noise variance and is therefore capable of handling a range of noise levels. The network consists of two sub-networks: (i) a parameter generation network that generates the Gaussian CRF pairwise potential parameters based on the input image, and (ii) an inference network whose layers perform the computations involved in an iterative Gaussian CRF inference procedure. Experiments on several images show that the proposed approach produces results on par with the state of the art without training a separate network for each noise level.

In the final part of this dissertation, we propose a Gaussian CRF model-based deep network architecture for semantic image segmentation. This network explicitly models the interactions between output variables, which is important for structured prediction tasks such as semantic segmentation. The proposed network is composed of three sub-networks: (i) a Convolutional Neural Network (CNN) based unary network for generating the unary potentials, (ii) a CNN-based pairwise network for generating the pairwise potentials, and (iii) a Gaussian mean field inference network for performing Gaussian CRF inference. When trained end-to-end in a discriminative fashion, the proposed network outperforms various CNN-based semantic segmentation approaches.
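As a toy sketch of what an unrolled Gaussian CRF inference network computes, assume a simplified quadratic energy E(x) = 0.5 * x^T (I + W) x - u^T x, whose minimizer solves (I + W) x = u; each Jacobi-style update below plays the role of one inference-network layer. The energy form, shapes, and coupling matrix are illustrative assumptions, not the dissertation's exact model.

    import numpy as np

    def gaussian_crf_inference(unary, pairwise, num_iters=10):
        """Jacobi-style updates x <- unary - pairwise @ x; each update
        corresponds to one layer of the unrolled inference network."""
        x = unary.copy()
        for _ in range(num_iters):
            x = unary - pairwise @ x
        return x

    # Toy example: 4 output variables with weak off-diagonal coupling.
    u = np.array([1.0, 0.0, 0.5, 0.2])
    W = 0.1 * (np.ones((4, 4)) - np.eye(4))
    x_star = gaussian_crf_inference(u, W)
    # Agrees with the closed-form solution (I + W)^-1 u as iterations grow:
    exact = np.linalg.solve(np.eye(4) + W, u)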
Item: Efficient Sensing, Summarization and Classification of Videos (2012)
Shroff, Nitesh; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Motion perception is an integral part of visual information processing. Humans, for example, use motion to perceive the shape and structure of a scene and to segment and recognize objects. Similarly, in computational vision, motion cues have been used extensively in numerous applications, e.g., reconstructing 3D structure and segmenting objects. But there are several other applications, such as pose estimation and scene recognition, where motion plays a unique role yet which have traditionally been studied using other cues.

In this dissertation, we study a few such applications with a focus on characterizing the role of motion. In particular, we study the role of motion in efficient (a) sensing, (b) summarization, and (c) classification of videos.

We start by developing efficient sensing techniques, particularly for cases where computational vision is used for measurement -- inferring the depth, position, orientation, etc., of scene elements. We begin with the goal of devising sensing techniques that allow the estimation of the layout of a generic scene, i.e., its depth map. This is achieved by proposing an architecture and algorithm that sense the video while varying the focal settings between consecutive frames. By extending the depth-from-defocus (DFD) paradigm to dynamic scenes, we reconstruct both a depth video and an all-focus video from the captured video. We then devise a technique that, under constrained scenarios, goes a step further and estimates the precise location and orientation of objects in the scene. We show that by capturing a sequence of images while moving the illumination source between consecutive frames, we can extract specular features on high-curvature metallic objects. The robustly extracted specular features then allow us to estimate the pose of the objects, with applications in machine vision.

Next, we address the problem of concisely representing large video data. The goal here is to provide a quick overview of the video with minimal loss of detail. We argue that this can be achieved by optimizing two conflicting criteria: (a) coverage -- the summary should represent the original video well, and (b) diversity -- the elements of the summary should be as distinct from each other as possible. This is formulated as a subset selection problem, first in Euclidean space and then generalized to non-Euclidean manifolds. The generic manifold formulation allows the algorithm to handle generic computer vision datasets such as shapes, textures, and linear dynamical systems. A novel annealing-based alternation algorithm is proposed to select the optimal subset. Our experimental evaluation convincingly demonstrates that this formulation effectively highlights diverse motion patterns in the video and hence produces good summaries without using any domain knowledge.

Finally, we turn our attention to the classification of videos. We begin by devising exact and approximate nearest neighbor (NN) techniques for fast retrieval of videos from large databases. As these videos, or their representations, lie on non-Euclidean manifolds, the focus is on formulating the problem so that it exploits the geometry of the space. We present a geodesic hashing technique that employs intrinsic geodesic-based functions to hash the data for approximate but fast nearest neighbor retrieval. The proposed family of hashing functions, although intrinsic, is optimally selected to empirically satisfy the locality sensitive hashing property. This is followed by a classification technique that generates content-based, particularly scene-based, annotations of videos. We focus on characterizing the motion of scene elements and show that doing so not only provides a fine-grained description of videos but also improves classification accuracy. Subsequently, we propose dynamic attributes that can be augmented with the spatial attributes of a scene to categorize dynamic scenes in a semantically meaningful way.
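The coverage-versus-diversity trade-off behind the summarization formulation can be illustrated in plain Euclidean space. The greedy selection and the specific objective below are simplified stand-ins for the dissertation's annealing-based alternation algorithm; the function name and the weighting parameter lam are assumptions.

    import numpy as np

    def summarize(frames, k, lam=0.5):
        """Greedily pick k frame features, balancing coverage (closeness
        to the remaining frames) against diversity (distance from the
        summary chosen so far). frames has shape (n, d)."""
        n = frames.shape[0]
        dist = np.linalg.norm(frames[:, None] - frames[None, :], axis=-1)
        chosen = []
        for _ in range(k):
            best, best_score = None, -np.inf
            for i in range(n):
                if i in chosen:
                    continue
                coverage = -dist[i].mean()   # represent the video well
                diversity = min((dist[i, j] for j in chosen), default=0.0)
                score = lam * coverage + (1 - lam) * diversity
                if score > best_score:
                    best, best_score = i, score
            chosen.append(best)
        return chosen

    # keyframe_indices = summarize(feature_matrix, k=5)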
Item: Activity Representation from Video Using Statistical Models on Shape Manifolds (2010)
Abdelkader, Mohamed F.; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Activity recognition from video data is a key computer vision problem, with applications in surveillance, elderly care, and other domains. The problem involves modeling a representative shape that contains significant information about the underlying activity. In this dissertation, we present several approaches for view-invariant activity recognition by modeling shapes on various shape spaces and Riemannian manifolds.

The first two parts of this dissertation deal with activity modeling and recognition using tracks of landmark feature points. The motion trajectories of points extracted from objects involved in the activity are used to build deformation shape models for each activity, and these models are used for classification and for detection of unusual activities. In the first part, these models are represented by the recovered 3D deformation basis shapes corresponding to the activity, using a non-rigid structure-from-motion formulation. We employ a theory for estimating the amount of deformation in these models from the visual data, and we study the special case of ground-plane activities in detail because of its importance in video surveillance applications. In the second part, we propose to model an activity by learning an affine-invariant deformation subspace representation that captures the space of possible body poses associated with the activity. These subspaces can be viewed as points on a Grassmann manifold, and we propose several statistical classification models on the Grassmann manifold that capture the statistical variations of the shape data while following the intrinsic Riemannian geometry of the manifold.

The last part of this dissertation addresses the problem of recognizing human gestures from silhouette images. We represent a human gesture as a temporal sequence of human poses, each characterized by the contour of the associated human silhouette. The shape of a contour is viewed as a point on the shape space of closed curves, and each gesture is therefore characterized and modeled as a trajectory on this shape space. We utilize the Riemannian geometry of this space to propose a template-based approach and a graphical-model-based approach for modeling these trajectories. The two models are designed to account for the different invariance requirements in gesture recognition and to capture the statistical variations associated with the contour data.
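To illustrate treating a silhouette contour as a point on a shape space, the sketch below removes translation and scale to obtain a pre-shape and aligns rotation via a Procrustes (SVD) step. The Euclidean distance between aligned pre-shapes is a simplified stand-in for the Riemannian geodesic distance used in the dissertation, and the reflection correction is omitted.

    import numpy as np

    def preshape(contour):
        """Remove translation and scale from an (N, 2) contour of
        landmark points."""
        centered = contour - contour.mean(axis=0)
        return centered / np.linalg.norm(centered)

    def shape_distance(c1, c2):
        """Procrustes-aligned distance between two pre-shapes."""
        a, b = preshape(c1), preshape(c2)
        # Optimal rotation aligning a to b (Kabsch/SVD solution).
        u, _, vt = np.linalg.svd(a.T @ b)
        rotation = u @ vt
        return float(np.linalg.norm(a @ rotation - b))

    # A gesture is then a trajectory of such shape points over time;
    # frame-wise distances compare two gesture trajectories.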