A. James Clark School of Engineering
Permanent URI for this community: http://hdl.handle.net/1903/1654
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
Search Results (5 items)
Item: Activity Detection in Untrimmed Videos (2023)
Gleason, Joshua D.; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In this dissertation, we present solutions to the problem of activity detection in untrimmed videos, where we are interested in identifying both when and where various activity instances occur within an unconstrained video. Advances in machine learning, particularly the widespread adoption of deep learning-based methods, have yielded robust solutions to a number of historically difficult computer vision application domains. For example, recent systems for object recognition and detection, facial identification, and a number of language processing applications have found widespread commercial success; in some cases, such systems have been able to outperform humans. The same cannot be said for the problem of activity detection in untrimmed videos. This dissertation describes our investigation of, and innovative solutions for, the challenging problem of real-time activity detection in untrimmed videos.

The main contributions of our work are multiple novel activity detection systems that make strides toward the goal of commercially viable activity detection. The first introduces a proposal mechanism based on divisive hierarchical clustering of objects to produce cuboid activity proposals, followed by a classification and temporal refinement step. The second proposes a chunk-based processing mechanism and explores the tradeoff between tube and cuboid proposals. The third explores real-time activity detection and introduces strategies for achieving this performance. The final work provides a detailed look at multiple novel extensions that improve upon the state of the art in the field.
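To make the first contribution concrete, here is a minimal, hypothetical sketch of cuboid proposal generation by divisive (top-down) clustering of object detections. The 2-means splitting rule, the compactness threshold, and the equal weighting of spatial and temporal coordinates are illustrative assumptions, not the dissertation's actual design.

```python
import numpy as np

def two_means(points, iters=10, seed=0):
    """A simple 2-means split of (x, y, frame) detection centroids."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def cuboid_proposals(dets, max_spread=40.0, min_dets=5):
    """Recursively split detections until each cluster is spatially compact,
    then emit each cluster's spatio-temporal bounding cuboid.

    dets: (N, 3) array of (x, y, frame) object-detection centroids.
    Returns a list of (x0, y0, t0, x1, y1, t1) cuboid proposals."""
    dets = np.asarray(dets, dtype=float)
    spread = dets[:, :2].std(axis=0).max() if len(dets) > 1 else 0.0
    if len(dets) <= min_dets or spread <= max_spread:
        lo, hi = dets.min(axis=0), dets.max(axis=0)
        return [(lo[0], lo[1], lo[2], hi[0], hi[1], hi[2])]
    labels = two_means(dets)
    if labels.min() == labels.max():      # degenerate split: stop recursing
        lo, hi = dets.min(axis=0), dets.max(axis=0)
        return [(lo[0], lo[1], lo[2], hi[0], hi[1], hi[2])]
    return (cuboid_proposals(dets[labels == 0], max_spread, min_dets)
            + cuboid_proposals(dets[labels == 1], max_spread, min_dets))
```

Each returned cuboid would then be scored by a classifier and temporally refined, per the pipeline the abstract outlines.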
Item: Learning Visual Classifiers From Limited Labeled Images (2013)
Pillai, Jaishanker K.; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Recognizing humans and their activities from images and video is one of the key goals of computer vision. While supervised learning algorithms like Support Vector Machines and Boosting have offered robust solutions, they require large amounts of labeled data for good performance. It is often difficult to acquire large labeled datasets due to the significant human effort involved in data annotation. However, it is considerably easier to collect unlabeled data due to the availability of inexpensive cameras and large public databases like Flickr and YouTube. In this dissertation, we develop efficient machine learning techniques for visual classification from small amounts of labeled training data by utilizing the structure in the test data, labeled data from a different domain, and unlabeled data. This dissertation has three main parts.

In the first part of the dissertation, we consider how multiple noisy samples available during testing can be utilized to perform accurate visual classification. Such multiple samples are easily available in video-based recognition problems, which are commonly encountered in visual surveillance. Specifically, we study the problem of unconstrained human recognition from iris images. We develop a sparse representation-based selection and recognition scheme, which learns the underlying structure of clean images. This learned structure is utilized to develop a quality measure, and a quality-based fusion scheme is proposed to combine the varying evidence. Furthermore, we extend the method to incorporate privacy, an important requirement in practical biometric applications, without significantly affecting the recognition performance.

In the second part, we analyze the problem of utilizing labeled data from a different domain to aid visual classification. We consider the problem of shifts in acquisition conditions between training and testing, which is very common in iris biometrics. In particular, we study the sensor mismatch problem, where the training samples are acquired using a sensor much older than the one used for testing. We provide one of the first solutions to this problem: a kernel learning framework to adapt iris data collected from one sensor to another. Extensive evaluations on iris data from multiple sensors demonstrate that the proposed method leads to considerable improvement in cross-sensor recognition accuracy. Furthermore, since the proposed technique requires minimal changes to the iris recognition pipeline, it can easily be incorporated into existing iris recognition systems.

In the last part of the dissertation, we analyze how unlabeled data available during training can assist visual classification applications. Here, we consider still image-based vision applications involving humans, where explicit motion cues are not available. A human pose often conveys not only the configuration of the body parts but also implicit predictive information about the ensuing motion. We propose a probabilistic framework to infer this dynamic information associated with a human pose, using unlabeled and unsegmented videos available during training. The inference problem is posed as a non-parametric density estimation problem on non-Euclidean manifolds. Since direct modeling is intractable, we develop a data-driven approach, estimating the density for the test sample under consideration. Statistical inference on the estimated density provides us with quantities of interest, such as the most probable future motion of the human and the amount of motion information conveyed by the pose.
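As a rough illustration of the sparse representation-based recognition idea in the first part above, the sketch below codes a test sample over a gallery dictionary with an l1 penalty and classifies by per-class reconstruction residual. The ISTA solver, the regularization weight, and the residual-based decision rule are standard choices assumed here, not necessarily the dissertation's exact formulation.

```python
import numpy as np

def ista(D, y, lam=0.05, iters=300):
    """ISTA for the lasso: min_x 0.5 * ||D @ x - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        x = x - D.T @ (D @ x - y) / L                            # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)    # soft threshold
    return x

def src_classify(D, labels, y):
    """Sparse-representation classification: pick the class whose gallery
    atoms best reconstruct y.

    D: (d, n) column-normalized gallery matrix; labels: (n,) class per column."""
    x = ista(D, y)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    resid = [np.linalg.norm(y - D @ np.where(labels == c, x, 0.0))
             for c in classes]
    return classes[int(np.argmin(resid))]
```

In the same spirit, how concentrated the recovered coefficients are on a single class can serve as a simple quality score for selecting among multiple noisy samples.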
Item: Activity Representation from Video Using Statistical Models on Shape Manifolds (2010)
Abdelkader, Mohamed F.; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Activity recognition from video data is a key computer vision problem, with applications in surveillance, elderly care, and related areas. This problem is associated with modeling a representative shape which contains significant information about the underlying activity. In this dissertation, we present several approaches for view-invariant activity recognition via modeling shapes on various shape spaces and Riemannian manifolds.

The first two parts of this dissertation deal with activity modeling and recognition using tracks of landmark feature points. The motion trajectories of points extracted from objects involved in the activity are used to build deformation shape models for each activity, and these models are used for classification and detection of unusual activities. In the first part of the dissertation, these models are represented by the recovered 3D deformation basis shapes corresponding to the activity, using a non-rigid structure-from-motion formulation. We use a theory for estimating the amount of deformation for these models from the visual data. We study the special case of ground-plane activities in detail because of its importance in video surveillance applications.

In the second part of the dissertation, we propose to model the activity by learning an affine-invariant deformation subspace representation that captures the space of possible body poses associated with the activity. These subspaces can be viewed as points on a Grassmann manifold. We propose several statistical classification models on the Grassmann manifold that capture the statistical variations of the shape data while respecting the intrinsic Riemannian geometry of these manifolds.

The last part of this dissertation addresses the problem of recognizing human gestures from silhouette images. We represent a human gesture as a temporal sequence of human poses, each characterized by a contour of the associated human silhouette. The shape of a contour is viewed as a point on the shape space of closed curves and, hence, each gesture is characterized and modeled as a trajectory on this shape space. We utilize the Riemannian geometry of this space to propose a template-based and a graphical-model-based approach for modeling these trajectories. The two models are designed to account for the different invariance requirements in gesture recognition, and also to capture the statistical variations associated with the contour data.
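For the second part, a minimal sketch of how activity subspaces can be treated as points on a Grassmann manifold and compared via principal angles; the subspace dimension, the SVD-based basis extraction, and the nearest-neighbour rule are illustrative assumptions rather than the dissertation's specific statistical models.

```python
import numpy as np

def shape_subspace(shapes, dim=3):
    """Orthonormal basis of the dominant deformation directions, i.e. a point
    on the Grassmann manifold. shapes: (num_frames, 2 * num_landmarks)."""
    centered = shapes - shapes.mean(axis=0)
    U, _, _ = np.linalg.svd(centered.T, full_matrices=False)
    return U[:, :dim]

def grassmann_distance(U1, U2):
    """Arc-length (geodesic) distance from the principal angles between the
    subspaces spanned by the orthonormal columns of U1 and U2."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)   # cosines of the angles
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return float(np.linalg.norm(theta))

def nearest_neighbor_label(query, gallery):
    """Classify a query basis; gallery: list of (basis, activity_label)."""
    return min(gallery, key=lambda g: grassmann_distance(query, g[0]))[1]
```

The intrinsic statistical models the abstract mentions would replace the nearest-neighbour rule with class-conditional densities built from these same geodesic quantities.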
Item: Statistical and Geometric Modeling of Spatio-Temporal Patterns for Video Understanding (2009)
Turaga, Pavan; Chellappa, Ramalingam; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Spatio-temporal patterns abound in the real world, and understanding them computationally holds the promise of enabling a large class of applications such as video surveillance, biometrics, computer graphics, and animation. In this dissertation, we study models and algorithms to describe complex spatio-temporal patterns in videos for a wide range of applications.

The spatio-temporal pattern recognition problem involves recognizing an input video as an instance of a known class. For this problem, we show that a first-order Gauss-Markov process is an appropriate model for the space of primitives. We then show that the space of primitives is not a Euclidean space but a Riemannian manifold, and we use the geometric properties of this manifold to define distances and statistics. This paves the way to modeling temporal variations of the primitives. We then show applications of these techniques to activity recognition and pattern discovery from long videos.

The pattern discovery problem, on the other hand, requires uncovering patterns from large datasets in an unsupervised manner, for applications such as automatic indexing and tagging. Most state-of-the-art techniques index videos according to the global content in the scene, such as color, texture, and brightness. In this dissertation, we discuss the problem of activity-based indexing of videos. We examine the various issues involved in such an effort and describe a general framework to address the problem. We then design a cascade-of-dynamical-systems model for clustering videos based on their dynamics. We augment the traditional dynamical systems model in two ways. First, we describe activities as a cascade of dynamical systems, which significantly enhances the expressive power of the model while retaining many of the computational advantages of using dynamical models. Second, we derive methods to incorporate view- and rate-invariance into these models, so that similar actions are clustered together irrespective of the viewpoint or the rate of execution of the activity. We also derive algorithms to learn the model parameters from a video stream, and demonstrate how a given video sequence may be segmented into different clusters, where each cluster represents an activity.

Finally, we show the broader impact of the algorithms and tools developed in this dissertation for several image-based recognition problems that involve statistical inference over non-Euclidean spaces. We demonstrate how an understanding of the geometry of the underlying space leads to methods that are more accurate than traditional approaches. We present examples in shape analysis, object recognition, video-based face recognition, and age estimation from facial features to demonstrate these ideas.
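To ground the dynamical-systems modeling, here is a minimal sketch of a standard linear dynamical system fit and a subspace-angle comparison between two fitted systems. The PCA-style identification (in the manner of Doretto et al.'s dynamic-texture work), the finite observability horizon, and the state dimension are common textbook choices assumed for illustration, not the cascade model the dissertation actually develops.

```python
import numpy as np

def fit_lds(Y, n=5):
    """PCA-style identification of x_{t+1} = A x_t, y_t = C x_t from a
    (feature_dim, num_frames) observation matrix Y."""
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                                  # observation matrix
    X = np.diag(S[:n]) @ Vt[:n]                   # estimated state trajectory
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])      # least-squares transition
    return A, C

def observability(A, C, horizon=10):
    """Finite extended observability matrix [C; CA; CA^2; ...]."""
    blocks, M = [], np.eye(A.shape[0])
    for _ in range(horizon):
        blocks.append(C @ M)
        M = A @ M
    return np.vstack(blocks)

def lds_distance(model1, model2, horizon=10):
    """Subspace-angle distance between the observability subspaces of two
    fitted (A, C) systems; small values indicate similar dynamics."""
    Q1 = np.linalg.qr(observability(*model1, horizon))[0]
    Q2 = np.linalg.qr(observability(*model2, horizon))[0]
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return float(np.linalg.norm(theta))
```

A distance of this kind could feed directly into the clustering step the abstract describes, with one fitted model per video segment.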
Item: Shape Dynamical Models for Activity Recognition and Coded Aperture Imaging for Light-Field Capture (2008-11-21)
Veeraraghavan, Ashok N.; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Classical applications of pattern recognition in image processing and computer vision have typically dealt with modeling, learning, and recognizing static patterns in images and videos. There is, of course, a whole class of patterns in nature that dynamically evolve over time: human activities, behaviors of insects and animals, facial expression changes, lip reading, and genetic expression profiles are some examples. Models and algorithms to study these patterns must take the dynamics of these patterns into account while exploiting classical pattern recognition techniques.

The first part of this dissertation is an attempt to model and recognize such dynamically evolving patterns. We look at specific instances of such dynamic patterns, like human activities and the behaviors of insects, and develop algorithms to learn models of such patterns and to classify them. The models and algorithms proposed are validated by extensive experiments on gait-based person identification, activity recognition, and simultaneous tracking and behavior analysis of insects. The problem of comparing dynamically deforming shape sequences arises repeatedly in problems like activity recognition and lip reading. We describe and evaluate parametric and non-parametric models for shape sequences. In particular, we emphasize the need to model variations in activity execution rate and propose a non-parametric model that is insensitive to such variations. These models and the resulting algorithms are shown to be extremely effective for a wide range of applications, from gait-based person identification to human action recognition. We further show that the shape dynamical models are not only effective for recognition, but can also serve as effective priors for simultaneous tracking and behavior analysis. We validate the proposed algorithm by performing simultaneous behavior analysis and tracking on videos of bees dancing in a hive.

In the last part of this dissertation, we investigate computational imaging, an emerging field where the process of image formation involves the use of a computer. The current trend in computational imaging is to capture as much information about the scene as possible at capture time, so that appropriate images with varying focus, aperture, blur, and colorimetric settings may be rendered as required. In this regard, capturing the 4D light field, as opposed to a 2D image, allows us to freely vary viewpoint and focus when rendering an image. In this dissertation, we describe a theoretical framework for reversibly modulating 4D light fields using an attenuating mask in the optical path of a lens-based camera. Based on this framework, we present a novel design to reconstruct the 4D light field from a 2D camera image without any additional refractive elements as required by previous light field cameras. The patterned mask attenuates light rays inside the camera instead of bending them, and the attenuation recoverably encodes the rays on the 2D sensor. Our mask-equipped camera focuses just as a traditional camera does, capturing conventional 2D photos at full sensor resolution, but the raw pixel values also hold a modulated 4D light field. The light field can be recovered by rearranging the tiles of the 2D Fourier transform of the sensor values into 4D planes and computing the inverse Fourier transform. In addition, one can also recover the full-resolution image information for the in-focus parts of the scene.
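The tile-rearrangement recovery described at the end of the abstract can be sketched in a few lines. The angular resolution values, the assumption that the sensor dimensions divide evenly into tiles, and the omission of mask-gain compensation and spectral windowing are simplifications for illustration, not the full reconstruction procedure.

```python
import numpy as np

def recover_light_field(sensor, nu=3, nv=3):
    """Demultiplex a mask-modulated sensor image into a 4D light field by
    rearranging 2D Fourier tiles into 4D planes and inverting, as the
    abstract describes.

    sensor: (nu*H, nv*W) raw sensor image
    nu, nv: angular resolution encoded by the mask (illustrative values)
    returns: (nu, nv, H, W) light-field estimate"""
    F = np.fft.fftshift(np.fft.fft2(sensor))
    H, W = F.shape[0] // nu, F.shape[1] // nv
    # carve the 2D spectrum into nu x nv tiles -> 4D spectrum (nu, nv, H, W)
    tiles = F.reshape(nu, H, nv, W).transpose(0, 2, 1, 3)
    # inverse 4D FFT over all four axes yields L(theta_u, theta_v, x, y)
    lf = np.fft.ifftn(np.fft.ifftshift(tiles), axes=(0, 1, 2, 3))
    return lf.real

# Usage sketch: each lf[i, j] is one angular view, so refocusing amounts to
# shifting and summing these views.
# lf = recover_light_field(raw_image, nu=3, nv=3)
```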