UMD Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/3
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
2 results
Item: Domain Transfer Learning for Object and Action Recognition (2015)
Zheng, Jingjing; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Visual recognition has always been a fundamental problem in computer vision. Its task is to learn visual categories using labeled training data and then identify unlabeled new instances of those categories. However, due to the large variations in visual data, visual recognition remains a challenging problem. Handling the variations in captured images is important for real-world applications, where unconstrained data acquisition scenarios are widely prevalent.

In this dissertation, we first address the variations between training and testing data. In particular, for cross-domain object recognition, we propose a Grassmann manifold-based domain adaptation approach that models the domain shift using the geodesic connecting the source and target domains. We further measure the distance between two data points from different domains by integrating the distances between their projections onto all the intermediate subspaces along the geodesic. Because it exploits all the intermediate subspaces along the geodesic, the proposed approach produces a more accurate metric.
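The approach above connects the source and target domains by a geodesic on the Grassmann manifold and uses the intermediate subspaces along it. As a rough illustration of that general idea (not the dissertation's actual implementation), the following NumPy sketch samples subspaces along the geodesic between two orthonormal bases, e.g. top-d PCA bases of source and target features; all function names and parameters are illustrative.

```python
import numpy as np

def grassmann_geodesic_bases(Y0, Y1, num_steps=10):
    """Sample orthonormal bases along the Grassmann geodesic from span(Y0)
    to span(Y1).  Y0 and Y1 are (n, d) matrices with orthonormal columns."""
    n, d = Y0.shape
    # Principal angles between the subspaces: Y0^T Y1 = V diag(cos(theta)) W^T
    V, cos_theta, Wt = np.linalg.svd(Y0.T @ Y1)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # Direction orthogonal to span(Y0), scaled so that the t = 1 basis spans Y1
    H = (np.eye(n) - Y0 @ Y0.T) @ Y1 @ Wt.T
    Q = np.zeros_like(H)
    nz = np.sin(theta) > 1e-12
    Q[:, nz] = H[:, nz] / np.sin(theta)[nz]
    # Intermediate subspaces Y(t), t in [0, 1]; Y(0) spans Y0, Y(1) spans Y1
    return [Y0 @ V @ np.diag(np.cos(t * theta)) + Q @ np.diag(np.sin(t * theta))
            for t in np.linspace(0.0, 1.0, num_steps)]

def grassmann_distance(Y0, Y1):
    """Geodesic (arc-length) distance between the two subspaces."""
    cos_theta = np.linalg.svd(Y0.T @ Y1, compute_uv=False)
    return np.linalg.norm(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```

Projecting source and target samples onto each sampled basis and accumulating the distances between the projections gives a metric in the spirit of the one described above.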
For cross-view action recognition, we present two effective approaches to learn transferable dictionaries and view-invariant sparse representations. In the first approach, we learn a set of transferable dictionaries, where each dictionary corresponds to one camera view. The set of dictionaries is learned simultaneously from sets of correspondence videos taken at different views, with the aim of encouraging each video in the set to have the same sparse representation. In the second approach, we relax this constraint by encouraging correspondence videos to have similar sparse representations. In addition, we learn a common dictionary that is incoherent to the view-specific dictionaries for cross-view action recognition. The set of view-specific dictionaries is learned for specific views, while the common dictionary is shared across different views. In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary.

In order to handle the more general variations in captured images, we also exploit semantic information to learn discriminative feature representations for visual recognition. Class labels are often organized in a hierarchical taxonomy based on their semantic meanings. We propose a novel multi-layer hierarchical dictionary learning framework for region tagging. Specifically, we learn a node-specific dictionary for each semantic label in the taxonomy and preserve the hierarchical semantic structure in the relationships among these node-dictionaries. Our approach can also transfer knowledge from semantic labels at higher levels to help learn the classifiers for semantic labels at lower levels.

Moreover, we exploit semantic attributes to boost the performance of visual recognition. We encode objects or actions based on attributes that describe them as high-level concepts. We consider two types of attributes: one type is generated by humans, while the second is data-driven attributes extracted from the data using dictionary learning methods. Attribute-based representations may exhibit variations due to noisy and redundant attributes. We therefore propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented, and its solution is guaranteed to be at least a (1-1/e)-approximation to the optimum.

Item: Analyzing Complex Events and Human Actions in "in-the-wild" Videos (2014)
Lee, Hyungtae; Davis, Larry S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

We are living in a world where it is easy to acquire videos of events ranging from private picnics to public concerts, and to share them publicly via websites such as YouTube. The ability of smartphones to create these videos and upload them to the internet has led to an explosion of video data, which in turn has led to interesting research directions involving the analysis of "in-the-wild" videos. To process these types of videos, various recognition tasks such as pose estimation, action recognition, and event recognition become important in computer vision. This thesis presents several such recognition problems and proposes mid-level models to address them.

First, a discriminative deformable part model is presented for the recovery of qualitative pose, inferring coarse pose labels (e.g., left, front-right, back), a task more robust to common confounding factors that hinder the inference of exact 2D or 3D joint locations. Our approach automatically selects parts that are predictive of qualitative pose and trains their appearance and deformation costs to best discriminate between qualitative poses. Unlike previous approaches, our parts are both selected and trained to improve qualitative pose discrimination and are shared by all the qualitative pose models. This leads to both increased accuracy and higher efficiency, since fewer part models are evaluated for each image. In comparisons with two state-of-the-art approaches on a public dataset, our model shows superior performance.

Second, the thesis proposes the use of a robust pose feature based on part-based human detectors (poselets) for the task of action recognition in relatively unconstrained videos, i.e., videos collected from the web. This feature, based on the original poselet activation vector, coarsely models pose and its transitions over time. Our main contributions are that we improve the original feature's compactness and discriminability by greedy set cover over subsets of joint configurations, and incorporate it into a unified video-based action recognition framework. Experiments show that the pose feature alone is extremely informative, yielding performance that matches most state-of-the-art approaches when using only our proposed improvements to its compactness and discriminability. By combining our pose feature with motion and shape, the proposed method outperforms state-of-the-art approaches on two public datasets.
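Both abstracts lean on greedy selection of a compact subset: the first dissertation selects discriminative attributes by greedily optimizing a submodular objective (hence the (1-1/e) guarantee), and the second compacts the poselet feature by greedy set cover over joint configurations. The sketch below shows the generic greedy mechanism on a toy coverage objective; the coverage sets and names are hypothetical, and the actual selection criteria in the dissertations are richer.

```python
def greedy_select(candidates, gain, budget):
    """Repeatedly add the candidate with the largest marginal gain.
    For a monotone submodular objective (coverage is the classic example),
    the greedy solution is at least a (1 - 1/e) approximation to the optimum."""
    selected, remaining = [], set(candidates)
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda c: gain(selected, c))
        if gain(selected, best) <= 0:      # nothing left improves the objective
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy set-cover-style objective: each candidate "attribute" covers some samples.
covers = {"attr_1": {1, 2, 3}, "attr_2": {3, 4}, "attr_3": {4, 5, 6, 7}, "attr_4": {1, 7}}

def coverage_gain(selected, candidate):
    already = set().union(*(covers[s] for s in selected)) if selected else set()
    return len(covers[candidate] - already)

print(greedy_select(covers, coverage_gain, budget=2))  # ['attr_3', 'attr_1']
```

The same loop applies whether the marginal gain measures attribute discriminability or how many joint configurations a poselet subset covers; only the gain function changes.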
Third, clauselets, sets of concurrent actions and their temporal relationships, are proposed, and their application to video event analysis is explored. Clauselets are trained in two stages. In the first stage, clauselet detectors that find a limited set of actions in particular qualitative temporal configurations, based on Allen's interval relations, are trained. In the second stage, the first-level detectors are applied to the training videos, and temporal patterns between their activations, involving more actions over longer durations, are discriminatively learned, leading to improved second-level clauselet models. The utility of clauselets is demonstrated by applying them to the task of "in-the-wild" video event recognition on the TRECVID MED 11 dataset. Not only do clauselets achieve state-of-the-art results on this task, but qualitative results suggest that they may also lead to semantically meaningful descriptions of videos in terms of detected actions and their temporal relationships.

Finally, the thesis addresses the task of searching for videos given text queries that are not known at training time. This typically involves zero-shot learning, where detectors for a large set of concepts, attributes, or object parts are learned under the assumption that, once the search query is known, they can be combined to detect novel complex visual categories. These detectors are typically trained on annotated data that is time-consuming and expensive to obtain, and a successful system requires many of them to generalize well at test time. In addition, these detectors are so general that they are not well tuned to the specific query or target data, since neither is known at training time. Our approach addresses the annotation problem by searching the web to discover visual examples of short text phrases. Top-ranked search results are used to learn general, potentially noisy, visual phrase detectors. Given a search query and a target dataset, the visual phrase detectors are adapted to both the query and the unlabeled target data to remove the influence of incorrect training examples or correct examples that are irrelevant to the search query. Our adaptation process exploits the spatio-temporal co-occurrence of visual phrases that are found in the target data and are relevant to the search query by iteratively refining both the visual phrase detectors and the spatio-temporally grouped phrase detections ('clauselets'). Our approach is demonstrated on the challenging TRECVID MED13 EK0 dataset, and we show that, using visual features alone, it outperforms state-of-the-art approaches that use visual, audio, and text (OCR) features.
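The clauselet models above describe the temporal layout of concurrent actions using Allen's interval relations. The small sketch below classifies the relation between two detected action intervals; it illustrates the interval algebra itself, not the clauselet training procedure, and the interval format is an assumption.

```python
def allen_relation(a, b):
    """Return the Allen interval relation between intervals a = (a_start, a_end)
    and b = (b_start, b_end), assuming start < end for both."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1: return "before"
    if b2 < a1: return "after"
    if a2 == b1: return "meets"
    if b2 == a1: return "met-by"
    if a1 == b1 and a2 == b2: return "equals"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

# e.g. a "reach" action that starts before and runs into a "grasp" action:
print(allen_relation((0.0, 2.5), (1.0, 4.0)))  # overlaps
```

Qualitative temporal configurations of this kind, over small sets of detected actions, are what the first-level clauselet detectors described above are trained on.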