Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Search Results
35 results
Item AN ANALYSIS OF BOTTOM-UP ATTENTION MODELS AND MULTIMODAL REPRESENTATION LEARNING FOR VISUAL QUESTION ANSWERING (2019)
Narayanan, Venkatraman; Shrivastava, Abhinav; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

A Visual Question Answering (VQA) task is the ability of a system to take an image and an open-ended, natural language question about the image and provide a natural language text answer as the output. VQA is a relatively nascent field, with only a few strategies explored so far. The performance of VQA systems, in terms of the accuracy of answers to image-question pairs, requires a considerable overhaul before such systems can be used in practice. The general system for performing the VQA task consists of an image encoder network, a question encoder network, a multi-modal attention network that combines the information obtained from the image and the question, and an answering network that generates natural language answers for the image-question pair. In this thesis, we follow two strategies to improve the performance (accuracy) of VQA. The first is a representation learning approach, utilizing state-of-the-art Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), to improve the image encoding system of VQA. This thesis evaluates four variants of GANs to identify the architecture that best captures the data distribution of the images, and finds that the GAN variants become unstable and fail to yield a viable image encoding system for VQA. The second strategy is to evaluate an alternative approach to the attention network, using multi-modal compact bilinear pooling, in the existing VQA system. This second strategy increased VQA accuracy by 2% compared to the current state-of-the-art technique.
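To make the multi-modal compact bilinear pooling mentioned in the item above more concrete, the short sketch below shows one common formulation of the operation: each feature vector is projected with Count Sketch, and the two sketches are combined by circular convolution via the FFT, approximating the outer product of image and question features in a compact space. The dimensions, random seeds, and example inputs are illustrative assumptions, not the configuration evaluated in the thesis.

    # Illustrative multimodal compact bilinear pooling (MCB) in NumPy.
    import numpy as np

    def count_sketch_params(input_dim, output_dim, seed):
        rng = np.random.default_rng(seed)
        h = rng.integers(0, output_dim, size=input_dim)  # target bucket per input index
        s = rng.choice([-1.0, 1.0], size=input_dim)      # random signs
        return h, s

    def count_sketch(x, h, s, output_dim):
        y = np.zeros(output_dim)
        np.add.at(y, h, s * x)                           # y[h[i]] += s[i] * x[i]
        return y

    def mcb_pool(img_feat, q_feat, output_dim=1024):
        h_i, s_i = count_sketch_params(img_feat.size, output_dim, seed=0)
        h_q, s_q = count_sketch_params(q_feat.size, output_dim, seed=1)
        sketch_i = count_sketch(img_feat, h_i, s_i, output_dim)
        sketch_q = count_sketch(q_feat, h_q, s_q, output_dim)
        # circular convolution of the two sketches, computed in the frequency domain
        return np.real(np.fft.ifft(np.fft.fft(sketch_i) * np.fft.fft(sketch_q)))

    pooled = mcb_pool(np.random.rand(2048), np.random.rand(300))  # toy image/question features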
Item Formality Style Transfer Within and Across Languages with Limited Supervision (2019)
Niu, Xing; Carpuat, Marine; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

While much natural language processing work focuses on analyzing language content, language style also conveys important information about the situational context and purpose of communication. When editing an article, professional editors take the target audience into account to select appropriate word choice and grammar. Similarly, professional translators translate documents for a specific audience and often ask about the expected tone of the content when taking a translation job. Computational models of natural language should consider both meaning and style. Controlling style is an emerging research area in text rewriting and is under-investigated in machine translation. In this dissertation, we present a new perspective which closely connects formality transfer and machine translation: we aim to control style in language generation, with a focus on rewriting English or translating French to English with a desired formality. These are challenging tasks because annotated examples of style transfer are available only in limited quantities. We first address this problem by inducing a lexical formality model based on word embeddings and a small number of representative formal and informal words. This enables us to assign sentential formality scores and rerank translation hypotheses whose formality scores are closer to a user-provided formality level. To capture broader formality changes, we then turn to neural sequence-to-sequence models. Joint modeling of formality transfer and machine translation enables formality control in machine translation without dedicated training examples. Along the way, we also improve low-resource neural machine translation.

Item Interpreting Machine Learning Models and Application of Homotopy Methods (2019)
Yousefzadeh, Roozbeh; O'Leary, Dianne P; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Neural networks have been criticized for their lack of easy interpretation, which undermines confidence in their use for important applications. We show that a trained neural network can be interpreted using flip points. A flip point is any point that lies on the boundary between two output classes: e.g., for a neural network with a binary yes/no output, a flip point is any input that generates equal scores for "yes" and "no". The flip point closest to a given input is of particular importance, and this point is the solution to a well-posed optimization problem. We show that computing closest flip points allows us, for example, to systematically investigate the decision boundaries of trained networks, to interpret and audit them with respect to individual inputs and entire datasets, and to find vulnerabilities to adversarial attacks. We demonstrate that flip points can help identify mistakes made by a model, improve its accuracy, and reveal the most influential features for classifications. We also show that some common assumptions about the decision boundaries of neural networks can be unreliable. Additionally, we present methods for designing the structure of feed-forward networks using matrix conditioning. Finally, we investigate an unsupervised learning method, the Gaussian graphical model, and provide mathematical tools for its interpretation.
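The flip-point concept in the item above can be pictured with a small sketch: given a classifier exposed as a callable returning two scores, a flip point on the straight segment between two differently classified inputs can be located by bisecting on the score difference. The thesis computes the closest flip point by solving a well-posed optimization problem; the bisection below is only a conceptual illustration under the stated assumptions.

    # Find a flip point (equal 'yes'/'no' scores) on the segment between x0 and x1.
    # Assumes model(x) returns a (yes_score, no_score) pair and that x0 and x1
    # are classified differently, so the segment crosses the decision boundary.
    import numpy as np

    def score_gap(model, x):
        yes, no = model(x)
        return yes - no

    def flip_point_on_segment(model, x0, x1, tol=1e-6, max_iter=100):
        lo, hi = 0.0, 1.0                        # parameterize x(t) = (1 - t) * x0 + t * x1
        gap_lo = score_gap(model, x0)
        mid = 0.5
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            x_mid = (1.0 - mid) * x0 + mid * x1
            gap_mid = score_gap(model, x_mid)
            if abs(gap_mid) < tol:               # scores (almost) equal: flip point found
                break
            if np.sign(gap_mid) == np.sign(gap_lo):
                lo, gap_lo = mid, gap_mid
            else:
                hi = mid
        return (1.0 - mid) * x0 + mid * x1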
Item Reasoning about Geometric Object Interactions in 3D for Manipulation Action Understanding (2019)
Zampogiannis, Konstantinos; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In order to efficiently interact with human users, intelligent agents and autonomous systems need the ability to interpret human actions. We focus our attention on manipulation actions, wherein an agent typically grasps an object and moves it, possibly altering its physical state. Agent-object and object-object interactions during a manipulation are a defining part of the performed action itself. In this thesis, we focus on extracting semantic cues, derived from geometric object interactions in 3D space during a manipulation, that are useful for action understanding at the cognitive level. First, we introduce a simple grounding model for the most common pairwise spatial relations between objects and investigate the descriptive power of their temporal evolution for action characterization. We propose a compact, abstract action descriptor that encodes the geometric object interactions during action execution, as captured by the spatial relation dynamics. Our experiments on a diverse dataset confirm both the validity and effectiveness of our spatial relation models and the discriminative power of our representation with respect to the underlying action semantics. Second, we model and detect lower-level interactions, namely object contacts and separations, viewing them as topological scene changes within a dense motion estimation setting. In addition to improving motion estimation accuracy in the challenging case of motion boundaries induced by these events, our approach shows promising performance in their explicit detection and classification. Building upon dense motion estimation and using detected contact events as an attention mechanism, we propose a bottom-up pipeline for the guided segmentation and rigid motion extraction of manipulated objects. Finally, in addition to our methodological contributions, we introduce a new open-source software library for point cloud data processing, developed for the needs of this thesis, which aims at providing an easy-to-use, flexible, and efficient framework for the rapid development of performant software for a range of 3D perception tasks.
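As a rough illustration of the contact and separation events mentioned above, the sketch below flags a contact when the minimum distance between two tracked objects' point clouds falls below a threshold, and a separation when it rises back above it. This simple per-frame proximity test is an assumption made purely for illustration; the thesis detects such events as topological scene changes within a dense motion estimation framework.

    # Naive contact/separation detection between two objects given as point clouds.
    import numpy as np

    def min_distance(cloud_a, cloud_b):
        """Smallest pairwise distance between two (N, 3) point clouds."""
        diffs = cloud_a[:, None, :] - cloud_b[None, :, :]
        return float(np.sqrt((diffs ** 2).sum(axis=-1)).min())

    def contact_events(frames_a, frames_b, threshold=0.01):
        """Yield (frame_index, 'contact' or 'separation') transitions over time."""
        in_contact = False
        for t, (a, b) in enumerate(zip(frames_a, frames_b)):
            close = min_distance(a, b) < threshold
            if close and not in_contact:
                yield t, "contact"
            elif not close and in_contact:
                yield t, "separation"
            in_contact = close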
Item Connecting Documents, Words, and Languages Using Topic Models (2019)
Yang, Weiwei; Boyd-Graber, Jordan L; Resnik, Philip S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Topic models discover latent topics in documents and summarize documents at a high level. To improve topic models' topic quality and extrinsic performance, external knowledge is often incorporated as part of the generative story. One form of external knowledge is weighted text links that indicate similarity or relatedness between the connected objects. This dissertation 1) uncovers the latent structures in observed weighted links and integrates them into topic modeling, and 2) learns latent weighted links from other external knowledge to improve topic modeling. We consider incorporating links at three different levels: documents, words, and topics. We first look at binary document links, e.g., citation links between papers. Document links indicate topic similarity of the connected documents. Past methods model the document links separately, ignoring the overall link density. We instead uncover latent document blocks in which documents are densely connected and tend to talk about similar topics. We introduce LBH-RTM, a relational topic model with lexical weights, block priors, and hinge loss. It extracts informative topic priors from the document blocks for the documents' topic generation. It predicts unseen document links with block and lexical features and hinge loss, in addition to topical features. It outperforms past methods in link prediction and gives more coherent topics. Like documents, words are also linked, but usually with real-valued weights. Word links are known as word associations and indicate the semantic relatedness of the connected words. They provide information about word relationships beyond the co-occurrence patterns in the training corpora. To extract and incorporate the knowledge in word associations, we introduce methods to find the most salient word pairs. The methods organize the words in a tree structure, which serves as a prior (i.e., a tree prior) for tree LDA. The methods are straightforward but effective, yielding more coherent topics than vanilla LDA and slightly improving extrinsic classification performance. Weighted topic links are different: topics are latent, so it is difficult to obtain ground-truth topic links, but learned weighted topic links can bridge topics across languages. We introduce a multilingual topic model (MTM) that assumes each language has its own topic distributions over only the words in that language and learns weighted topic links based on word translations and the words' topic distributions. It does not force the topic spaces of different languages to be aligned and is more robust than previous MTMs that do. It outperforms past MTMs in classification while still giving coherent topics on less comparable and smaller corpora.
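One simple way to picture the weighted topic links learned by the MTM above is to accumulate, over a word-translation dictionary, the product of the two topics' probabilities for each translated word pair. The sketch below is only a rough illustration under that assumption, not the model's actual estimator; topic-word distributions are represented as plain word-to-probability dictionaries.

    # Illustrative scoring of weighted links between topics of two languages.
    def topic_link_weight(topic_en, topic_fr, translations):
        """Sum of P_en(w_en | topic) * P_fr(w_fr | topic) over translation pairs."""
        weight = 0.0
        for w_en, w_fr in translations:          # e.g. [("house", "maison"), ...]
            weight += topic_en.get(w_en, 0.0) * topic_fr.get(w_fr, 0.0)
        return weight

    def all_topic_links(topics_en, topics_fr, translations):
        """Weight matrix over all (English topic, French topic) pairs."""
        return [[topic_link_weight(te, tf, translations) for tf in topics_fr]
                for te in topics_en]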
Item Teaching Machines to Ask Useful Clarification Questions (2018)
Rao, Sudha; Daumé III, Hal; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Inquiry is fundamental to communication, and machines cannot effectively collaborate with humans unless they can ask questions. Asking questions is also a natural way for machines to express uncertainty, a task of increasing importance in an automated society. In the field of natural language processing, despite decades of work on question answering, there is relatively little work on question asking. Moreover, most of the previous work has focused on generating reading-comprehension-style questions which are answerable from the provided text. The goal of my dissertation work, on the other hand, is to understand how we can teach machines to ask clarification questions that point at the missing information in a text. Primarily, we focus on two scenarios where we find such question asking to be useful: (1) clarification questions on posts found in community-driven technical support forums such as StackExchange, and (2) clarification questions on descriptions of products in e-retail platforms such as Amazon. In this dissertation we claim that, given large amounts of previously asked questions in various contexts (within a particular scenario), we can build machine learning models that can ask useful questions in a new, unseen context (within the same scenario). To validate this hypothesis, we first create two large datasets of contexts paired with clarification questions (and answers) for the two scenarios of technical support and e-retail by automatically extracting this information from available data dumps of StackExchange and Amazon. Given these datasets, in our first line of research, we build a machine learning model that first extracts a set of candidate clarification questions and then ranks them such that more useful questions appear higher in the ranking. Our model is inspired by the idea of the expected value of perfect information: a good question is one whose expected answer will be useful. We hypothesize that by explicitly modeling the value added by an answer to a given context, our model can learn to identify more useful questions. We evaluate our model against expert human judgments on the StackExchange dataset and demonstrate significant improvements over controlled baselines. In our second line of research, we build a machine learning model that learns to generate a new clarification question from scratch, instead of ranking previously seen questions. We hypothesize that we can train our model to generate good clarification questions by incorporating the usefulness of an answer to the clarification question into recent sequence-to-sequence neural network approaches. We develop a Generative Adversarial Network (GAN) in which the generator is a sequence-to-sequence model and the discriminator is a utility function that models the value of updating the context with the answer to the clarification question. We evaluate our model on our two datasets of StackExchange and Amazon, using both automatic metrics and human judgments of usefulness, specificity, and relevance, showing that our approach outperforms both a retrieval-based model and ablations that exclude the utility model and the adversarial training. We observe that our question generation model generates questions that span a wide spectrum of specificity with respect to the given context. We argue that generating questions at a desired level of specificity (to a given context) can be useful in many scenarios. In our last line of research, we therefore build a question generation model which, given a context and a level of specificity (generic or specific), generates a question at that level of specificity. We hypothesize that by providing the level of specificity of the question to our model during training, it can learn patterns in the question that indicate the level of specificity and use those to generate questions at a desired level. To automatically label the large number of questions in our training data with their level of specificity, we train a binary classifier which, given a context and a question, predicts whether the question is specific (to the context) or generic. We demonstrate the effectiveness of our specificity-controlled question generation model by evaluating it on the Amazon dataset using human judgments.
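The expected-value-of-perfect-information idea behind the ranking model above can be sketched as follows: each candidate clarification question is scored by the utility its likely answers would add to the context, and candidates are ordered by that score. The answer model and utility function here are hypothetical stand-ins for the learned components described in the dissertation.

    # EVPI-style ranking of candidate clarification questions.
    # `answer_model(context, question)` is assumed to yield (answer, probability)
    # pairs, and `utility(context, question, answer)` to return a usefulness score;
    # both are placeholders for learned models.
    def expected_utility(context, question, answer_model, utility):
        return sum(prob * utility(context, question, answer)
                   for answer, prob in answer_model(context, question))

    def rank_questions(context, candidates, answer_model, utility):
        return sorted(candidates,
                      key=lambda q: expected_utility(context, q, answer_model, utility),
                      reverse=True)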
Item GRAPH-BASED METHODS FOR PATH PLANNING WITH DYNAMIC OBSTACLES USING LINEAR TEMPORAL LOGIC (2018)
Han, Wenqi; Herrmann, Jeffrey; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Autonomous vehicles are expected to play a key role in rescue and transportation. Planning an optimal path with minimum computational effort for these vehicles in their missions improves their efficiency and adds safety for the vehicles and third parties on the ground. The objective of this thesis is to study the computational effort of four planning methods that implement linear temporal logic (LTL) to translate high-level mission requirements and environmental specifications. The Potential Field method and the Critical Path method required less computational effort to find one of the shortest paths for the mission. The Multigraph Network Planning method and the Critical Path method can find all possible paths with a predetermined path length. The Random Walk method required more computational effort and memory compared to the other three methods.

Item Towards Fast and Efficient Representation Learning (2018)
Li, Hao; Samet, Hanan; Goldstein, Thomas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The success of deep learning and convolutional neural networks in many fields is accompanied by a significant increase in computation cost. With increasing model complexity and the pervasive usage of deep neural networks, there is a surge of interest in fast and efficient model training and inference on both cloud and embedded devices. Meanwhile, understanding the reasons for trainability and generalization is fundamental to the further development of the field. This dissertation explores approaches for fast and efficient representation learning with a better understanding of trainability and generalization. In particular, we ask the following questions and provide our solutions: 1) How to reduce the computation cost for fast inference? 2) How to train low-precision models on resource-constrained devices? 3) What does the loss surface of neural nets look like, and how does it affect generalization? To reduce the computation cost for fast inference, we propose to prune filters from CNNs that are identified as having a small effect on the prediction accuracy. By removing filters with small norms together with their connected feature maps, the computation cost can be reduced accordingly without requiring special software or hardware. We show that this simple filter pruning approach can reduce the inference cost while regaining close to the original accuracy by retraining the networks. To further reduce the inference cost, quantizing model parameters with low-precision representations has shown significant speedups, especially for edge devices that have limited computing resources, memory capacity, and power consumption. To enable on-device learning on low-power systems, removing the dependency on a full-precision model during training is the key challenge. We study various quantized training methods with the goal of understanding the differences in behavior, and the reasons for success or failure. We address the question of why algorithms that maintain floating-point representations work so well, while fully quantized training methods stall before training is complete. We show that training algorithms that exploit high-precision representations have an important greedy search phase that purely quantized training methods lack, which explains the difficulty of training using low-precision arithmetic. Finally, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. We introduce a simple filter normalization method that helps us visualize loss function curvature and make meaningful side-by-side comparisons between loss functions. The sharpness of minimizers correlates well with generalization error when this visualization is used. Then, using a variety of visualizations, we explore how training hyper-parameters affect the shape of minimizers, and how network architecture affects the loss landscape.
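The filter pruning described in the item above can be sketched in a few lines: compute the L1 norm of each convolutional filter, drop the filters with the smallest norms, and remove the matching input channels of the following layer. Shapes and the pruning ratio are illustrative; biases and batch-norm parameters, which would be pruned with the same indices, are omitted for brevity.

    # L1-norm filter pruning for one convolution layer and its successor.
    import numpy as np

    def prune_filters(conv_w, next_w, prune_ratio=0.3):
        """conv_w: (out_c, in_c, k, k); next_w: (next_out_c, out_c, k, k)."""
        out_c = conv_w.shape[0]
        norms = np.abs(conv_w).reshape(out_c, -1).sum(axis=1)  # L1 norm of each filter
        n_keep = out_c - int(prune_ratio * out_c)
        keep = np.sort(np.argsort(norms)[-n_keep:])            # keep the largest-norm filters
        pruned_conv = conv_w[keep]                             # drop whole filters
        pruned_next = next_w[:, keep]                          # drop their feature maps downstream
        return pruned_conv, pruned_next, keep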
Item Towards Generalized Frameworks for Object Recognition (2018)
SANTHANAM, VENKATARAMAN; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Over the past few years, deep convolutional neural network (DCNN) based approaches have been immensely successful in tackling a diverse range of object recognition problems. Popular DCNN architectures like deep residual networks (ResNets) are highly generic, not just for classification, but also for high-level tasks like detection and tracking which rely on classification DCNNs as their backbone. The generality of DCNNs, however, does not extend to image-to-image (Im2Im) regression tasks (e.g., super-resolution, denoising, RGB-to-depth, relighting). For such tasks, DCNNs are often highly task-specific and require specific ancillary post-processing methods. The major issue plaguing the design of generic architectures for such tasks is the tradeoff between context and locality given a fixed computation/memory budget. We first present a generic DCNN architecture for Im2Im regression that can be trained end-to-end without any further machinery. Our proposed architecture, the Recursively Branched Deconvolutional Network (RBDN), features a cheap early multi-context image representation and an efficient recursive branching scheme with extensive parameter sharing and learnable upsampling. We provide qualitative and quantitative results on three diverse tasks (relighting, denoising, and colorization) and show that our proposed RBDN architecture obtains results comparable to the state-of-the-art on each of these tasks when used off-the-shelf, without any post-processing or task-specific architectural modifications. Second, we focus on gradient flow and optimization in ResNets. In particular, we theoretically analyze why pre-activation (v2) ResNets outperform the original (v1) ResNets on CIFAR datasets but not on ImageNet. Our analysis reveals that although v1-ResNets lack ensembling properties, they can have a higher effective depth than v2-ResNets. Subsequently, we show that downsampling projections (while few in number) have a significantly detrimental effect on performance. We show that by simply replacing downsampling projections with identity-like dense-reshape shortcuts, the classification results of standard residual architectures like ResNets, ResNeXts, and SE-Nets improve by up to 1.2% on ImageNet, without any increase in computational complexity (FLOPs). Finally, we present a robust non-parametric probabilistic ensemble method for multi-classification, which outperforms the state-of-the-art ensemble methods on several machine learning and computer vision datasets for object recognition, with statistically significant improvements. The approach is particularly geared towards multi-classification problems with very little training data and/or a fairly high proportion of outliers, for which training end-to-end DCNNs is not very beneficial.

Item Fast optimization methods for machine learning, and game-theoretic models of cultural evolution (2018)
De, Soham; Nau, Dana S; Goldstein, Thomas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis has two parts. In the first part, we explore fast stochastic optimization methods for machine learning. Mathematical optimization is a backbone of modern machine learning. Most machine learning problems require optimizing some objective function that measures how well a model matches a data set, with the intention of drawing patterns and making decisions on new, unseen data. The success of optimization algorithms in solving these problems is critical to the success of machine learning, and has enabled the research community to explore more complex machine learning problems that require bigger models and larger datasets. Stochastic gradient descent (SGD) has become the standard optimization routine in machine learning, and in particular in deep neural networks, due to its impressive performance across a wide variety of tasks and models. SGD, however, can often be slow for neural networks with many layers and typically requires careful user oversight for setting hyperparameters properly. While innovations such as batch normalization and skip connections have helped alleviate some of these issues, why such innovations are required eludes full understanding, and it is worthwhile to gain deeper theoretical insights into these problems and to consider more advanced optimization methods specifically tailored towards training large, complex models. In this part of the thesis, we review and analyze some of the recent progress made in this direction, and develop new optimization algorithms that are provably fast, make training significantly easier, and require less user oversight.
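For reference, the stochastic gradient descent routine discussed above amounts to repeatedly stepping against a minibatch gradient estimate, usually with a momentum buffer. The sketch below assumes generic NumPy parameter arrays and a user-supplied gradient function; the hyperparameter values are placeholders, not recommendations from the thesis.

    # Minibatch SGD with momentum on a list of NumPy parameter arrays.
    # `grad_fn(params, batch)` is assumed to return gradients matching `params`.
    import numpy as np

    def sgd(params, grad_fn, batches, lr=0.1, momentum=0.9, epochs=10):
        velocity = [np.zeros_like(p) for p in params]
        for _ in range(epochs):
            for batch in batches:
                grads = grad_fn(params, batch)   # stochastic gradient estimate
                for p, v, g in zip(params, velocity, grads):
                    v *= momentum
                    v -= lr * g
                    p += v                       # in-place parameter update
        return params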
Then, we discuss the theory of quantized networks, which use low-precision weights to compress and accelerate neural networks, and examine when and why they are trainable. Finally, we discuss some recent results on how the convergence of SGD is affected by the architecture of neural nets, and we show using theoretical analysis that wide networks train faster than narrow nets, and deeper networks train slower than shallow nets, an effect often observed in practice. In the second part of the thesis, we study the evolution of cultural norms in human societies using game-theoretic models, drawing from research in cross-cultural psychology. Understanding human behavior and modeling how cultural norms evolve in different human societies is vital for designing policies and avoiding conflicts around the world. In this part, we explore ways to use computational game-theoretic techniques, and in particular evolutionary game-theoretic (EGT) models, to gain insight into why different human societies have different norms and behaviors. We first describe an evolutionary game-theoretic model to study how norms change in a society, based on the idea that different strengths of norms in societies translate to different game-theoretic interaction structures and incentives. We identify conditions that determine when societies change their existing norms, when they are resistant to such change, and how this depends on the strength of norms in a society. Next, we extend this study to analyze the evolutionary relationship between the tendency to conform and how quickly a population reacts when conditions make a change in norm desirable. Our analysis identifies conditions under which a tipping point is reached in a population, causing norms to change rapidly. Next, we study conditions that affect the existence of group-biased behavior among humans (i.e., favoring others from the same group and being hostile towards others from different groups). Using an evolutionary game-theoretic model, we show that out-group hostility is dramatically reduced by mobility. Technological and societal advances over the past centuries have greatly increased the degree to which humans change physical locations, and our results show that in highly mobile societies, one's choice of action is more likely to depend on which individual one is interacting with, rather than on the group to which that individual belongs.
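As a generic illustration of the evolutionary game-theoretic machinery used in the second part of the dissertation above (not its specific norm-change or group-bias models), the sketch below runs discrete-time replicator dynamics on a symmetric two-strategy game, tracking how the population share of one behavior evolves under an assumed payoff matrix.

    # Discrete-time replicator dynamics for a symmetric 2-strategy game.
    # payoff[i][j] is the payoff to strategy i against strategy j; the matrix
    # in the example is a generic placeholder, not a model from the dissertation.
    def replicator(payoff, x0=0.5, steps=200):
        """Track the population share x of strategy 0 over time."""
        x = x0
        history = [x]
        for _ in range(steps):
            f0 = x * payoff[0][0] + (1 - x) * payoff[0][1]  # fitness of strategy 0
            f1 = x * payoff[1][0] + (1 - x) * payoff[1][1]  # fitness of strategy 1
            mean = x * f0 + (1 - x) * f1
            if mean > 0:
                x = x * f0 / mean                           # share grows with relative fitness
            history.append(x)
        return history

    # Example: a coordination game in which following the majority norm pays more.
    shares = replicator([[3.0, 0.0], [1.0, 2.0]], x0=0.6)   # share of strategy 0 rises toward 1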