UMD Theses and Dissertations

Permanent URI for this collectionhttp://hdl.handle.net/1903/3

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4 month delay in the appearance of a given thesis/dissertation in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Browse

Search Results

Now showing 1 - 2 of 2
  • Item
    Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes
    (2024) Suri, Saksham; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis looks at visual perception through the lens of supervision and data dynamics across recognition and generation landscapes. Generative and discriminative modeling form important pillars in computer vision. Depending on the task techniques to better learn and utilize the data and labels can change. Through this work we investigate different tasks along this landscape focusing on different supervision strategies, highlighting pitfalls in current approaches and propose modified architectures and losses to utilize the data better under different settings. On the recognition side we start by analyzing Vision Transformers (ViTs) through a comprehensive analysis under varied supervision paradigms. We look at a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision by delving into attention mechanisms and learned representations. We then look at a more specific case of supervision geared towards object detection which is called sparse supervision where their are missing annotations. We propose to utilize self and semi-supervised techniques to solve this task. Finally, we also explore a discovery style framework with applications on GAN generated image detection. Unlike sparse supervision discussed earlier, this scenario handles the case where are test time we have an unknown number of new classes. We were the first work proposing this problem where instead of just identifying synthetic images, we also try to group them based on their generation source. The exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery style problems. On the generation side we delve into different supervision strategies involving decomposing and decoupling representations. In the first work we tackle the problem of paired Image-to-Image (I2I) translation by decomposing supervision into reconstruction and residuals and highlight issues with traditional training approaches. We then look at generating talking head videos through two different kinds of supervision, video and audio. For driving the generation using a video we look at decoupling representations for the task of few-shot talking-head synthesis where the supervision is provided using only a few samples (shots). For this task we factorize the representation into spatial and style components which helps the learning. To supervise the generation additionally through audio, we look at multimodal supervision for lip-synchronized talking head generation. For this we incorporate audio and video modalities to synthesize lifelike talking-heads which can work even in in-the-wild scenarios. In the last part we showcase two works which link our experiences from generation and recognition where we explore generative modeling to improve recognition models. The first work here utilizes the advancements in diffusion based image generation models to improve recognition models. Given the high fidelity and control of generation which diffusion models have brought, we utilize synthetic data from these models and create a suitable pipeline to utilize this data effectively to improve detection and segmentation performance. As a follow up to our ViT analysis we also propose a new technique to utilize off the shelf pretrained ViTs and generate high resolution features using a learnt lightweight feature transform. These high resolution features are especially effective for dense tasks like correspondence, segmentation, detection and object discovery.
  • Thumbnail Image
    Item
    Parsing, Generation and Grammar
    (2016) Momma, Shota Momma; Phillips, Colin; Linguistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Humans use their grammatical knowledge in more than one way. On one hand, they use it to understand what others say. On the other hand, they use it to say what they want to convey to others (or to themselves). In either case, they need to assemble the structure of sentences in a systematic fashion, in accordance with the grammar of their language. Despite the fact that the structures that comprehenders and speakers assemble are systematic in an identical fashion (i.e., obey the same grammatical constraints), the two ‘modes’ of assembling sentence structures might or might not be performed by the same cognitive mechanisms. Currently, the field of psycholinguistics implicitly adopts the position that they are supported by different cognitive mechanisms, as evident from the fact that most psycholinguistic models seek to explain either comprehension or production phenomena. The potential existence of two independent cognitive systems underlying linguistic performance doubles the problem of linking the theory of linguistic knowledge and the theory of linguistic performance, making the integration of linguistics and psycholinguistic harder. This thesis thus aims to unify the structure building system in comprehension, i.e., parser, and the structure building system in production, i.e., generator, into one, so that the linking theory between knowledge and performance can also be unified into one. I will discuss and unify both existing and new data pertaining to how structures are assembled in understanding and speaking, and attempt to show that the unification between parsing and generation is at least a plausible research enterprise. In Chapter 1, I will discuss the previous and current views on how parsing and generation are related to each other. I will outline the challenges for the current view that the parser and the generator are the same cognitive mechanism. This single system view is discussed and evaluated in the rest of the chapters. In Chapter 2, I will present new experimental evidence suggesting that the grain size of the pre-compiled structural units (henceforth simply structural units) is rather small, contrary to some models of sentence production. In particular, I will show that the internal structure of the verb phrase in a ditransitive sentence (e.g., The chef is donating the book to the monk) is not specified at the onset of speech, but is specified before the first internal argument (the book) needs to be uttered. I will also show that this timing of structural processes with respect to the verb phrase structure is earlier than the lexical processes of verb internal arguments. These two results in concert show that the size of structure building units in sentence production is rather small, contrary to some models of sentence production, yet structural processes still precede lexical processes. I argue that this view of generation resembles the widely accepted model of parsing that utilizes both top-down and bottom-up structure building procedures. In Chapter 3, I will present new experimental evidence suggesting that the structural representation strongly constrains the subsequent lexical processes. In particular, I will show that conceptually similar lexical items interfere with each other only when they share the same syntactic category in sentence production. The mechanism that I call syntactic gating, will be proposed, and this mechanism characterizes how the structural and lexical processes interact in generation. I will present two Event Related Potential (ERP) experiments that show that the lexical retrieval in (predictive) comprehension is also constrained by syntactic categories. I will argue that the syntactic gating mechanism is operative both in parsing and generation, and that the interaction between structural and lexical processes in both parsing and generation can be characterized in the same fashion. In Chapter 4, I will present a series of experiments examining the timing at which verbs’ lexical representations are planned in sentence production. It will be shown that verbs are planned before the articulation of their internal arguments, regardless of the target language (Japanese or English) and regardless of the sentence type (active object-initial sentence in Japanese, passive sentences in English, and unaccusative sentences in English). I will discuss how this result sheds light on the notion of incrementality in generation. In Chapter 5, I will synthesize the experimental findings presented in this thesis and in previous research to address the challenges to the single system view I outlined in Chapter 1. I will then conclude by presenting a preliminary single system model that can potentially capture both the key sentence comprehension and sentence production data without assuming distinct mechanisms for each.