AN ANALYSIS OF BOTTOM-UP ATTENTION MODELS AND MULTIMODAL REPRESENTATION LEARNING FOR VISUAL QUESTION ANSWERING
dc.contributor.advisor | Shrivastava, Abhinav | en_US |
dc.contributor.author | Narayanan, Venkatraman | en_US |
dc.contributor.department | Systems Engineering | en_US |
dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
dc.date.accessioned | 2020-07-10T05:30:15Z | |
dc.date.available | 2020-07-10T05:30:15Z | |
dc.date.issued | 2019 | en_US |
dc.description.abstract | Visual Question Answering (VQA) is the task of taking an image and an open-ended, natural-language question about that image and producing a natural-language answer as output. VQA is a relatively nascent field, and only a few strategies have been explored. The accuracy of current VQA systems on image-question pairs requires considerable improvement before such systems can be used in practice. A typical VQA system consists of an image encoder network, a question encoder network, a multi-modal attention network that combines the information obtained from the image and the question, and an answering network that generates a natural-language answer for the image-question pair. In this thesis, we follow two strategies to improve VQA accuracy. The first is a representation-learning approach that uses state-of-the-art Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) to improve the image encoder. We evaluate four GAN variants to identify the architecture that best captures the data distribution of the images, and find that the variants train unstably and do not yield a viable image encoding for VQA. The second strategy replaces the attention network in the existing VQA system with multi-modal compact bilinear pooling; this improves VQA accuracy by 2% over the current state-of-the-art technique. | en_US |
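The multi-modal compact bilinear (MCB) pooling mentioned in the abstract can be illustrated with a short sketch. The following is a minimal, illustrative PyTorch implementation, not code from the thesis; the function names (`mcb_pool`, `count_sketch`), the output dimension `d = 16000`, and the fixed-seed hash construction are assumptions made for this sketch. MCB (Fukui et al., 2016) approximates the outer product of the image and question feature vectors by computing a Count Sketch of each and circularly convolving the two sketches, which reduces to an element-wise product in the FFT domain.

```python
import torch

def count_sketch(x, h, s, d):
    """Project x of shape (batch, n) into d dimensions using hash indices h
    (length n, values in [0, d)) and random signs s (length n, values +/-1)."""
    out = torch.zeros(x.size(0), d, dtype=x.dtype)
    out.index_add_(1, h, x * s)  # scatter-add signed features into d buckets
    return out

def mcb_pool(v, q, d=16000, seed=0):
    """Fuse image features v (batch, n_v) and question features q (batch, n_q)
    into a single (batch, d) vector via compact bilinear pooling."""
    g = torch.Generator().manual_seed(seed)  # fixed hashes: same projection every call
    h_v = torch.randint(0, d, (v.size(1),), generator=g)
    h_q = torch.randint(0, d, (q.size(1),), generator=g)
    s_v = (torch.randint(0, 2, (v.size(1),), generator=g) * 2 - 1).to(v.dtype)
    s_q = (torch.randint(0, 2, (q.size(1),), generator=g) * 2 - 1).to(q.dtype)
    sk_v = count_sketch(v, h_v, s_v, d)
    sk_q = count_sketch(q, h_q, s_q, d)
    # Circular convolution of the two sketches == element-wise product in FFT space.
    return torch.fft.irfft(torch.fft.rfft(sk_v) * torch.fft.rfft(sk_q), n=d)

# Example: fuse a 2048-d image feature with a 1024-d question feature.
v = torch.randn(2, 2048)  # e.g., pooled CNN features
q = torch.randn(2, 1024)  # e.g., final LSTM hidden state
fused = mcb_pool(v, q)    # shape: (2, 16000), input to the answering network
```

In practice the fused vector is typically signed-square-rooted and L2-normalized before being passed to the answer classifier; those details are omitted here.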
dc.identifier | https://doi.org/10.13016/ahta-tfks | |
dc.identifier.uri | http://hdl.handle.net/1903/26167 | |
dc.language.iso | en | en_US |
dc.subject.pqcontrolled | Artificial intelligence | en_US |
dc.subject.pqcontrolled | Robotics | en_US |
dc.subject.pqcontrolled | Computer science | en_US |
dc.subject.pquncontrolled | Artificial Intelligence | en_US |
dc.subject.pquncontrolled | Computer Vision | en_US |
dc.subject.pquncontrolled | Deep learning | en_US |
dc.subject.pquncontrolled | Natural Language Processing | en_US |
dc.subject.pquncontrolled | Visual Question Answering | en_US |
dc.title | AN ANALYSIS OF BOTTOM-UP ATTENTION MODELS AND MULTIMODAL REPRESENTATION LEARNING FOR VISUAL QUESTION ANSWERING | en_US |
dc.type | Thesis | en_US |
Files
Original bundle
- Name: Narayanan_umd_0117N_20587.pdf
- Size: 1.9 MB
- Format: Adobe Portable Document Format