VISION AND NATURAL LANGUAGE FOR CREATIVE APPLICATIONS, AND THEIR ANALYSIS

dc.contributor.advisor: Davis, Larry
dc.contributor.author: Manjunatha, Varun
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2019-06-22T05:32:14Z
dc.date.available: 2019-06-22T05:32:14Z
dc.date.issued: 2018
dc.description.abstract: Recent advances in machine learning, particularly in Computer Vision and Natural Language Processing, have involved training deep neural networks on enormous amounts of data. The first frontier for deep networks was uni-modal classification and detection problems (directed largely towards “intelligent robotics” and surveillance applications), while the next wave involves deploying deep networks on more creative tasks and common-sense reasoning. We present two such applications, interspersed with analyses of these deep models.

Automatic colorization is the process of adding color to greyscale images. We condition this process on language, allowing end users to manipulate a colorized image by feeding in different captions. We present two architectures for language-conditioned colorization, both of which produce more accurate and plausible colorizations than a language-agnostic version. Through this language-based framework, we can dramatically alter colorizations by manipulating descriptive color words in captions.

Researchers have observed that Visual Question Answering (VQA) models tend to answer questions by learning statistical biases in the data (for example, the answer to the question “What is the color of the sky?” is usually “Blue”). It is of interest to the community to discover such biases explicitly, both to understand the behavior of such models and to debug them. We store in a database the words of each question and answer, together with visual words corresponding to regions of interest in attention maps. By running simple rule-mining algorithms on this database, we discover human-interpretable rules that give us considerable insight into the behavior of such models. Our results also reveal unusual behaviors the models learn while attempting VQA tasks.

Visual narrative is often a combination of explicit information and judicious omissions, relying on the viewer to supply missing details. In comics, most movements in time and space are hidden in the gutters between panels. To follow the story, readers logically connect panels together by inferring unseen actions through a process called closure. While computers can now describe what is explicitly depicted in natural images, we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels. We construct a dataset, COMICS, consisting of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context. Various deep neural architectures underperform human baselines on these tasks, suggesting that COMICS poses fundamental challenges for both vision and language.

For many NLP tasks, ordered models, which explicitly encode word-order information, do not significantly outperform unordered (bag-of-words) models. One potential explanation is that the tasks themselves do not require word order to be solved. To test whether this explanation is valid, we perform several time-controlled human experiments with scrambled language inputs and compare human accuracies to those of both ordered and unordered neural models. Our results contradict the initial hypothesis, suggesting instead that humans may be less robust to word-order variation than computers.
dc.identifier: https://doi.org/10.13016/usdr-s1lr
dc.identifier.uri: http://hdl.handle.net/1903/22159
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pquncontrolled: Colorization
dc.subject.pquncontrolled: Computer Vision
dc.subject.pquncontrolled: Deep Learning
dc.subject.pquncontrolled: Machine Learning
dc.subject.pquncontrolled: Visual Question Answering
dc.title: VISION AND NATURAL LANGUAGE FOR CREATIVE APPLICATIONS, AND THEIR ANALYSIS
dc.type: Dissertation
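
The language-conditioned colorization described in the abstract can be pictured as caption features modulating image features inside the network. Below is a minimal, hypothetical PyTorch sketch of one way to do this (FiLM-style per-channel scale and shift); the class name, dimensions, and the choice of FiLM are illustrative assumptions, not the dissertation's actual architectures.

```python
# Sketch: caption embedding predicts per-channel scale (gamma) and shift
# (beta) that modulate convolutional features of the greyscale image.
# All names and sizes here are invented for illustration.
import torch
import torch.nn as nn

class CaptionConditionedBlock(nn.Module):
    def __init__(self, channels, caption_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Linear(caption_dim, 2 * channels)  # -> (gamma, beta)

    def forward(self, feats, caption_emb):
        gamma, beta = self.film(caption_emb).chunk(2, dim=-1)
        h = torch.relu(self.conv(feats))
        # Broadcast per-channel modulation over the spatial dimensions.
        return gamma[..., None, None] * h + beta[..., None, None]

block = CaptionConditionedBlock(channels=64, caption_dim=128)
feats = torch.randn(2, 64, 56, 56)   # features of the greyscale image
caption = torch.randn(2, 128)        # pooled caption embedding
out = block(feats, caption)          # shape: (2, 64, 56, 56)
```

Changing the caption embedding (e.g., swapping color words) changes gamma and beta, which is one simple mechanism by which descriptive color words could steer the predicted colorization.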
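
The VQA bias-discovery step stores question words, attended visual words, and the answer for each example as a transaction, then mines frequent, high-confidence rules. Here is a minimal plain-Python sketch assuming an Apriori-like support/confidence criterion; the token prefixes (q:, a:, v:), thresholds, and function name are invented for illustration and are not the dissertation's actual code.

```python
# Sketch: each VQA example becomes a transaction such as
# {"q:what", "q:color", "q:sky", "v:sky", "a:blue"}; frequent co-occurrences
# yield human-interpretable rules like {q:color, q:sky} -> a:blue.
from collections import Counter
from itertools import combinations

def mine_rules(transactions, min_support=0.5, min_confidence=0.5):
    """Return (antecedent, answer, support, confidence) tuples."""
    n = len(transactions)
    ante_counts = Counter()   # how often each antecedent occurs at all
    pair_counts = Counter()   # how often it co-occurs with a given answer
    for t in transactions:
        answer = next(tok for tok in t if tok.startswith("a:"))
        features = sorted(tok for tok in t if not tok.startswith("a:"))
        for size in (1, 2):    # antecedents of one or two tokens
            for ante in combinations(features, size):
                ante_counts[ante] += 1
                pair_counts[(ante, answer)] += 1
    rules = []
    for (ante, answer), c in pair_counts.items():
        support, confidence = c / n, c / ante_counts[ante]
        if support >= min_support and confidence >= min_confidence:
            rules.append((ante, answer, support, confidence))
    return sorted(rules, key=lambda r: -r[3])

# Toy usage: "what color ... sky" questions are biased towards "blue".
data = [
    {"q:what", "q:color", "q:sky", "v:sky", "a:blue"},
    {"q:what", "q:color", "q:sky", "v:cloud", "a:blue"},
    {"q:what", "q:sport", "v:bat", "a:baseball"},
]
for ante, ans, sup, conf in mine_rules(data):
    print(ante, "->", ans, f"support={sup:.2f} confidence={conf:.2f}")
```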
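
The COMICS cloze tasks give a model n context panels and ask it to choose what comes next. The following toy illustration shows an assumed instance format plus a trivial word-overlap baseline; the field names and the baseline are assumptions for exposition, not the released COMICS pipeline, and the neural models in the work are far stronger than this.

```python
# Sketch: a text-cloze instance holds the textbox transcriptions of the
# context panels and several candidate texts for the held-out panel.
from dataclasses import dataclass

@dataclass
class ClozeInstance:
    context_texts: list[str]   # transcriptions of the n preceding panels
    candidates: list[str]      # possible texts for the next panel
    label: int                 # index of the correct candidate

def overlap_baseline(inst: ClozeInstance) -> int:
    """Pick the candidate sharing the most vocabulary with the context."""
    context_vocab = set(" ".join(inst.context_texts).lower().split())
    scores = [len(context_vocab & set(c.lower().split()))
              for c in inst.candidates]
    return max(range(len(scores)), key=scores.__getitem__)

inst = ClozeInstance(
    context_texts=["The hero grabs the rope.", "He swings across the chasm!"],
    candidates=["The hero lands safely.", "I love baseball."],
    label=0,
)
print(overlap_baseline(inst))  # -> 0 ("the", "hero" overlap with the context)
```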

Files

Original bundle
Name: Manjunatha_umd_0117E_19610.pdf
Size: 29.93 MB
Format: Adobe Portable Document Format