Data and Methods for Reference Resolution in Different Modalities
One foundational goal of artificial intelligence is to build intelligent agents that interact with humans. To do so, such agents must be able to infer from human communication which concept a span of symbols refers to and, like humans, to map these representations to perceptual inputs, visual or otherwise. In NLP, the problem of discovering which spans of text refer to the same real-world entity is called coreference resolution. This dissertation extends the problem beyond text by mapping concepts referred to by text spans to concepts represented in images. It also investigates the difficulty of real-world coreference resolution. Lastly, it broadens the definition of reference to include abstractions referred to by non-contiguous distributions of text. A central theme throughout this thesis is the paucity of data for solving hard problems of reference, which it addresses by designing several datasets. To investigate hard text coreference, this dissertation analyses a coreference-heavy domain, namely questions from the trivia game quiz bowl, and creates a novel dataset. Solving quiz bowl questions requires robust coreference resolution and world knowledge, something humans possess but current models do not. This work uses distributional semantics as a source of world knowledge and also addresses sub-problems of coreference such as mention detection. Next, to investigate complex visual representations of concepts, this dissertation uses the domain of paintings. Mapping spans of text in descriptions of paintings to the regions of the paintings they describe is a non-trivial problem because paintings are substantially harder than natural images. Distributional semantics are again used here.
Finally, to discover prototypical concepts present in distributed rather than contiguous spans of text, this dissertation investigates a source rich in prototypical concepts, namely movie scripts. All movie narratives, character arcs, and character relationships are distilled into sequences of interconnected prototypical concepts, which are discovered using unsupervised deep learning models that also draw on distributional semantics. I conclude this dissertation by discussing potential future research on downstream tasks that can be aided by the discovery of multi-modal referring concepts.