SCENE AND ACTION UNDERSTANDING USING CONTEXT AND KNOWLEDGE SHARING

dc.contributor.advisor: Davis, Larry S.
dc.contributor.advisor: Shrivastava, Abhinav
dc.contributor.author: Ghosh, Pallabi
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2021-02-14T06:36:48Z
dc.date.available: 2021-02-14T06:36:48Z
dc.date.issued: 2020
dc.description.abstract: Complete scene understanding from video involves spatio-temporal decision making over long sequences and the use of world knowledge. We propose a method that captures edge connections between spatio-temporal components, or knowledge graphs, through a graph convolutional network (GCN). Our approach uses the GCN to fuse diverse cues in the video, such as detected objects, human pose, and scene information, for action segmentation. For tasks such as zero-shot and few-shot action recognition, we learn a classifier for unseen test classes by comparing them with similar training classes, encoding the similarity between classes through an explicit relationship map, i.e., the knowledge graph. We study knowledge graphs built from action phrases, verbs or nouns, and visual features to compare how they perform against one another, and we build an integrated approach for zero-shot and few-shot learning. We show further improvements by adaptively learning the input knowledge graphs and by adding a triplet loss to the task-specific loss during training, and we also report semi-supervised results to quantify the gains from our graph learning technique. For complete scene understanding, we additionally study depth completion using a deep depth prior based on the deep image prior (DIP) technique. DIP shows that the structure of convolutional neural networks (CNNs) induces a strong prior that favors natural images. Given color images and noisy or incomplete target depth maps, we optimize a randomly initialized CNN to reconstruct a restored depth map, using the network structure as a prior combined with a view-constrained photo-consistency loss computed from images taken by a geometrically calibrated camera at nearby viewpoints. Because the method relies on test-time optimization, it is independent of training data distributions. We apply this deep depth prior to inpaint and refine incomplete and noisy depth maps within both binocular and multi-view stereo pipelines.
dc.identifier: https://doi.org/10.13016/zuhd-qgyl
dc.identifier.uri: http://hdl.handle.net/1903/26825
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.title: SCENE AND ACTION UNDERSTANDING USING CONTEXT AND KNOWLEDGE SHARING
dc.type: Dissertation
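The abstract above summarizes two techniques that lend themselves to short sketches. Both snippets below are illustrative only: they are minimal PyTorch-style stand-ins for the kinds of models described, not the dissertation's actual code, and every name, shape, and hyperparameter in them is an assumption.

First, a minimal graph convolution layer in the Kipf-and-Welling style, of the kind used to fuse per-node video cues (detected objects, human pose, scene features) over spatio-temporal or knowledge-graph edges:

    import torch
    import torch.nn as nn

    class GraphConvLayer(nn.Module):
        """One GCN layer, H' = relu(A_hat @ H @ W). Illustrative only."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, h, adj):
            # h: (N, in_dim) node features, e.g. stacked object/pose/scene
            # descriptors; adj: (N, N) normalized adjacency with self-loops
            # encoding the spatio-temporal or knowledge-graph edges.
            return torch.relu(self.linear(adj @ h))

    # Hypothetical usage: fuse features for 10 graph nodes.
    h = torch.randn(10, 256)        # per-node feature vectors
    adj = torch.eye(10)             # placeholder normalized adjacency
    fused = GraphConvLayer(256, 128)(h, adj)   # (10, 128) fused features

Second, a sketch of the deep-depth-prior idea (continuing with the imports above): a randomly initialized CNN is fit to a noisy or incomplete depth map at test time, so the network structure itself acts as the regularizer. The view-constrained photo-consistency loss is omitted here, and the tiny architecture, mask handling, and step count are assumptions for illustration:

    def fit_depth_prior(noisy_depth, valid_mask, steps=2000, lr=1e-3):
        # noisy_depth: (1, 1, H, W) target depth map; valid_mask is 1 where
        # a depth observation exists and 0 where it is missing.
        net = nn.Sequential(        # toy stand-in for an encoder-decoder CNN
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        z = torch.randn(1, 32, *noisy_depth.shape[-2:])  # fixed random input
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            pred = net(z)
            # Reconstruction loss only where observations exist; the CNN's
            # inductive bias inpaints the unobserved regions.
            loss = ((pred - noisy_depth) ** 2 * valid_mask).mean()
            loss.backward()
            opt.step()
        return net(z).detach()      # restored depth map

Because the optimization runs per scene at test time, no training data is needed; in the setting described by the abstract, the masked reconstruction term would be combined with a photo-consistency loss computed from calibrated nearby views.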

Files

Original bundle

Name: GHOSH_umd_0117E_21259.pdf
Size: 14.4 MB
Format: Adobe Portable Document Format