SCENE AND ACTION UNDERSTANDING USING CONTEXT AND KNOWLEDGE SHARING
Abstract
Complete scene understanding from video involves spatio-temporal decision making over long sequences and the use of world knowledge. We propose a method that captures edge connections between these spatio-temporal components, or knowledge graphs, through a graph convolutional network (GCN). For action segmentation, our approach uses the GCN to fuse diverse cues in the video, such as detected objects, human pose, and scene information. For tasks such as zero-shot and few-shot action recognition, we learn a classifier for unseen test classes by comparison with similar training classes. We encode the similarity between two classes through an explicit relationship map, i.e., a knowledge graph. We study knowledge graphs built from action phrases, verbs or nouns, and visual features, and compare how they perform against one another. We build an integrated approach for zero-shot and few-shot learning, and show further improvements by adaptively learning the input knowledge graphs and by training with a triplet loss alongside the task-specific loss. We also report semi-supervised learning results to assess the gains from our graph learning technique.
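To make the zero-shot setup concrete, the sketch below propagates per-class embeddings over a knowledge-graph adjacency matrix with a two-layer GCN to predict classifier weights for all classes, including unseen ones. This is a minimal PyTorch sketch under stated assumptions, not the thesis implementation: the random adjacency, the random class embeddings, and all dimensions are hypothetical stand-ins.

    # Minimal sketch of knowledge-graph propagation for zero-shot
    # classification; shapes and the random graph are hypothetical.
    import torch
    import torch.nn as nn

    def normalize_adjacency(a: torch.Tensor) -> torch.Tensor:
        """Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}."""
        a = a + torch.eye(a.size(0))
        d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
        return d_inv_sqrt @ a @ d_inv_sqrt

    class ZeroShotGCN(nn.Module):
        """Two GCN layers mapping class embeddings to classifier weights."""
        def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
            super().__init__()
            self.fc1 = nn.Linear(in_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, out_dim)

        def forward(self, h: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
            h = torch.relu(self.fc1(a_hat @ h))  # propagate, then transform
            return self.fc2(a_hat @ h)           # per-class classifier weights

    num_classes, embed_dim, feat_dim = 40, 300, 512
    a = torch.rand(num_classes, num_classes)     # stand-in knowledge graph
    a_hat = normalize_adjacency((a + a.T) / 2)   # symmetrize and normalize
    class_embeddings = torch.randn(num_classes, embed_dim)  # e.g. word vectors

    gcn = ZeroShotGCN(embed_dim, 256, feat_dim)
    classifiers = gcn(class_embeddings, a_hat)   # (num_classes, feat_dim)
    video_feature = torch.randn(feat_dim)        # feature of one test clip
    scores = classifiers @ video_feature         # scores for all classes

In this style of setup, supervision is only available for seen classes during training; graph propagation then transfers classifier weights to unseen classes that are connected to similar seen classes, which matches the comparison-based transfer described above.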
For complete scene understanding, we also study depth completion using a deep depth prior based on the deep image prior (DIP) technique. DIP shows that the structure of convolutional neural networks (CNNs) induces a strong prior that favors natural images. Given color images and noisy or incomplete target depth maps, we optimize a randomly initialized CNN to reconstruct a restored depth map, using the CNN's structure as the prior combined with a view-constrained photo-consistency loss. This loss is computed from images taken by a geometrically calibrated camera at nearby viewpoints. Because the method relies on test-time optimization, it is independent of any training data distribution. We apply this deep depth prior to inpaint and refine incomplete, noisy depth maps within both binocular and multi-view stereo pipelines.
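The core of this approach is a test-time optimization loop: a randomly initialized CNN is fit to the observed depth values, and the network structure fills in the missing or noisy regions. Below is a minimal sketch of that loop, assuming a toy convolutional network and reducing the view-constrained photo-consistency term to a masked reconstruction loss (warping calibrated neighboring views would add a second loss term); it is an illustration, not the thesis code.

    # Test-time deep-depth-prior optimization; network size, step count,
    # and learning rate are hypothetical choices for illustration.
    import torch
    import torch.nn as nn

    def fit_depth_prior(noisy_depth: torch.Tensor, mask: torch.Tensor,
                        steps: int = 2000, lr: float = 1e-3) -> torch.Tensor:
        """noisy_depth: (1,1,H,W) target depth; mask: (1,1,H,W) valid pixels."""
        net = nn.Sequential(                  # small stand-in CNN prior
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        z = torch.randn(1, 32, *noisy_depth.shape[-2:])  # fixed random input
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            pred = net(z)
            # Fit only where target depth is valid; the CNN structure
            # regularizes the prediction in the missing regions.
            loss = ((pred - noisy_depth)[mask.bool()] ** 2).mean()
            loss.backward()
            opt.step()
        return net(z).detach()                # restored / inpainted depth map

Since nothing is learned from a dataset, the same loop can be applied to depth maps coming out of either a binocular or a multi-view stereo pipeline without retraining.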