Understanding and Modeling Explicit and Implicit Representations of the Visual World

Advisor

Shrivastava, Abhinav

Abstract

While a standard megapixel image might be worth a thousand words, it is at the same time worth more than a million pixels. Understanding the local and global structure of these pixels, and their meaning, has been a core problem since the birth of computer vision, enabling tasks like classification and image generation. Storing such massive amounts of pixels is another issue: with hundreds of millions of terabytes of new data created every day, there is strong motivation to develop powerful compression algorithms. In my research, I aim to tackle both of these problems by modeling the visual world with two main paradigms: training neural networks that model images explicitly with representation vectors, and training networks that model images implicitly in their own weights.
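To make the two paradigms concrete, the sketch below contrasts them on a single image: an explicit encoder maps the image to a representation vector, while an implicit network is overfit to the image so that the pixels end up stored in its weights. The architectures and sizes here are illustrative toy stand-ins, not the models studied in the thesis.

import torch
import torch.nn as nn

# Explicit paradigm: an encoder maps an image to a representation vector.
# This toy encoder stands in for any pretrained backbone.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 512),
)
image = torch.rand(1, 3, 224, 224)
embedding = encoder(image)  # explicit representation: a 512-d vector

# Implicit paradigm: a small MLP is overfit to this one image; afterwards,
# the image lives in the network's weights rather than in any output vector.
inr = nn.Sequential(nn.Linear(2, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 3))  # (x, y) -> (r, g, b)
ys, xs = torch.meshgrid(torch.linspace(0, 1, 224),
                        torch.linspace(0, 1, 224), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
target = image[0].permute(1, 2, 0).reshape(-1, 3)

opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for _ in range(200):  # "encoding" the image = fitting the network
    opt.zero_grad()
    loss = ((inr(coords) - target) ** 2).mean()
    loss.backward()
    opt.step()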

In this thesis, I first discuss my work on explicit unsupervised image representations and multimodal understanding. I describe a benchmark where I compare learned representations in terms of both their downstream task performance and the relationships between the embeddings themselves. I also create a pipeline for generating synthetic text data that enables better benchmarking and training of multimodal models for long video understanding.
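One standard way to measure the relationships between embedding spaces themselves is linear Centered Kernel Alignment (CKA, Kornblith et al. 2019); the minimal implementation below is an illustrative stand-in for whichever metrics the benchmark actually uses.

import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two embedding matrices.

    X: (n, d1) and Y: (n, d2), with rows aligned by example. Returns a value
    in [0, 1]; 1 means the spaces agree up to rotation and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = (np.linalg.norm(X.T @ X, ord="fro")
           * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(num / den)

# Hypothetical embeddings of the same 1,000 images from two models.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((1000, 512))
Q, _ = np.linalg.qr(rng.standard_normal((512, 512)))  # a random rotation
emb_b = emb_a @ Q
print(linear_cka(emb_a, emb_b))  # ~1.0: CKA is invariant to rotations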

Second, I investigate diffusion as a way to train a potentially unified model for explicit image representations, one that can perform both recognition and generative tasks. I explore the capacity of pre-trained diffusion networks for recognition tasks and present a lightweight, learnable feedback mechanism that improves their performance. I then adapt this feedback mechanism for fast, higher-quality image generation.
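As a rough illustration of using a pre-trained diffusion network for recognition, the sketch below noises an image to an intermediate timestep, runs one denoising pass, and reads out an intermediate activation through a forward hook as a feature for a linear probe. TinyDenoiser, the noise schedule, and the probe are toy assumptions; the learnable feedback mechanism itself is not shown.

import torch
import torch.nn as nn

# Toy stand-in for a pretrained diffusion U-Net; in practice you would load
# a real pretrained denoiser and hook one of its intermediate blocks.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(3, 64, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(64, 64, 3, padding=1)  # the block we hook
        self.up = nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1)

    def forward(self, x, t):
        h = torch.relu(self.down(x))
        h = torch.relu(self.mid(h))
        return self.up(h)

denoiser = TinyDenoiser()
features = {}
denoiser.mid.register_forward_hook(lambda m, i, o: features.update(mid=o))

@torch.no_grad()
def diffusion_features(image, t, alphas_cumprod):
    """Noise `image` to timestep t, denoise once, return pooled mid features."""
    a = alphas_cumprod[t]
    noisy = a.sqrt() * image + (1 - a).sqrt() * torch.randn_like(image)
    denoiser(noisy, t)  # we only need the hooked activation
    return features["mid"].mean(dim=(2, 3))  # (batch, 64) feature vectors

alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # toy noise schedule
feats = diffusion_features(torch.rand(4, 3, 64, 64), 100, alphas_cumprod)
probe = nn.Linear(64, 10)  # a linear probe tests recognition quality
logits = probe(feats)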

Finally, I discuss an alternative paradigm for image understanding: implicit neural representations. I provide an overview of this area, including my work on video compression. I also present a framework for better understanding what these models learn. I benchmark these models, distill design principles from the results, and propose improved designs that prioritize encoding time as well as quality-space tradeoffs. I further improve hyper-networks, which predict implicit model weights directly from video inputs, to enable better real-time video compression.
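The sketch below illustrates the implicit paradigm for video in the spirit of NeRV-style representations: a small network is overfit to map a normalized frame index to a frame, so that the (quantized) weights act as the bitstream and decoding is one forward pass per frame. All sizes are toy assumptions; at this scale the weights actually outnumber the pixels, whereas practical implicit codecs amortize them over full-resolution videos and prune and quantize them aggressively.

import torch
import torch.nn as nn

class FrameINR(nn.Module):
    """Maps a normalized frame index t in [0, 1] to an RGB frame."""
    def __init__(self, h=32, w=32):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 3 * h * w),
        )

    def forward(self, t):  # t: (batch, 1)
        return self.net(t).view(-1, 3, self.h, self.w)

video = torch.rand(16, 3, 32, 32)  # 16 toy frames
t = torch.linspace(0, 1, 16).unsqueeze(1)  # normalized frame indices

model = FrameINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):  # "encoding" the video = training the network
    opt.zero_grad()
    loss = ((model(t) - video) ** 2).mean()
    loss.backward()
    opt.step()

# The bitstream is the parameter set (after quantization), not per-pixel data.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} weights represent {video.numel()} pixel values")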
