Towards Immersive Visual Content with Machine Learning
Abstract
Extended reality technology stands poised to revolutionize how we perceive, learn, and engage with our environment. However, transforming data captured in the physical world into digital content for immersive experiences continues to pose challenges. In this dissertation, I present my research on employing machine learning algorithms to enhance the generation and representation of immersive visual data.
Firstly, I address the problem of recovering depth information from videos captured with 360-degree cameras. I propose a novel technique that unifies the representation of object depth and surface normal using double quaternions. Experimental results demonstrate that training with a double-quaternion-based loss function improves the prediction accuracy of a neural network that takes 360-degree video frames as input.
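The abstract does not spell out the double-quaternion construction, so the sketch below is only a hedged illustration: it packs a per-pixel depth and surface normal into an 8-component, dual-quaternion-style vector and penalizes the squared distance between the predicted and ground-truth encodings. The encoding, the helper names (quaternion_multiply, pack_depth_and_normal, double_quaternion_loss), and the PyTorch framing are assumptions of mine rather than the dissertation's exact formulation.

import torch

def quaternion_multiply(a, b):
    # Hamilton product of quaternions stored as (w, x, y, z) in the last dim.
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], dim=-1)

def normal_to_quaternion(normal, eps=1e-8):
    # Shortest-arc rotation taking the canonical axis (0, 0, 1) onto the unit
    # normal; the degenerate case normal == (0, 0, -1) is ignored in this sketch.
    z = torch.tensor([0.0, 0.0, 1.0], device=normal.device).expand_as(normal)
    w = 1.0 + (z * normal).sum(dim=-1, keepdim=True)
    q = torch.cat([w, torch.cross(z, normal, dim=-1)], dim=-1)
    return q / (q.norm(dim=-1, keepdim=True) + eps)

def pack_depth_and_normal(depth, normal, ray_dir):
    # Real part encodes the normal as a rotation; dual part encodes the
    # translation depth * ray_dir via the usual dual-quaternion rule
    # q_d = 0.5 * t * q_r, with t written as a pure quaternion.
    q_r = normal_to_quaternion(normal)
    t = torch.cat([torch.zeros_like(depth[..., None]),
                   depth[..., None] * ray_dir], dim=-1)
    q_d = 0.5 * quaternion_multiply(t, q_r)
    return torch.cat([q_r, q_d], dim=-1)  # (..., 8)

def double_quaternion_loss(pred_depth, pred_normal, gt_depth, gt_normal, ray_dir):
    pred = pack_depth_and_normal(pred_depth, pred_normal, ray_dir)
    gt = pack_depth_and_normal(gt_depth, gt_normal, ray_dir)
    return ((pred - gt) ** 2).mean()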
Secondly, I examine the problem of efficiently representing 4D light fields using the emerging concept of neural fields. Light fields hold significant potential for immersive visual applications; however, their widespread adoption is hindered by the substantial cost of storing and transmitting such high-dimensional data. I propose a novel approach for representing light fields. Departing from previous approaches, I treat the light field data as a mapping function from pixel coordinates to color and train a neural network to accurately learn this mapping. This functional representation enables high-quality interpolation and super-resolution on light fields while achieving state-of-the-art results in light field compression.
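To make the functional view concrete, here is a minimal PyTorch sketch of a coordinate network that maps a 4D light-field coordinate to an RGB color; the Fourier-feature encoding, layer sizes, and toy fitting loop are illustrative assumptions, not the architecture proposed in the dissertation.

import math
import torch
import torch.nn as nn

class LightFieldNet(nn.Module):
    # MLP mapping a 4D light-field coordinate (u, v, s, t) to an RGB color.
    def __init__(self, num_freqs=10, hidden=256, depth=6):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 4 * 2 * num_freqs                    # sin/cos Fourier features
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, 3), nn.Sigmoid()]
        self.mlp = nn.Sequential(*layers)

    def encode(self, x):
        # Positional encoding: sines and cosines at increasing frequencies.
        freqs = (2.0 ** torch.arange(self.num_freqs, device=x.device).float()) * math.pi
        ang = x.unsqueeze(-1) * freqs                 # (N, 4, num_freqs)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

    def forward(self, coords):
        # coords: (N, 4) coordinates normalized to [-1, 1]; returns (N, 3) colors.
        return self.mlp(self.encode(coords))

# Fitting the field is plain color regression at sampled coordinates; novel
# views or super-resolved pixels are then queried at coordinates never seen
# during training. Random tensors stand in for real light-field samples here.
model = LightFieldNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
coords = torch.rand(4096, 4) * 2 - 1
colors = torch.rand(4096, 3)
for _ in range(10):
    loss = ((model(coords) - colors) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()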
Thirdly, I present neural subspaces for light fields. I adapt the ideas of subspace learning and tracking and identify the conceptual relationship between neural representations of light fields and the framework of subspace learning. My method treats a light field as an aggregate of local segments, or multiple local neural subspaces, and trains a set of local neural networks, each encoding a subset of viewpoints. Because each local network specializes in a specific region, the networks can be made smaller without compromising accuracy.
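As an illustration of this local decomposition, the sketch below splits a grid of viewpoints into tiles and routes each query to a compact network responsible for that tile; the grid size, tiling scheme, routing rule, and network shape are assumptions made for the example, not the dissertation's exact design.

import torch
import torch.nn as nn

def small_mlp(in_dim=4, hidden=64, depth=4):
    # A deliberately small coordinate network serving one tile of viewpoints.
    layers = []
    for i in range(depth):
        layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
    return nn.Sequential(*layers, nn.Linear(hidden, 3), nn.Sigmoid())

class LocalNeuralSubspaces(nn.Module):
    def __init__(self, grid=(8, 8), tiles=(2, 2), hidden=64):
        super().__init__()
        self.tiles = tiles
        self.tile_h = grid[0] // tiles[0]             # viewpoints per tile, vertically
        self.tile_w = grid[1] // tiles[1]             # viewpoints per tile, horizontally
        self.locals = nn.ModuleList(
            small_mlp(hidden=hidden) for _ in range(tiles[0] * tiles[1]))

    def forward(self, coords, view_uv):
        # coords: (N, 4) light-field coordinates; view_uv: (N, 2) integer
        # viewpoint indices used only to pick the responsible local network.
        tile = (view_uv[:, 0] // self.tile_h) * self.tiles[1] \
             + (view_uv[:, 1] // self.tile_w)
        out = torch.zeros(coords.shape[0], 3, device=coords.device)
        for t, net in enumerate(self.locals):
            mask = tile == t
            if mask.any():
                out[mask] = net(coords[mask])
        return out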
Fourthly, I introduce a primary-ray-based implicit function to represent geometric shapes. Traditional implicit shape representations, such as the signed distance function, describe a shape by its relationship to each point in space. Such point-based representations often necessitate costly iterative sphere tracing to render a surface hit point. I propose a ray-based approach to implicit shape modeling, in which the shape is implicitly described by its relationship with each ray in 3D space. To render the hit point, my method requires only a single inference pass, considerably reducing the computational cost of rendering.
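The sketch below illustrates the single-pass idea under one plausible parameterization: each ray is described by an (assumed) Plücker-style direction and moment, and a network predicts a hit probability and the distance to the hit point directly, in contrast to repeatedly querying a signed distance function during sphere tracing. The architecture and output heads are my assumptions, not the dissertation's model.

import torch
import torch.nn as nn

class RayImplicitNet(nn.Module):
    def __init__(self, hidden=256, depth=6):
        super().__init__()
        layers, in_dim = [], 6                        # ray direction (3) + moment (3)
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.hit_head = nn.Linear(hidden, 1)          # probability the ray hits the shape
        self.dist_head = nn.Linear(hidden, 1)         # distance to the hit point along the ray

    def forward(self, origins, dirs):
        dirs = dirs / dirs.norm(dim=-1, keepdim=True)
        # Plücker moment: independent of where the origin sits along the ray.
        moment = torch.cross(origins, dirs, dim=-1)
        feats = self.trunk(torch.cat([dirs, moment], dim=-1))
        return torch.sigmoid(self.hit_head(feats)), self.dist_head(feats)

# Rendering a whole batch of rays is a single forward pass, with no iterative
# marching toward the surface.
net = RayImplicitNet()
origins = torch.randn(1024, 3)
dirs = torch.randn(1024, 3)
hit_prob, distance = net(origins, dirs)
hit_points = origins + distance * dirs                # predicted surface hit points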
Lastly, I describe a technique for generating novel view renderings without relying on any 3D structure or camera pose information. I harness neural fields to encode individual images without estimating their camera poses. My method learns a latent code for each image in a multi-view collection and then produces plausible, photorealistic novel view renderings by interpolating these latent codes. This entirely 3D-agnostic approach avoids the computational cost incurred by explicit 3D representations, offering a promising outlook on employing image-based neural fields for image manipulation tasks beyond fitting and super-resolving known images.
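The following auto-decoder-style sketch illustrates the idea: every image owns a learnable latent code, a shared network decodes (latent, pixel coordinate) pairs into colors, and a novel view is produced by decoding an interpolated latent. The latent size, decoder shape, and the simple linear interpolation are assumptions for illustration, not the dissertation's exact method.

import torch
import torch.nn as nn

class ImageFieldDecoder(nn.Module):
    def __init__(self, num_images, latent_dim=128, hidden=256, depth=6):
        super().__init__()
        self.codes = nn.Embedding(num_images, latent_dim)  # one learnable latent per image
        layers, in_dim = [], latent_dim + 2                # latent + pixel coordinate (x, y)
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, 3), nn.Sigmoid()]
        self.mlp = nn.Sequential(*layers)

    def decode(self, z, xy):
        # z: (N, latent_dim) latent codes; xy: (N, 2) pixel coordinates in [-1, 1].
        return self.mlp(torch.cat([z, xy], dim=-1))

    def forward(self, image_ids, xy):
        # During training, each image is fit by optimizing both the shared
        # decoder and that image's latent code.
        return self.decode(self.codes(image_ids), xy)

# A plausible in-between view: decode every pixel coordinate with a latent
# interpolated between two source images' codes.
model = ImageFieldDecoder(num_images=20)
xy = torch.stack(torch.meshgrid(
    torch.linspace(-1, 1, 64), torch.linspace(-1, 1, 64), indexing="ij"),
    dim=-1).reshape(-1, 2)
z = 0.5 * model.codes.weight[3] + 0.5 * model.codes.weight[7]
novel_view = model.decode(z.expand(xy.shape[0], -1), xy).reshape(64, 64, 3)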