Visual Content Synthesis at Scale

dc.contributor.advisor: Jacobs, David
dc.contributor.advisor: Huang, Jia-Bin
dc.contributor.author: Ge, Songwei
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2025-08-08T12:19:09Z
dc.date.issued: 2025
dc.description.abstract: Humans love to create visual content. Every day, we take photos with smartphones, edit videos using intuitive apps, and create artworks through increasingly accessible digital tools. These widespread practices have led to an explosion of visual data shared continuously on the internet, building massive collections of images and videos that capture diverse human experiences. This enormous accumulation of visual data, together with rapid advancements in GPU computing, has become the foundation for training large-scale generative models, the key to automatically synthesizing high-quality visual content. By learning directly from rich online visual repositories, these models internalize intricate patterns, styles, and concepts, enabling them to recompose these elements into novel samples based on user inputs. In this thesis, we study and design scalable generative models that digest and improve with visual data, develop evaluation metrics that precisely monitor progress, and build applications on top of these pre-trained models. The thesis begins by designing frameworks for scalable video generation models, including both autoregressive models trained on discrete tokens obtained through a discrete tokenizer and diffusion models trained directly on pixels. In addition, we develop a novel video tokenization scheme that enables more compact video representations for larger generative models to train on. Next, we perform a careful analysis of the mainstream automatic evaluation metric. In the last chapter of the thesis, we study several practical scenarios for applying pre-trained large-scale generative models, covering tasks beyond generation and extending beyond the original image and video domains.
dc.identifier: https://doi.org/10.13016/yl3v-qb8x
dc.identifier.uri: http://hdl.handle.net/1903/34289
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pqcontrolled: Artificial intelligence
dc.title: Visual Content Synthesis at Scale
dc.type: Dissertation

Files

Original bundle

Name: Ge_umd_0117E_25157.pdf
Size: 29.32 MB
Format: Adobe Portable Document Format