IMAGE AND VIDEO UNDERSTANDING WITH CONSTRAINED RESOURCES

dc.contributor.advisor: Davis, Larry S.
dc.contributor.author: WU, ZUXUAN
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2020-07-07T05:30:38Z
dc.date.available: 2020-07-07T05:30:38Z
dc.date.issued: 2020
dc.description.abstract: Recent advances in computer vision have been driven by high-capacity deep neural networks, particularly Convolutional Neural Networks (CNNs) with hundreds of layers trained in a supervised manner. However, this poses two significant challenges: (1) the increased depth that yields large improvements on competitive benchmarks also limits deployment in real-world scenarios due to high computational cost; (2) the need to collect millions of human-labeled samples for training prevents such approaches from scaling, especially for fine-grained image understanding such as semantic segmentation, where dense annotations are extremely expensive to obtain. To mitigate these issues, we focus on image and video understanding with constrained resources, in the form of both computational resources and annotation resources. In particular, we present approaches that (1) investigate dynamic computation frameworks, which adaptively allocate computing resources on the fly for a novel image/video to manage the trade-off between accuracy and computational complexity, and (2) derive robust representations with minimal human supervision by exploring context relationships or using shared information across domains. With this in mind, we first introduce BlockDrop, a conditional computation approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. Next, we generalize the idea of conditional computation from images to videos by presenting AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame assumes access to all frames of a video and hence can only be used in offline settings. To mitigate this issue, we introduce LiteEval, a simple yet effective coarse-to-fine framework for resource-efficient video recognition that is suitable for both online and offline scenarios. To derive robust feature representations with limited annotation resources, we first explore the power of spatial context as a supervisory signal for learning visual representations. In addition, we propose to learn from synthetic data rendered by modern computer graphics tools, where ground-truth labels are readily available. To this end, we propose Dual Channel-wise Alignment Networks (DCAN), a simple yet effective approach that reduces domain shift at both the pixel level and the feature level for unsupervised scene adaptation.
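
The conditional-computation idea named in the abstract can be illustrated with a minimal sketch. The snippet below is not the dissertation's BlockDrop implementation; the block structure, the pooling-plus-linear policy, the 0.5 threshold, and the class names (SkippableResidualBlock, TinyDynamicNet) are illustrative assumptions. It only shows the inference-time mechanism of a per-input policy deciding which residual blocks to execute; how such a policy is trained is not shown here.

import torch
import torch.nn as nn

class SkippableResidualBlock(nn.Module):
    """Residual block that a per-input policy can execute or bypass (hypothetical)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, execute):
        # When the policy drops this block, only the free identity path is used.
        return x + self.body(x) if execute else x

class TinyDynamicNet(nn.Module):
    """Toy backbone whose blocks run only when a lightweight policy keeps them."""
    def __init__(self, channels=16, num_blocks=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList([SkippableResidualBlock(channels) for _ in range(num_blocks)])
        self.policy = nn.Linear(channels, num_blocks)  # assumed policy: one keep/drop logit per block
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        feat = self.stem(x)
        summary = feat.mean(dim=(2, 3))                    # cheap per-image summary for the policy
        keep = torch.sigmoid(self.policy(summary)) > 0.5   # hard keep/drop decision per block
        for i, block in enumerate(self.blocks):
            # Batch size 1 here for clarity; each input may execute a different subset of blocks.
            feat = block(feat, execute=bool(keep[0, i]))
        return self.head(feat.mean(dim=(2, 3)))

net = TinyDynamicNet()
logits = net(torch.randn(1, 3, 32, 32))  # skipped blocks cost only the identity shortcut

The savings come from blocks whose body is never evaluated; per the abstract, the actual BlockDrop policy is learned so that computation is reduced without degrading prediction accuracy.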
dc.identifier: https://doi.org/10.13016/uwey-eese
dc.identifier.uri: http://hdl.handle.net/1903/26016
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pquncontrolled: Conditional Computation
dc.subject.pquncontrolled: Image Recognition
dc.subject.pquncontrolled: Video Understanding
dc.title: IMAGE AND VIDEO UNDERSTANDING WITH CONSTRAINED RESOURCES
dc.type: Dissertation

Files

Original bundle
Name: WU_umd_0117E_20659.pdf
Size: 6.44 MB
Format: Adobe Portable Document Format