Recognizing Object-Centric Attributes and Relations
Abstract
Recognizing an object's visual appearance through its attributes, such as color and shape, and its relations to other objects in an environment, is an innate human ability that allows us to effortlessly interact with the world. This ability remains effective even when humans encounter unfamiliar objects or objects whose appearances evolve over time, as humans can still identify them by discerning their attributes and relations. This dissertation aims to equip computer vision systems with this capability, empowering them to recognize objects' attributes and relations and thereby become more robust in handling real-world scene complexities. The thesis is structured into two main parts.
The first part focuses on recognizing attributes for objects, an area where existing research is limited to domain-specific attributes or constrained by small-scale and noisy data. We overcome these limitations by introducing a comprehensive dataset for attributes in the wild, marked by the challenges of attribute diversity, label sparsity, and data imbalance. To navigate these challenges, we propose techniques that address class imbalance, employ an attention mechanism, and use contrastive learning to align objects with shared attributes. However, because such a dataset is expensive to collect, we also develop a framework that leverages large-scale, readily available image-text data for learning attribute prediction. The proposed framework can effectively scale up to predict a larger space of attribute concepts in real-world settings, including novel attributes expressed as arbitrary text phrases that are not encountered during training. We showcase various applications of the proposed attribute prediction frameworks, including semantic image search and object image tagging with attributes.
The second part delves into the understanding of visual relations between objects. First, we investigate how the interplay of attributes and relations can improve image-text matching. Moving beyond the computationally expensive cross-attention networks of previous studies, we introduce a dual encoder framework using scene graphs that is more efficient yet equally powerful on current image-text retrieval benchmarks. Our approach can produce scene graph embeddings rich in attribute and relation semantics, which we show to be useful for image retrieval and image tagging. Lastly, we present our work on training large vision-language models on image-text data to recognize visual relations. We formulate a new subject-centric approach that predicts multiple relations simultaneously, conditioned on a single subject. Our approach is among the first to learn from both weakly- and strongly-grounded image-text data to predict an extensive range of relationship classes.