Towards Robust, Interpretable and Scalable Visual Representations

Thumbnail Image


Publication or External Link






Visual representation is one of the central problems in computer vision. The essential problem is to develop a unified representation that effectively encodes both visual appearance and spatial information so that it can be easily applied to various vision applications such as face recognition, image matching, and multimodal image retrieval. Along with the history of computer vision research, there are four major levels of visual representations, i.e., geometric, low-level, mid-level and high-level. The dissertation comprises four works studying effective visual representations in the four different levels. Multiple approaches are proposed with the aim of improving the robustness, interpretability, and scalability of visual representations.

Geometric features are effective in matching images under spatial transformations however their performance is sensitive to the noises. In the first part, we propose to model the uncertainty of geometric representation based on line segments and propose to equip these features with uncertainty modeling so that they could be robustly applied in the image-based geolocation application.

We study in the second part the robustness of feature encoding to noisy keypoints. We show that traditional feature encoding is sensitive to background or noisy features. We propose the Selective Encoding framework which learns the relevance distribution of each codeword and incorporate such information with the original codebook model. Our approach is more robust to the localization errors or uncertainty in the active face authentication application.

The mission of visual understanding is to express and describe the image content which is essentially relating images to human language. That typically involves finding a common representation inferable from both domains of data. In the third part, we propose a framework to extract a mid-level spatial representation directly from language descriptions and match such spatial layouts to the detected object bounding boxes for retrieving indoor scene images from user text queries.

Modern high-level visual features are typically learned from supervised datasets, whose scalability is largely limited by the requirement of dedicated human annotation. In the last part, we propose to learn visual representations from large-scale weakly supervised data for a large number of natural language-based concepts, i.e., n-gram phrases. We propose the differentiable Jelinek-Mercer smoothing loss and train a deep convolutional neural network from images with associated user comments. We show that the learned model can predict a large number of phrase-based concepts from images, can be effectively applied to image-caption applications and transfers well to other visual recognition datasets.