Closing the Gap Between Classification and Retrieval Models

Taha, Ahmed

dc.contributor.advisor	Davis, Larry	en_US
dc.contributor.advisor	Shrivastava, Abhinav	en_US
dc.contributor.author	Taha, Ahmed	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2021-07-07T05:48:20Z
dc.date.available	2021-07-07T05:48:20Z
dc.date.issued	2021	en_US
dc.description.abstract	Retrieval networks learn a feature embedding in which similar samples are close together and different samples are far apart. This feature embedding is essential for computer vision applications such as face/person recognition, zero-shot learning, and image retrieval. Despite these important applications, retrieval networks are less popular than classification networks for multiple reasons: (1) The cross-entropy loss, used with classification networks, is more stable and converges faster than the metric learning losses used with retrieval networks. (2) The cross-entropy loss has a huge toolbox of utilities and extensions. For instance, both AdaCos and self-knowledge distillation have been proposed to tackle low sample complexity in classification networks; also, both CAM and Grad-CAM have been proposed to visualize attention in classification networks. To promote retrieval networks, it is important to equip them with an equally powerful toolbox. Accordingly, we propose an evolution-inspired approach to tackle low sample complexity in feature embedding. Then, we propose SVMax to regularize the feature embedding and avoid model collapse. Furthermore, we propose L2-CAF to visualize attention in retrieval networks. To tackle low sample complexity, we propose an evolution-inspired training approach to boost performance on relatively small datasets. The knowledge evolution (KE) approach splits a deep network into two hypotheses: the fit-hypothesis and the reset-hypothesis. We iteratively evolve the knowledge inside the fit-hypothesis by perturbing the reset-hypothesis for multiple generations. This approach not only boosts performance but also learns a slim (pruned) network with a smaller inference cost. KE reduces both overfitting and the burden of data collection. To regularize the feature embedding and avoid model collapse, we propose singular value maximization (SVMax) to promote a uniform feature embedding.
Our formulation mitigates model collapse and enables larger learning rates. SVMax is oblivious to both the input class (labels) and the sampling strategy. Thus, it promotes a uniform feature embedding in both supervised and unsupervised learning. Furthermore, we present a mathematical analysis of the mean singular value’s lower and upper bounds. This analysis makes tuning SVMax’s balancing hyperparameter easier when the feature embedding is normalized to the unit circle. To support retrieval networks with a visualization tool, we formulate attention visualization as a constrained optimization problem. We leverage the unit L2-Norm constraint as an attention filter (L2-CAF) to localize attention in both classification and retrieval networks. This approach imposes no constraints on the network architecture besides having a convolution layer. The input can be a regular image or a pre-extracted convolutional feature. The network output can be logits trained with cross-entropy or a space embedding trained with a ranking loss. Furthermore, this approach neither changes the original network weights nor requires fine-tuning. Thus, network performance remains intact. The visualization filter is applied only when an attention map is required. Thus, it poses no computational overhead during inference. L2-CAF visualizes the attention of the last convolutional layer of GoogLeNet within 0.3 seconds. Finally, we propose a compromise between retrieval and classification networks. We propose a simple, yet effective, two-head architecture: a network with both logits and feature-embedding heads. The embedding head, trained with a ranking loss, limits the overfitting capabilities of the cross-entropy loss by promoting a smooth embedding space. In our work, we leverage the semi-hard triplet loss to allow a dynamic number of modes per class, which is vital when working with imbalanced data.
Also, we refute a common assumption that training with a ranking loss is computationally expensive. When both the triplet loss sampling and computation are moved to the GPU, the training time increases by just 2%.	en_US
dc.identifier	https://doi.org/10.13016/nul5-lq6a
dc.identifier.uri	http://hdl.handle.net/1903/27316
dc.language.iso	en	en_US
dc.subject.pqcontrolled	Computer science	en_US
dc.subject.pquncontrolled	Computer vision	en_US
dc.subject.pquncontrolled	Deep learning	en_US
dc.subject.pquncontrolled	Feature embedding	en_US
dc.subject.pquncontrolled	Machine learning	en_US
dc.subject.pquncontrolled	Retrieval networks	en_US
dc.title	Closing the Gap Between Classification and Retrieval Models	en_US
dc.type	Dissertation	en_US
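
The abstract's remark about lower and upper bounds on the mean singular value can be illustrated with a small NumPy sketch. It is not taken from the dissertation: the function names are hypothetical, and the bounds are derived here under the assumption that every row of a (batch, dim) embedding matrix has unit L2 norm, so the squared singular values sum to the batch size. With k = min(batch, dim), a fully collapsed (rank-1) embedding then has mean singular value sqrt(batch)/k, and a perfectly spread one has sqrt(batch/k).

```python
import numpy as np


def mean_singular_value(embeddings: np.ndarray) -> float:
    """Mean singular value of a (batch, dim) matrix after row-wise L2 normalization."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit_rows = embeddings / np.clip(norms, 1e-12, None)
    singular_values = np.linalg.svd(unit_rows, compute_uv=False)
    return float(singular_values.mean())


def mean_singular_value_bounds(batch: int, dim: int) -> tuple:
    """(lower, upper) bounds, assuming unit-norm rows (an assumption of this sketch)."""
    k = min(batch, dim)
    lower = float(np.sqrt(batch)) / k      # rank-1 case: one singular value, rest zero
    upper = float(np.sqrt(batch / k))      # all k singular values equal
    return lower, upper


rng = np.random.default_rng(0)
batch, dim = 64, 8

# Collapsed embedding: every sample maps to the same point.
collapsed = np.tile(rng.normal(size=(1, dim)), (batch, 1))
# Spread embedding: samples point in random directions.
spread = rng.normal(size=(batch, dim))

lower, upper = mean_singular_value_bounds(batch, dim)
s_collapsed = mean_singular_value(collapsed)
s_spread = mean_singular_value(spread)

print(f"bounds: [{lower:.3f}, {upper:.3f}]")
print(f"collapsed: {s_collapsed:.3f}, spread: {s_spread:.3f}")
```

A regularizer in the spirit of SVMax would add the negative of this mean singular value, scaled by a balancing hyperparameter, to the training loss; knowing the range the term can take is what makes that weight easier to choose.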

Files

Original bundle

Name: Taha_umd_0117E_21466.pdf
Size: 14 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations