Towards solving recognition and detection problems in real life


Files

Wu_umd_0117E_20183.pdf (17.95 MB)
(RESTRICTED ACCESS)

Publication or External Link

Date

2019

Authors

Citation

Abstract

Recognition and detection are essential topics in video and image analysis, especially in applications aimed at real-life settings. There are many challenges to solve, ranging from (a) varied background conditions that can bury essential recognition cues, such as illumination, occlusion, and poor recording conditions, to (b) imperfect data annotation that can misguide the classifier and trap the learning process, and (c) intentional image editing and filtering that make an image appealing but cause trained classifiers to fail. The first type poses a great challenge to a recognition system, since finding discriminative evidence amid background noise is like finding a needle in a haystack. Missing or incorrect annotations are inevitable in any data annotation procedure, and in the era of Big Data, minimizing the effect of this type of noise becomes ever more essential. For artistic image filtering, we need to learn the stylization information from a limited number of filtered samples and make the trained model robust to varied appearance transformations. In this dissertation, we specifically study three types of visual problems in three challenging real-life applications.

First, we study deception detection in videos. We propose an automated approach for deception detection in real-life trial videos. Mining the subtle cues of deception is difficult due to its covert nature; at the same time, we need to handle background noise from the unconstrained setting. To solve this problem, we build a multi-modal system that takes into account three different modalities (visual, audio, and text). On the vision side, our system uses classifiers trained on low-level video features to predict human micro-expressions. We show that predictions of high-level micro-expressions can be used as features for deception prediction. MFCC (Mel-frequency Cepstral Coefficient) features from the audio domain also provide a significant boost in performance, while information from transcripts is not very beneficial for our system. We demonstrate that utilizing multiple modalities is more effective than any single modality. We also present results of a user study analyzing how well average humans perform on this task, which modalities they use for deception detection, and how they perform if only one modality is accessible.
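
A minimal sketch of the late-fusion idea described above: MFCC audio features concatenated with predicted micro-expression scores, then fed to a linear classifier. The file names, feature dimensions, and classifier choice are illustrative assumptions, not the dissertation's exact setup.

```python
import numpy as np
import librosa
from sklearn.svm import LinearSVC

def mfcc_features(wav_path, n_mfcc=13):
    """Mean-pooled MFCCs over the whole clip (a common fixed-length summary)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # (n_mfcc,)

def fuse(wav_path, micro_expression_scores):
    """Concatenate audio features with micro-expression predictions for one video."""
    return np.concatenate([mfcc_features(wav_path), micro_expression_scores])

# Intended usage (wav_paths, micro_scores, labels are hypothetical placeholders):
# X = np.stack([fuse(p, s) for p, s in zip(wav_paths, micro_scores)])
# clf = LinearSVC().fit(X, labels)   # labels: 1 = deceptive, 0 = truthful
```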

In addition, most work on automated deception detection (ADD) in video has two restrictions: (i) it focuses on a video of one person, and (ii) it focuses on a single act of deception in a one- or two-minute video. As an extension, we propose a new ADD framework that captures long-term deception in a group setting. We study deception in the well-known Resistance game (similar to Mafia and Werewolf), which consists of 5-8 players, of whom 2-3 are spies. Spies are deceptive throughout the game (typically 30-65 minutes) to keep their identity hidden. We develop an ensemble predictive model to identify spies in Resistance videos. We show that features from low-level and high-level video analysis are insufficient on their own, but when combined with a new class of features that we call spyrank, they produce the best results. We achieve AUCs of over 0.70 in a fully automated setting.
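
A minimal sketch of the ensemble evaluation this paragraph describes: one model per feature set (low-level video, high-level video, and spyrank-style features), with predicted spy-probabilities averaged and scored by AUC. The feature contents and choice of base model are assumptions; spyrank itself is the dissertation's own feature and is treated here as a precomputed array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def ensemble_auc(train_sets, y_train, test_sets, y_test):
    """Average spy-probabilities from one classifier per feature set, score AUC.

    train_sets / test_sets: lists of (n_players, n_features) arrays, one per
    feature family (e.g. low-level video, high-level video, spyrank).
    """
    probs = []
    for X_tr, X_te in zip(train_sets, test_sets):
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
        probs.append(model.predict_proba(X_te)[:, 1])  # P(player is a spy)
    return roc_auc_score(y_test, np.mean(probs, axis=0))
```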

Second, we study the missing-annotation problem in object detection. Missing annotations are an inevitable issue for large-scale datasets. In this setting, unlabeled object instances are treated as background, which generates an incorrect training signal for the detector. Interestingly, through a preliminary study, we observe that after dropping 30% of the annotations (and labeling them as background), the performance of CNN-based object detectors like Faster-RCNN only drops by 5% on the PASCAL VOC dataset. We provide a detailed explanation for this result. To further bridge the performance gap, we propose a simple yet effective solution, called Soft Sampling. Soft Sampling re-weights the gradients of RoIs as a function of their overlap with positive instances. This ensures that uncertain background regions are given a smaller weight compared to hard negatives. Extensive experiments on curated PASCAL VOC datasets demonstrate the effectiveness of the proposed Soft Sampling method at different annotation drop rates. Finally, we show that on OpenImagesV3, a real-world dataset with missing annotations, Soft Sampling outperforms standard detection baselines by over 3%. It was also included in the top-performing entries in the OpenImagesV4 challenge conducted during ECCV 2018.
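
A minimal sketch of the Soft Sampling idea: background RoIs far from any annotated box (low max-IoU) are down-weighted, since they may be unlabeled objects, while hard negatives overlapping a positive keep a high weight. The specific weighting curve below is an illustrative monotone choice, not necessarily the dissertation's exact function.

```python
import torch

def soft_sampling_weights(ious_with_gt, floor=0.25):
    """Per-RoI loss weights in [floor, 1].

    ious_with_gt: (num_rois, num_gt) pairwise IoU between background RoIs
    and annotated (positive) boxes.
    """
    if ious_with_gt.numel() == 0:
        # No annotated boxes: treat all background RoIs as uncertain.
        return torch.full((ious_with_gt.shape[0],), floor)
    max_iou = ious_with_gt.max(dim=1).values
    # Isolated background (max_iou ~ 0) -> weight ~ floor;
    # hard negatives near a positive (max_iou high) -> weight ~ 1.
    return floor + (1.0 - floor) * max_iou

# Applied by scaling each background RoI's classification loss:
# loss = (soft_sampling_weights(ious) * per_roi_loss).mean()
```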

Last, deep neural networks have been shown to generalize poorly when small perturbations (like Gaussian noise) are added, yet little work has been done to evaluate their robustness to more natural image transformations like photo filters. This chapter presents a study of how popular pretrained models are affected by commonly used Instagram filters. To this end, we introduce ImageNet-Instagram, a filtered version of ImageNet in which 20 popular Instagram filters are applied to each image in ImageNet. Our analysis suggests that simple structure-preserving filters that only alter the global appearance of an image can lead to large differences in the convolutional feature space. To improve generalization, we introduce a lightweight de-stylization module that predicts parameters used for scaling and shifting feature maps to "undo" the changes incurred by the filters, inverting the process of style transfer. We further demonstrate that the module can be readily plugged into modern CNN architectures together with skip connections. We conduct extensive studies on ImageNet-Instagram and show, quantitatively and qualitatively, that the proposed module can effectively improve generalization by simply learning normalization parameters without retraining the entire network, thus recovering the alterations in the feature space caused by the filters.
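
A minimal sketch of the de-stylization idea: a small predictor maps the filtered image to per-channel scale and shift parameters, which are applied to an intermediate feature map through a skip connection so the module only has to learn a correction. Layer sizes and the pooling-based predictor are illustrative assumptions, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class DeStylization(nn.Module):
    def __init__(self, num_channels, hidden=64):
        super().__init__()
        # Lightweight predictor: global image statistics -> (gamma, beta).
        self.predict = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # (B, 3, H, W) -> (B, 3, 1, 1)
            nn.Flatten(),              # -> (B, 3)
            nn.Linear(3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * num_channels),
        )

    def forward(self, image, feat):
        """image: (B, 3, H, W) filtered input; feat: (B, C, h, w) CNN features."""
        gamma, beta = self.predict(image).chunk(2, dim=1)  # each (B, C)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)          # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Skip connection: the original features pass through unchanged,
        # plus a learned scale-and-shift correction for the filter.
        return feat + gamma * feat + beta
```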

Notes

Rights