IMPROVING EFFICIENCY AND SCALABILITY IN VISUAL SURVEILLANCE APPLICATIONS
Davis, Larry S
MetadataShow full item record
We present four contributions to visual surveillance: (a) an action recognition method based on the characteristics of human motion in image space; (b) a study of the strengths of five regression techniques for monocular pose estimation that highlights the advantages of kernel PLS; (c) a learning-based method for detecting objects carried by humans requiring minimal annotation; (d) an interactive video segmentation system that reduces supervision by using occlusion and long term spatio-temporal structure information. We propose a representation for human actions that is based solely on motion information and that leverages the characteristics of human movement in the image space. The representation is best suited to visual surveillance settings in which the actions of interest are highly constrained, but also works on more general problems if the actions are ballistic in nature. Our computationally efficient representation achieves good recognition performance on both a commonly used action recognition dataset and on a dataset we collected to simulate a checkout counter. We study discriminative methods for 3D human pose estimation from single images, which build a map from image features to pose. The main difficulty with these methods is the insufficiency of training data due to the high dimensionality of the pose space. However, real datasets can be augmented with data from character animation software, so the scalability of existing approaches becomes important. We argue that Kernel Partial Least Squares approximates Gaussian Process regression robustly, enabling the use of larger datasets, and we show in experiments that kPLS outperforms two state-of-the-art methods based on GP. The high variability in the appearance of carried objects suggests using their relation to the human silhouette to detect them. We adopt a generate-and-test approach that produces candidate regions from protrusion, color contrast and occlusion boundary cues and then filters them with a kernel SVM classifier on context features. Our method exceeds state of the art accuracy and has good generalization capability. We also propose a Multiple Instance Learning framework for the classifier that reduces annotation effort by two orders of magnitude while maintaining comparable accuracy. Finally, we present an interactive video segmentation system that trades off a small amount of segmentation quality for significantly less supervision than necessary in systems in the literature. While applications like video editing could not directly use the output of our system, reasoning about the trajectories of objects in a scene or learning coarse appearance models is still possible. The unsupervised segmentation component at the base of our system effectively employs occlusion boundary cues and achieves competitive results on an unsupervised segmentation dataset. On videos used to evaluate interactive methods, our system requires less interaction time than others, does not rely on appearance information and can extract multiple objects at the same time.