Leveraging Structure in Activity Recognition: Context and Spatiotemporal Dynamics

Thumbnail Image


Publication or External Link





Activity recognition is one of the fundamental problems of computer vision. An activity recognition system aims to identify the actions of humans from an image or a video. This problem has been historically approached in isolation, and typically as part of a multi-stage system, where tracking for instance is another part. However, recent work sheds light on how activity recognition is in fact entangled with other fundamental problems in the field. Tracking is one such instance, where the identity of each person is maintained across a video sequence. Scene classification is another example, where scene properties are identified from image data. Affordance reasoning is yet another, where the objects in the scene are assigned labels representing what types of actions can be performed upon them.

In this thesis we build a joint formulation for activity recognition, modeling the aforementioned coupled problems as latent variables. Optimizing the objective function for this formulation allows us to recover a more accurate solution to activity recognition and simultaneously solutions to problems like tracking or scene classification. We first introduce a model that jointly solves tracking and activity recognition from videos. Instead of establishing tracks in a preprocessing step, the model solves a joint optimization problem, recovering actions and identities for every person in a video sequence. We then extend this model to include frame-level cues, where activity labels assigned to people in the same scene are inter-compatible through a scene-level label.

In the second half of the thesis we look at an alternative formulation of the same problem, based on probabilistic logic. This new model leverages the same cues, temporal and spatial, through soft logic rules. This joint formulation can be efficiently solved, recovering both action labels and tracks. We finally introduce another model that reformulates action recognition in the multi-label setting, where each person can be performing more than one action at the same time. In this setting, a joint formulation can solve for all the likely actions of a person through explicit modeling of action label correlations.

Finally, we conclude with a discussion of several challenges and how they can motivate viable future extensions.