Leveraging Structure in Activity Recognition: Context and Spatiotemporal Dynamics

dc.contributor.advisor: Davis, Larry S [en_US]
dc.contributor.author: Khamis, Sameh [en_US]
dc.contributor.department: Computer Science [en_US]
dc.contributor.publisher: Digital Repository at the University of Maryland [en_US]
dc.contributor.publisher: University of Maryland (College Park, Md.) [en_US]
dc.description.abstract: Activity recognition is one of the fundamental problems of computer vision. An activity recognition system aims to identify the actions of humans from an image or a video. Historically, this problem has been approached in isolation, typically as one stage of a multi-stage system in which, for instance, tracking is a separate stage. However, recent work sheds light on how activity recognition is in fact entangled with other fundamental problems in the field. Tracking is one such problem, where the identity of each person is maintained across a video sequence. Scene classification is another, where scene properties are identified from image data. Affordance reasoning is yet another, where objects in the scene are assigned labels representing the types of actions that can be performed on them. In this thesis we build a joint formulation for activity recognition that models these coupled problems as latent variables. Optimizing the objective function of this formulation allows us to recover a more accurate solution to activity recognition while simultaneously recovering solutions to problems such as tracking or scene classification. We first introduce a model that jointly solves tracking and activity recognition from videos. Instead of establishing tracks in a preprocessing step, the model solves a joint optimization problem, recovering actions and identities for every person in a video sequence. We then extend this model to include frame-level cues, where the activity labels assigned to people in the same scene are made mutually compatible through a scene-level label. In the second half of the thesis we consider an alternative formulation of the same problem based on probabilistic logic. This model leverages the same temporal and spatial cues through soft logic rules. The resulting joint formulation can be solved efficiently, recovering both action labels and tracks.
Lastly, we introduce a model that reformulates action recognition in the multi-label setting, where each person can perform more than one action at the same time. In this setting, a joint formulation can solve for all the likely actions of a person by explicitly modeling correlations between action labels. We conclude with a discussion of several open challenges and how they can motivate viable future extensions. [en_US]
dc.subject.pqcontrolled: Computer science [en_US]
dc.subject.pqcontrolled: Artificial intelligence [en_US]
dc.subject.pqcontrolled: Computer engineering [en_US]
dc.subject.pquncontrolled: Activity Recognition [en_US]
dc.subject.pquncontrolled: Computer Vision [en_US]
dc.subject.pquncontrolled: Machine Learning [en_US]
dc.subject.pquncontrolled: Scene Understanding [en_US]
dc.title: Leveraging Structure in Activity Recognition: Context and Spatiotemporal Dynamics [en_US]


Original bundle
Adobe Portable Document Format, 3.52 MB
Unknown data format, 1.95 MB