Understanding Events from Streaming Video - Joint Deep and Graph Representations, Commonsense Priors, and Predictive Learning
Most successful computer vision approaches to event understanding have relied on strongly supervised deep learning, often requiring large amounts of manual annotation at both the event level and the image level, and carrying an implicit closed-world assumption. Methods that operate in an open world with little or no supervision are only now emerging in computer vision. Our proposal pushes this frontier by coupling the self-supervised learning process with prior knowledge. Perception experiments and models suggest that humans build event descriptions without the need for extensive supervision. The goal of this project is to formulate, inspired by studies of the human brain and perception, a computer vision-based event understanding system that operates in a self-supervised, streaming fashion: it predicts and detects both familiar and novel events and learns to build hierarchical event representations, all in the context of a prior knowledge base that is updated over time. The intent is to generate interpretations of events rather than mere recognition.
- Award numbers: 1956050, 1955154, 1955230
- Duration: 4 years
- Award amount: $1,005,543.00
- PIs: Sudeep Sarkar (USF), Anuj Srivastava (FSU), Sathyanarayanan Aakur (OSU)
- Other participants: Aditi Basu Bal (FSU), Ahmed Shahabaz (USF), Mauricio Pamplona Segundo (USF), Sanjoy Kundu (OSU), Shenyuan Liang (FSU), Shubham Trehan (OSU), Ramy Mounir (USF)
- International Collaborators: Roman Gula and Jörn Theuerkauf, Museum and Institute of Zoology, Polish Academy of Sciences, Poland; Saket Anand, IIIT Delhi, India; Richa Singh, IIT Jodhpur, India
- Point of Contact: Sudeep Sarkar
- Date of Last Update: Jul 26, 2022
Research Challenges
(I) The integration of function-based continuous and energy-based symbolic representations
(II) The use of commonsense knowledge bases, such as ConceptNet, to move beyond the closed world of training-data semantics
(III) The use of predictive self-supervised learning over both continuous and symbolic representations (a minimal sketch follows this list)
(IV) The development of dynamical models on graph manifolds that allow for generative modeling of graph structures for prediction and discovery of new concepts
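To make challenge (III) concrete, below is a minimal sketch of prediction-error-gated event segmentation in the spirit of our streaming segmentation work: an encoder-predictor pair learns by self-supervision, and a spike in the prediction error marks an event boundary. The names (FramePredictor, segment_stream) and the fixed threshold are illustrative assumptions, not the project's actual code.

import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Predicts the feature encoding of the next frame from the current one."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, feat, hidden):
        # One recurrent step: the new hidden state doubles as the
        # prediction of the next frame's features.
        return self.rnn(feat, hidden)

def segment_stream(frames, encoder, predictor, threshold=1.0):
    """Emit an event boundary whenever the self-supervised prediction
    error spikes above a threshold (fixed here for simplicity)."""
    boundaries = []
    hidden = torch.zeros(1, 512)
    prev_pred = None
    for t, frame in enumerate(frames):
        feat = encoder(frame)                     # perceptual features of frame t
        if prev_pred is not None:
            error = torch.norm(feat - prev_pred)  # prediction-error signal
            if error.item() > threshold:          # surprise => new event starts
                boundaries.append(t)
        hidden = predictor(feat, hidden)
        prev_pred = hidden                        # prediction for frame t+1
    return boundaries

A practical system would adapt the threshold to a running error statistic and train the predictor online as frames stream in; the sketch shows only the gating logic.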
Project Goals
(I) Integrate deep learning with pattern theory to reason simultaneously over continuous-valued vectors and discrete symbolic representations (a toy energy computation follows this list).
(II) Formulate how to predict the next high-level description using dynamical models over graphs.
(III) Formulate concept-contextualization by building upon attention-based ideas from geometric deep learning, applied to knowledge graphs (ConceptNet), and coupling them with equilibrium propagation to work with energy-based pattern theory models.
(IV) Formulate graph-based mechanisms for the discovery of new concepts, triggered by significant prediction-error signals and by patterns of graph similarity across past event models.
(V) Create a Video-Visual Genome dataset that can be used to benchmark continuous learning algorithms. This dataset will be associated with appropriate metrics and graph-based ground-truth specifications, much like the Visual Genome Project, but for long-term videos.
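As a toy illustration of goals (I) and (III), the sketch below scores a pattern theory configuration whose generators carry both a continuous feature vector and a discrete symbol. Bond energies mix deep-feature similarity with a commonsense affinity looked up from a ConceptNet-style table; lower total energy means a more plausible interpretation. The affinity values, the weights, and the person/paddling/kayak example are hypothetical.

import numpy as np

def bond_energy(g1, g2, affinity, w_feat=1.0, w_sym=1.0):
    """Energy of one bond: lower = more compatible pair of generators."""
    # Continuous term: negative cosine similarity of deep features.
    feat_term = -np.dot(g1["feat"], g2["feat"]) / (
        np.linalg.norm(g1["feat"]) * np.linalg.norm(g2["feat"]))
    # Symbolic term: negative commonsense affinity (order-insensitive lookup).
    key, rkey = (g1["symbol"], g2["symbol"]), (g2["symbol"], g1["symbol"])
    sym_term = -affinity.get(key, affinity.get(rkey, 0.0))
    return w_feat * feat_term + w_sym * sym_term

def configuration_energy(generators, bonds, affinity):
    """Total energy of a configuration: sum over its bonded generator pairs."""
    return sum(bond_energy(generators[i], generators[j], affinity)
               for i, j in bonds)

# Toy usage: an interpretation linking 'person' and 'kayak' through 'paddling'.
affinity = {("person", "paddling"): 0.9, ("paddling", "kayak"): 0.8}
gens = [
    {"symbol": "person",   "feat": np.random.randn(8)},
    {"symbol": "paddling", "feat": np.random.randn(8)},
    {"symbol": "kayak",    "feat": np.random.randn(8)},
]
print(configuration_energy(gens, [(0, 1), (1, 2)], affinity))

Inference then amounts to searching over configurations (which generators to bond) for the lowest-energy interpretation; equilibrium propagation would supply gradients through such an energy during learning.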
Current Results
Broader Impacts
Publications
A. Basu Bal, A. Srivastava, S. Aakur, R. Mounir, S. Sarkar (2022). "Time-Series Analysis of Video Graphs Using Joint Kalman Smoothing and Registration", European Conference on Computer Vision.
S. Aakur and S. Sarkar (2022). "Actor-centered Representations for Action Localization in Streaming Videos", European Conference on Computer Vision.
R. Mounir, A. Shahabaz, R. Gula, J. Theuerkauf, and S. Sarkar (2022). "Towards Automated Ethogramming: Cognitively-Inspired Event Segmentation for Streaming Wildlife Video Monitoring", CVPR Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling.
R. Mounir, R. Gula, J. Theuerkauf, and S. Sarkar (2021). "Spatio-Temporal Event Segmentation for Wildlife Extended Videos", IAPR International Conference on Computer Vision and Image Processing.
R. Mounir, S. M. Aakur, and S. Sarkar (2021). "Self-supervised temporal event segmentation inspired by cognitive theories", Advanced Methods and Deep Learning in Computer Vision, Elsevier.
R. Mounir, A. Shahabaz, R. Gula, J. Theuerkauf, and S. Sarkar (2021). "Spatio-Temporal Event Segmentation for Wildlife Extended Videos", CVPR Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling.
Acknowledgement & Disclaimer
This material is based upon work supported by the National Science Foundation under Grant No. 1956050/1955154/1955230. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.