Institute for Artificial Intelligence

Event Understanding

Video captures a tremendous amount of information, encompassing both visual and semantic knowledge. Traditional approaches to video activity understanding are based on training machine learning models or, more recently, a variety of deep learning models to capture the underlying semantics of the video using human-annotated training data. However, this restricts the trained models to the ontology given by the annotations. A deeper understanding of video activities extends beyond recognition of underlying concepts such as actions and objects: constructing deep semantic representations requires reasoning about the semantic relationships among these concepts, often beyond what is directly observed in the data.


Title: Pattern Theory-based and Commonsense Knowledge for Event Interpretation

Description: We propose an energy minimization framework that leverages large-scale commonsense knowledge bases, such as ConceptNet, to provide contextual cues for establishing semantic relationships among entities directly hypothesized from the video signal. We express this mathematically using the language of Grenander's canonical pattern generator theory. We show that the use of prior encoded commonsense knowledge alleviates the need for large annotated training datasets and helps tackle imbalance in the training data. Through extensive experiments, we show that the use of commonsense knowledge from ConceptNet allows the proposed approach to handle various challenges such as training data imbalance, weak features, complex semantic relationships, and complex visual scenes. We also find that the use of commonsense knowledge allows for highly interpretable models that can be used in a dialog model for better human-machine interaction.
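
The idea of combining detector evidence with commonsense priors in one energy function can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the confidence scores, the relatedness weights (made-up numbers standing in for values that would come from a knowledge base such as ConceptNet), the function names, and the exhaustive search, which is swapped in here for whatever inference procedure the papers below actually use.

```python
from itertools import product

# Hypothetical detector confidences for each entity hypothesized from the video.
candidates = {
    "action": {"pour": 0.40, "stir": 0.45},
    "object": {"milk": 0.50, "bowl": 0.50},
}

# Hypothetical relatedness weights standing in for commonsense knowledge.
# These are illustrative numbers, not values taken from ConceptNet.
relatedness = {
    frozenset(("pour", "milk")): 0.9,
    frozenset(("stir", "bowl")): 0.6,
    frozenset(("pour", "bowl")): 0.3,
    frozenset(("stir", "milk")): 0.2,
}


def energy(labels):
    """Energy of one interpretation; lower means more plausible."""
    slots = list(labels)
    # Unary term: reward labels the detectors are confident about.
    total = -sum(candidates[slot][labels[slot]] for slot in slots)
    # Pairwise term: reward label pairs that the prior knowledge relates.
    for i in range(len(slots)):
        for j in range(i + 1, len(slots)):
            pair = frozenset((labels[slots[i]], labels[slots[j]]))
            total -= relatedness.get(pair, 0.0)
    return total


def best_interpretation():
    """Exhaustively search all label combinations for the minimum energy."""
    slots = list(candidates)
    best = None
    for choice in product(*(candidates[slot] for slot in slots)):
        labels = dict(zip(slots, choice))
        e = energy(labels)
        if best is None or e < best[1]:
            best = (labels, e)
    return best


labels, e = best_interpretation()
print(labels, round(e, 2))
```

In this toy setting the detectors alone slightly prefer "stir", but the prior linking "pour" and "milk" lowers the energy of that combination enough to select it, which is the kind of contextual correction the description refers to.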


S. Aakur, F. D. M. de Souza, S. Sarkar (2019). "Going Deeper with Semantics: Exploiting Semantic Contextualization for Interpretation of Human Activity in Videos", Winter Conference on Applications of Computer Vision.

S. Aakur, F. D. M. de Souza, S. Sarkar (2019). "Generating Open World Descriptions of Video using Commonsense Knowledge in a Pattern Theory Framework", Quarterly of Applied Mathematics.

S. Aakur, F. D. M. de Souza, S. Sarkar (2019). "On the Inherent Explainability of Pattern Theory-based Video Event Interpretations", Explainable and Interpretable Models in Computer Vision and Machine Learning in the Springer Series on Challenges in Machine Learning.

S. Aakur, F. D. M. de Souza, S. Sarkar (2018). "An Inherently Explainable Model for Video Activity Interpretation", AAAI Workshop On Reasoning and Learning for Human-Machine Dialogues.

S. Aakur, F. D. M. de Souza, S. Sarkar (2017). "Towards a Knowledge-based approach for Generating Video Descriptions", Conference on Computer and Robot Vision.

F. D. M. de Souza, S. Sarkar, A. Srivastava, J. Su (2017). "Spatially coherent interpretations of videos using pattern theory", International Journal of Computer Vision.

F. D. M. de Souza, S. Sarkar, G. Cámara-Chávez (2016). "Building semantic understanding beyond deep learning from sound and vision", International Conference on Pattern Recognition.

F. D. M. de Souza, S. Sarkar, A. Srivastava, J. Su (2015). "Temporally Coherent Interpretations for Long Videos Using Pattern Theory", IEEE Conference on Computer Vision and Pattern Recognition.

F. D. M. de Souza, S. Sarkar, A. Srivastava, J. Su (2014). "Pattern Theory-Based Interpretation of Activities", International Conference on Pattern Recognition.


Title: Self Supervised Event Segmentation

Description: Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation, which alleviates the need for any supervision in the form of labels (full supervision) or temporal ordering (weak supervision). We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into constituent events. Learning involves only a single pass through the training data. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on publicly available datasets show the efficacy of the proposed approach: it outperforms weakly supervised and unsupervised baselines and achieves segmentation results competitive with fully supervised baselines, with only a single pass through the training data. Finally, we show that the proposed self-supervised learning paradigm learns highly discriminative features that improve action recognition.
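
One way to picture predictive segmentation is as a loop that predicts the next frame's features and flags an event boundary when the prediction error spikes. The sketch below is a toy under stated assumptions, not the framework from the paper: the naive "next frame equals current frame" predictor stands in for a learned recurrent predictor, and the function name `segment_events`, the `threshold` parameter, and the error-spike rule are all hypothetical choices made for illustration.

```python
import math


def segment_events(features, threshold=2.0):
    """Return frame indices where the prediction error spikes.

    features: list of per-frame feature vectors (lists of floats).
    threshold: hypothetical multiplier over the average error seen so far
               within the current event.
    """
    boundaries = []
    recent_errors = []  # prediction errors observed within the current event
    for t in range(1, len(features)):
        predicted = features[t - 1]  # naive predictor: next frame = current frame
        error = math.dist(predicted, features[t])
        if recent_errors:
            baseline = sum(recent_errors) / len(recent_errors)
            if error > threshold * baseline:
                # The error is far above what this event has shown so far:
                # mark a boundary and start collecting errors for a new event.
                boundaries.append(t)
                recent_errors = []
                continue
        recent_errors.append(error)
    return boundaries


# Toy one-dimensional features: a quiet event, then a jump to a second event.
frames = [[0.0], [0.1], [0.0], [0.1], [5.0], [5.1], [5.0], [5.1]]
print(segment_events(frames))  # prints [4]
```

No labels or temporal ordering are used here: the only signal is how well the next frame can be predicted, and each frame is visited once.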


S. N. Aakur, S. Sarkar (2019). "A Perceptual Prediction Framework for Self Supervised Event Segmentation", IEEE Conference on Computer Vision and Pattern Recognition.