Video captures a tremendous amount of information, encompassing both visual and semantic knowledge. Traditional approaches to video activity understanding are based on training machine learning models or, more recently, a variety of deep learning approaches, to capture the underlying semantics of the video using human-annotated training data. However, this restricts the trained models to the ontology given by the annotations. A deeper understanding of video activities extends beyond recognition of underlying concepts such as actions and objects: constructing deep semantic representations requires reasoning about the semantic relationships among these concepts, often beyond what is directly observed in the data.
Title: Pattern Theory-based and Commonsense Knowledge for Event Interpretation
Description: We propose an energy minimization framework that leverages large-scale commonsense knowledge bases, such as ConceptNet, to provide contextual cues for establishing semantic relationships among entities directly hypothesized from the video signal. We express this mathematically using the language of Grenander's canonical pattern generator theory. We show that the use of previously encoded commonsense knowledge alleviates the need for large annotated training datasets and helps tackle imbalance in the training data. Through extensive experiments, we show that the use of commonsense knowledge from ConceptNet allows the proposed approach to handle various challenges such as training data imbalance, weak features, complex semantic relationships, and complex visual scenes. We also find that the use of commonsense knowledge allows for highly interpretable models that can be used in a dialog model for better human-machine interaction.
S. Aakur, F. D. M. de Souza, S. Sarkar (2019). "Going Deeper with Semantics: Exploiting Semantic Contextualization for Interpretation of Human Activity in Videos", Winter Conference on Applications of Computer Vision.
S. Aakur, F. D. M. de Souza, S. Sarkar (2019). "On the Inherent Explainability of Pattern Theory-based Video Event Interpretations", Explainable and Interpretable Models in Computer Vision and Machine Learning in the Springer Series on Challenges in Machine Learning.
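The energy minimization idea in the description above can be illustrated with a toy sketch. This is not the energy function from the cited papers: the candidate labels, the confidence values, the `relatedness` table (standing in for weights that would be looked up in a knowledge base such as ConceptNet), and the function names are all hypothetical and chosen only for illustration. The sketch assumes that each interpretation is scored by the negated sum of detector support and commonsense relatedness, and that the lowest-energy combination is selected by brute force.

```python
from itertools import product

# Hypothetical detector confidences for each slot (illustrative values only).
candidates = {
    "action": {"pour": 0.40, "stir": 0.45},
    "object": {"milk": 0.50, "bowl": 0.30},
}

# Hypothetical commonsense relatedness scores, standing in for weights
# that would come from a knowledge base such as ConceptNet.
relatedness = {
    ("pour", "milk"): 0.9,
    ("stir", "milk"): 0.2,
    ("pour", "bowl"): 0.3,
    ("stir", "bowl"): 0.4,
}

def energy(action, obj):
    """Lower energy means a more plausible interpretation (assumed form)."""
    support = candidates["action"][action] + candidates["object"][obj]
    bond = relatedness.get((action, obj), 0.0)
    return -(support + bond)

def best_interpretation():
    """Brute-force search over all (action, object) pairs."""
    pairs = product(candidates["action"], candidates["object"])
    return min(pairs, key=lambda pair: energy(*pair))

print(best_interpretation())  # ('pour', 'milk')
```

In this toy setting the detector alone slightly prefers "stir", but the commonsense bond between "pour" and "milk" lowers the energy of that pair enough to change the final interpretation, which is the kind of contextual correction the description refers to.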
Title: Self Supervised Event Segmentation
Description: Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation, which alleviates the need for any supervision in the form of labels (full supervision) or temporal ordering (weak supervision). We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into constituent events. Learning involves only a single pass through the training data. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on publicly available datasets show the efficacy of the proposed approach. We show that the proposed approach outperforms weakly supervised and unsupervised baselines and achieves competitive segmentation results compared to fully supervised baselines with only a single pass through the training data. Finally, we show that the proposed self-supervised learning paradigm learns highly discriminative features that improve action recognition.
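As a rough illustration of predictive segmentation, the sketch below marks an event boundary wherever a predictor's error on the next frame exceeds a threshold. This is an assumption about how a predictive framework could signal boundaries, not the method from the paper: the naive "repeat the previous frame" predictor replaces the learned recurrent model, the scalar `features` stand in for per-frame feature vectors, and the fixed `threshold` and the function name are hypothetical.

```python
def segment_by_prediction_error(features, threshold):
    """Return frame indices where a new event is assumed to start."""
    boundaries = []
    for t in range(1, len(features)):
        predicted = features[t - 1]  # naive predictor: repeat the last frame
        error = abs(features[t] - predicted)
        if error > threshold:  # a spike in error is treated as a boundary
            boundaries.append(t)
    return boundaries

# Hypothetical one-dimensional per-frame features with two abrupt changes.
features = [0.10, 0.12, 0.11, 0.90, 0.92, 0.91, 0.30, 0.31]
print(segment_by_prediction_error(features, threshold=0.5))  # [3, 6]
```

No labels or temporal ordering are used here; the only signal is how well the next frame can be predicted from what came before, which is what makes the setting self-supervised.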