DEVELOPING MULTIMODAL LEARNING METHODS FOR VIDEO UNDERSTANDING
Abstract
In recent years, deep learning, and multimodal representation learning in particular, has advanced significantly, driven by progress in computer vision, speech recognition, natural language processing, and graph learning. This progress has enabled a multitude of new applications. The video domain holds especially large potential: video is often considered the most potent form of digital content for communication and the dissemination of information, and the ability to comprehend video content effectively and efficiently could prove instrumental in a variety of downstream applications. However, understanding video content presents numerous challenges, which stem from the inherently unstructured and complex nature of video and from its interactions with other forms of unstructured data, such as text and network data. The objective of this dissertation is to develop deep learning methodologies capable of understanding video across multiple dimensions. These methodologies also aim to offer a degree of interpretability, yielding insights for researchers and content creators that could carry significant managerial implications.

In the first study, I introduce a network based on Long Short-Term Memory (LSTM), enhanced with a Transformer co-attention mechanism, for predicting apparent emotion in videos. Each video is segmented into one-second clips, and pre-trained ResNet networks are employed to extract audio and visual features at the second level. A co-attention Transformer captures the interactions between the extracted audio and visual features, and an LSTM network then learns the spatiotemporal information inherent in the video. The proposed model, termed the Sec2Sec Co-attention Transformer, outperforms several state-of-the-art methods in predicting apparent emotion on a widely used dataset, LIRIS-ACCEDE. In addition, I conduct an extensive data analysis to explore how different dimensions of the visual and audio components influence video-level predictions. A notable feature of the proposed model is its interpretability, which makes it possible to study the contributions of different time points to the overall prediction and provides valuable insights into how the model arrives at its predictions.

In the second study, I introduce a novel neural network, the Multimodal Co-attention Transformer, for predicting personality from video data. The proposed methodology concurrently models audio, visual, and text representations, along with their interrelationships, to achieve accurate and efficient predictions. The effectiveness of the approach is demonstrated through comprehensive experiments on a real-world dataset, First Impressions. The results indicate that the proposed model surpasses state-of-the-art methods while preserving high computational efficiency. Beyond evaluating predictive performance, I conduct a thorough interpretability analysis to examine contributions at different levels of the model, and the findings offer a valuable understanding of personality predictions.
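The building block shared by these models is a co-attention mechanism that lets one modality attend to another before temporal modeling. As a rough illustration only, the following PyTorch sketch pairs a bidirectional audio-visual co-attention block with an LSTM in the spirit of the first study; the feature dimensions, number of attention heads, output head, and the random tensors standing in for pre-extracted ResNet features are illustrative assumptions, not the dissertation's actual configuration.

```python
# Minimal sketch of a Sec2Sec-style pipeline (illustrative assumptions only:
# dimensions, heads, and the 2-way output head are not the dissertation's setup).
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Cross-attends audio and visual feature sequences in both directions."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio, visual: (batch, seconds, dim) -- one token per one-second clip
        attn_v, _ = self.audio_to_visual(query=visual, key=audio, value=audio)
        attn_a, _ = self.visual_to_audio(query=audio, key=visual, value=visual)
        return self.norm_a(audio + attn_a), self.norm_v(visual + attn_v)


class Sec2SecSketch(nn.Module):
    """Per-second audio/visual features -> co-attention -> LSTM -> prediction."""

    def __init__(self, audio_dim=128, visual_dim=512, dim=256, num_classes=2):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, dim)
        self.proj_v = nn.Linear(visual_dim, dim)
        self.coattn = CoAttentionBlock(dim)
        self.lstm = nn.LSTM(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, seconds, audio_dim); visual_feats: (batch, seconds, visual_dim)
        a = self.proj_a(audio_feats)
        v = self.proj_v(visual_feats)
        a, v = self.coattn(a, v)              # cross-modal interaction
        fused = torch.cat([a, v], dim=-1)     # (batch, seconds, 2*dim)
        _, (h_n, _) = self.lstm(fused)        # temporal modeling over seconds
        return self.head(h_n[-1])             # video-level emotion prediction


# Usage with random stand-ins for pre-extracted ResNet features:
model = Sec2SecSketch()
audio = torch.randn(8, 10, 128)    # 8 videos, 10 one-second clips
visual = torch.randn(8, 10, 512)
logits = model(audio, visual)      # shape (8, 2)
```

The same query-key-value cross-attention pattern, applied pairwise between modalities, also underlies the personality model described above.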
Furthermore, I illustrate the practicality of video-based personality detection by predicting MBA admissions outcomes, where it serves as a decision support system. This highlights the potential importance of the proposed approach for both researchers and practitioners.

In the third study, I present a novel generalized multimodal learning model, termed VAN, which learns a unified representation of visual (V), acoustic (A), and network (N) cues. I first employ state-of-the-art encoders to model each modality and, to improve training efficiency, adopt a pre-training strategy specifically designed to extract information from the music network. I then propose a generalized Co-attention Transformer network that fuses the three types of information and learns the interrelationships among the three modalities, a critical facet of multimodal learning. To assess the effectiveness of the proposed model, I collect a real-world dataset of over 88,000 TikTok videos. Extensive experiments demonstrate that the proposed model surpasses existing state-of-the-art models in predicting video popularity. Moreover, I conduct a series of ablation studies to better understand the model's behavior, and I perform an interpretability analysis, leveraging the unique property of the proposed co-attention structure, to study the contribution of each modality to model performance. This research contributes to the field by offering a more comprehensive approach to predicting video popularity on short-form video platforms.
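To give a concrete sense of how three modalities can be fused with co-attention, the sketch below extends the same idea to visual, acoustic, and network inputs. The pairwise attention scheme, the treatment of the pre-trained network embedding as a single token, the linear stand-in encoders, and the regression head are illustrative assumptions, not VAN's actual design.

```python
# Minimal sketch of tri-modal co-attention fusion (assumptions: linear stand-in
# encoders, one pre-trained network embedding per video, illustrative dimensions).
import torch
import torch.nn as nn


class PairwiseCoAttention(nn.Module):
    """Lets one modality attend to a context built from the other modalities."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query=query, key=context, value=context)
        return self.norm(query + out)


class TriModalSketch(nn.Module):
    """Visual, acoustic, and network cues -> co-attention fusion -> popularity score."""

    def __init__(self, vis_dim=512, aud_dim=128, net_dim=64, dim=256):
        super().__init__()
        self.enc_v = nn.Linear(vis_dim, dim)   # stand-in for a visual encoder
        self.enc_a = nn.Linear(aud_dim, dim)   # stand-in for an acoustic encoder
        self.enc_n = nn.Linear(net_dim, dim)   # stand-in for a pre-trained network embedding
        self.co_v = PairwiseCoAttention(dim)
        self.co_a = PairwiseCoAttention(dim)
        self.co_n = PairwiseCoAttention(dim)
        self.head = nn.Linear(3 * dim, 1)      # popularity regression head

    def forward(self, visual, acoustic, network):
        # visual: (batch, Tv, vis_dim); acoustic: (batch, Ta, aud_dim); network: (batch, net_dim)
        v = self.enc_v(visual)
        a = self.enc_a(acoustic)
        n = self.enc_n(network).unsqueeze(1)          # treat the graph embedding as one token
        v = self.co_v(v, torch.cat([a, n], dim=1))    # visual attends to acoustic + network
        a = self.co_a(a, torch.cat([v, n], dim=1))    # acoustic attends to visual + network
        n = self.co_n(n, torch.cat([v, a], dim=1))    # network attends to visual + acoustic
        pooled = torch.cat([v.mean(1), a.mean(1), n.squeeze(1)], dim=-1)
        return self.head(pooled)                      # predicted popularity score


# Usage with random stand-ins for the three modalities:
model = TriModalSketch()
score = model(torch.randn(4, 16, 512), torch.randn(4, 16, 128), torch.randn(4, 64))
```

In this sketch each modality queries the other two, which is one simple way to expose per-modality attention weights for the kind of interpretability analysis described above.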