This research was sponsored by IBM Research, India (specifically the IBM AI Horizon Networks - IIT Bombay initiative).

Brief Abstract

In this work, we present a novel approach to the audio-visual video parsing task that takes into account how event categories bind to the audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries (start and end times) of such events. This task can be naturally formulated as a Multimodal Multiple Instance Learning (MMIL) problem. We show how the MMIL task can benefit from the following techniques geared toward self- and cross-modal learning:

  • self-supervised pre-training on a closely aligned task, audio-video grounding,
  • global context-aware attention, and
  • adversarial training.
As for pre-training, we bootstrap on a UNITER-style transformer architecture using a self-supervised audio-video grounding objective over the relatively large AudioSet dataset. The pre-trained model is then fine-tuned on an architectural variant of the state-of-the-art Hybrid Attention Network (HAN) that uses global context-aware attention and adversarial training objectives for audio-visual video parsing. Both are intended to improve self- and cross-modal learning. An attentive MMIL pooling method is leveraged to adaptively explore useful audio and visual signals from different temporal segments and modalities. We present extensive experimental evaluations on the Look, Listen, and Parse (LLP) dataset [1] and compare our approach against HAN. We also present several ablation tests to validate the effect of pre-training, attention, and adversarial training.
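
As a rough illustration of attentive MMIL pooling, the sketch below aggregates segment-level event probabilities into a single video-level prediction using one softmax over temporal segments and one over the two modalities. It follows the general form of pooling used with HAN as we understand it. The function name, the (T, 2, C) array layout, and the use of NumPy are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attentive_mmil_pooling(event_logits, temporal_logits, modality_logits):
    """Hypothetical sketch of attentive MMIL pooling.

    All inputs have shape (T, 2, C): T temporal segments, 2 modalities
    (index 0 = audio, index 1 = visual), and C event categories.
    Returns (video_prob, segment_prob) with shapes (C,) and (T, 2, C).
    """
    # Per-segment, per-modality event probabilities (multi-label, so sigmoid).
    segment_prob = 1.0 / (1.0 + np.exp(-event_logits))
    # Temporal attention: normalize over segments for each modality and class.
    w_time = softmax(temporal_logits, axis=0)
    # Modality attention: normalize over audio/visual for each segment and class.
    w_modality = softmax(modality_logits, axis=1)
    # Weighted sum over segments and modalities gives the video-level score.
    video_prob = (w_time * w_modality * segment_prob).sum(axis=(0, 1))
    return video_prob, segment_prob
```

With all attention logits set to zero, both softmaxes become uniform and the video-level score reduces to a plain average of the segment-level probabilities over segments and modalities. Learned attention logits instead let the model emphasize the segments and the modality in which an event actually occurs. Because the two softmaxes are normalized independently, the combined weights are not guaranteed to sum to exactly one.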

LLP dataset - Statistics

Dataset characteristics

  • 11,849 YouTube video clips
  • Over 25 categories
  • Examples include human speaking, singing, baby crying, dog barking, violin playing, car running, and vacuum cleaning
  • Each video is 10 s long
  • Total of 32.9 hours collected from AudioSet

Event frequency statistics

Paper

Code

Contributors


Jatin Lamba
IIT Bombay

Abhishek
IIT Bombay

Jayaprakash A
IIT Bombay

Rishabh Dabral
IIT Bombay

Preethi Jyothi
IIT Bombay