Multi Modal Analysis

To be updated soon

Our Recent Works

MALTA: multi-Modal And multi-Lingual Temporal sentence Alignment

MALTA consists of simple video tutorials of two types: (i) TFT, which describes the creation of scientific toys from waste material, and (ii) ATMA, which features farmers describing and demonstrating organic farming techniques.

S3VQA: Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

S3VQA introduces a Select, Substitute, and Search (SSS) approach to open-domain visual question answering. S3 answers a VQA query by first reformulating the input question and then retrieving facts from an external knowledge source.

RUDDER: cRoss lingUal viDeo anD tExt Retrieval

RUDDER contains videos that describe the creation of scientific toys from waste material. Existing datasets so far pair videos with their relevant sentences/captions in English, whereas RUDDER provides videos, sentences/captions, and audio as well.

AVVP: Audio Visual Video Parsing

We present a novel approach to the audio-visual video parsing task that takes into account how event categories bind to the audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries of such events.

Investigating Modality Bias in Audio Visual Video Parsing

We provide a detailed analysis of modality bias in the existing HAN architecture, where one modality is completely ignored during prediction. We also propose a variant of feature aggregation in HAN that leads to an absolute gain for the visual modality.


Contributors


Jayaprakash A
IIT Bombay

Abhishek
IIT Bombay

Jatin Lamba
IIT Bombay

Mayank Kothyari
IIT Bombay

Rishabh Dabral
IIT Bombay

Preethi Jyothi
IIT Bombay

Indian Institute of Technology, Bombay
Powai, Mumbai 400 076,
Maharashtra, India.
P: +91 (22) 2572 2545
F: +91 (22) 2572 3480