Our Recent Works

MALTA: multi-Modal And multi-Lingual Temporal sentence Alignment

MALTA consists of simple video tutorials of two types: (i) TFT, which describes the creation of scientific toys from waste material, and (ii) ATMA, which features farmers describing and demonstrating organic farming techniques.

Webpage

S3VQA: Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

S3VQA provides a new approach to open-domain visual question answering based on three steps: Select, Substitute, and Search (SSS). S3 answers a VQA query by first reformulating the input question and then retrieving facts from an external knowledge source.
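The reformulate-then-retrieve flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the placeholder phrases, the word-overlap ranking, and the small in-memory `FACTS` list are all assumptions made for illustration, and the visual detector is replaced by a label passed in by hand.

```python
import re
from typing import List, Optional

# Hypothetical stand-in for an external knowledge source.
FACTS = [
    "The Eiffel Tower is located in Paris.",
    "The Golden Gate Bridge is located in San Francisco.",
    "The Taj Mahal is located in Agra.",
]

# Hypothetical phrases that refer to something visible in the image.
PLACEHOLDERS = ("this building", "this object", "this animal")


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select(question: str) -> Optional[str]:
    """Select: pick the span of the question that points at the image."""
    lowered = question.lower()
    for span in PLACEHOLDERS:
        if span in lowered:
            return span
    return None


def substitute(question: str, span: Optional[str], label: str) -> str:
    """Substitute: replace the selected span with a detected label."""
    if span is None:
        return question
    start = question.lower().index(span)
    return question[:start] + label + question[start + len(span):]


def search(query: str, facts: List[str] = FACTS) -> Optional[str]:
    """Search: return the fact sharing the most words with the query."""
    query_words = set(tokens(query))
    scored = [(len(query_words & set(tokens(fact))), fact) for fact in facts]
    best_score, best_fact = max(scored, key=lambda pair: pair[0])
    return best_fact if best_score > 0 else None


def retrieve_fact(question: str, detected_label: str) -> Optional[str]:
    """Run the three steps in order; the label is assumed to come from a detector."""
    span = select(question)
    query = substitute(question, span, detected_label)
    return search(query)
```

For example, calling `retrieve_fact("In which city is this building located?", "Eiffel Tower")` rewrites the question around the supplied label and looks up the best-matching fact in the toy list. A real system would replace the keyword overlap with a proper retriever and the hand-supplied label with the output of a vision model.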

Webpage

RUDDER: cRoss lingUal viDeo anD tExt Retrieval.

RUDDER contains videos that describe the creation of scientific toys from waste material. Existing datasets so far pair videos with their relevant sentences/captions in English, whereas RUDDER provides videos, sentences/captions, and audio as well.

Webpage

AVVP: Audio Visual Video Parsing

We present a novel approach to the audio-visual video parsing task that takes into account how event categories bind to the audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries of such events.

Webpage

Investigating Modality Bias in Audio Visual Video Parsing

We provide a detailed analysis of modality bias in the existing HAN architecture, in which one modality is completely ignored during prediction. We also propose a variant of feature aggregation in HAN that yields an absolute gain for the visual modality.

Webpage

Multi Modal Analysis

Contributors

Designed by IIT Bombay