Multi Modal Analysis

Our Recent Work

MALTA: multi-Modal And multi-Lingual Temporal sentence Alignment

MALTA consists of simple video tutorials of two types: (i) TFT, which describes the creation of scientific toys from waste material, and (ii) ATMA, which features farmers describing and demonstrating organic farming techniques.

S3VQA: Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

S3VQA provides a new approach, Select, Substitute, and Search (SSS), for open-domain visual question answering. S3 answers a VQA-type query by first reformulating the input question and then retrieving facts from an external knowledge source.
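The reformulate-then-retrieve flow can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the S3VQA implementation: the placeholder phrases, the in-memory `KNOWLEDGE_BASE`, the word-overlap ranking, and the function names `select_span`, `substitute`, and `search` are all hypothetical, and the image label is assumed to be supplied as a string by some upstream visual model.

```python
import re

# Hypothetical toy knowledge source; a real system would query an external one.
KNOWLEDGE_BASE = [
    "A golden retriever is a dog breed that originated in Scotland.",
    "A bicycle is a vehicle with two wheels.",
]

# Hypothetical phrases that refer to something visible in the image.
PLACEHOLDERS = ("this animal", "this object", "this vehicle")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_span(question):
    """Select: locate the first placeholder phrase, as (start, end), or None."""
    lowered = question.lower()
    for phrase in PLACEHOLDERS:
        start = lowered.find(phrase)
        if start != -1:
            return start, start + len(phrase)
    return None


def substitute(question, span, label):
    """Substitute: replace the selected span with the label taken from the image."""
    if span is None:
        return question
    start, end = span
    return question[:start] + label + question[end:]


def search(query, knowledge_base, top_k=1):
    """Search: rank facts by the number of tokens they share with the query."""
    query_tokens = set(tokenize(query))
    ranked = sorted(
        knowledge_base,
        key=lambda fact: len(query_tokens & set(tokenize(fact))),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    question = "Where did this animal originate?"
    label = "golden retriever"  # assumed output of an image model
    reformulated = substitute(question, select_span(question), label)
    print(reformulated)
    print(search(reformulated, KNOWLEDGE_BASE))
```

Here the reformulated question carries the visual information as text, so the retrieval step can work with an ordinary text query.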

RUDDER: cRoss lingUal viDeo anD tExt Retrieval

RUDDER contains videos that describe the creation of scientific toys from waste material. Until now, existing datasets have provided videos and their relevant sentences/captions in English, whereas RUDDER provides videos, sentences/captions, and audio as well.

AVVP: Audio Visual Video Parsing

We present a novel approach to the audio-visual video parsing task that takes into account how event categories bind to the audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries of such events.

