This research was sponsored by IBM Research, India (specifically the IBM AI Horizon Networks - IIT Bombay initiative).

Introduction

MALTA consists of simple video tutorials of two types:

  1. TFT, which describes the creation of scientific toys from waste material. We downloaded these videos from www.arvindguptatoys.com/toys-from-trash and obtained the content creator's consent to use them for research.
  2. ATMA, which features farmers describing and demonstrating organic farming techniques.

Both video collections have background speakers narrating the process in Marathi, making the videos rich in both visual and audio content. TFT consists of 492 videos, with an average length of 80 seconds and around 7 sentences describing each video in each of two languages, viz., Marathi and English. ATMA is smaller, consisting of 95 videos, with an average length of 111 seconds and around 18 sentences describing each video in a single language, viz., Marathi.
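These per-collection figures are consistent with the total recording time listed under the dataset characteristics on this page; a quick back-of-the-envelope check (using the quoted averages, so the result is approximate):

```python
# Back-of-the-envelope check of MALTA's total duration from the
# per-collection video counts and average lengths quoted above.
TFT_VIDEOS, TFT_AVG_SEC = 492, 80     # Toys-from-Trash collection
ATMA_VIDEOS, ATMA_AVG_SEC = 95, 111   # organic-farming collection

total_videos = TFT_VIDEOS + ATMA_VIDEOS
total_seconds = TFT_VIDEOS * TFT_AVG_SEC + ATMA_VIDEOS * ATMA_AVG_SEC
total_hours = total_seconds / 3600

print(total_videos)                 # 587
print(f"{total_hours:.2f} hours")   # 13.86 hours
```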

The ground truth of MALTA was generated by instructing around 10 annotators to attend closely to both the audio and visual streams while aligning the sentence captions with the video.
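The on-disk format of these alignments is not specified here; purely as an illustration, one aligned caption could be represented as a record with start/end times in seconds (all field names below are hypothetical, not MALTA's actual schema):

```python
# Hypothetical representation of a single aligned caption.
# Field names and structure are assumptions for illustration only.
annotation = {
    "video_id": "tft_0001",                # hypothetical identifier
    "language": "en",                      # caption language (Marathi or English)
    "caption": "Cut the bottle in half.",  # example caption text
    "start_sec": 12.0,                     # where the caption begins in the video
    "end_sec": 18.5,                       # where it ends
}

def overlaps(a, b):
    """True if two temporal annotations overlap in time."""
    return a["start_sec"] < b["end_sec"] and b["start_sec"] < a["end_sec"]
```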

Watch the collection of Marathi videos: HERE!

Data

Understanding videos via captioning has gained a lot of traction recently. While captions are often provided alongside videos, the information about where a caption aligns within a video is missing; such alignment information is particularly useful for indexing and retrieval.

Dataset Characteristics

  •  587 videos

  •  Each sentence is available in both languages (Marathi and English)

  •  Audio in the Marathi language

  •  Approximately 13.86 hours of recordings

CLARIFICATION

MALTA is intended for localizing sentences/captions in videos by leveraging both the audio and video modalities, including approaches that can generalize to new and possibly low-resource language settings. It is a rich new dataset whose annotation is driven by both the audio and visual modalities, and which is richer in the audio modality than previous datasets. Further, MALTA has sentences in two languages (including the language of the speech in the audio modality).
Existing datasets so far pair videos with sentences/captions in English only, whereas MALTA provides videos, sentences/captions, and audio. The audio is in Marathi, while the sentences/captions are available in both Marathi and English.

Description


In MALTA_V, English transcripts are aligned by attending to the video alone, while in MALTA_AV, Marathi transcripts are aligned by attending to both the video and the audio.
We empirically demonstrate on MALTA that using the audio modality is effective even when the language of the speech in the videos (Marathi) differs from the language in which the sentences are expressed (English). Even with this language mismatch, we see a significant improvement when using the audio modality with CONC-AV compared to using only the video modality.
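The internals of CONC-AV are not detailed on this page; the name suggests simple concatenation of audio and video features before scoring segments against a sentence. A minimal sketch of that fusion step, where the function name and feature dimensions are assumptions:

```python
import numpy as np

def concat_av(video_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
    """Fuse per-segment video and audio features by concatenation.

    video_feats: (num_segments, d_v) array of visual features
    audio_feats: (num_segments, d_a) array of audio features
    Returns a (num_segments, d_v + d_a) array; a localization model
    would then score each fused segment against a sentence embedding.
    """
    assert video_feats.shape[0] == audio_feats.shape[0], "segment counts must match"
    return np.concatenate([video_feats, audio_feats], axis=1)

# Example: 10 segments, 512-d video features, 128-d audio features
fused = concat_av(np.zeros((10, 512)), np.zeros((10, 128)))
print(fused.shape)  # (10, 640)
```

The point of concatenation (as opposed to, say, attention-based fusion) is that it keeps both modalities intact even when the speech language and caption language differ, letting the downstream model learn which modality to rely on.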

MALTA - STATS

Some graphical representations of our dataset and annotations

Word Cloud: English

Word Cloud: Marathi

Word Frequency: English

Code

The released code is a newer, improved version of the codebase from the MALTA paper. The numbers produced by this codebase are better than the ones reported in the paper.
Indian Institute of Technology, Bombay
Powai, Mumbai 400 076,
Maharashtra, India.
P: +91 (22) 2572 2545
F: +91 (22) 2572 3480
