Indian Language Corpora


We are entering an era in which users will take advantage of AI-driven technologies in all walks of life, interacting seamlessly with digital systems. But, as with any other technology, an important challenge is to make such technology accessible to people from all strata of society, across all countries and cultures. This challenge manifests itself mostly at the interface between humans and computers. Unfortunately, most languages in the world lack the linguistic resources needed to build the large, data-hungry neural models and systems that power such interfaces, and Indian languages largely fall in this bracket. In this collaborative effort spearheaded by Microsoft Research India, we present new Indian language datasets that will contribute to the development of applications and other core technologies, and in turn to the adoption of speech- and text-driven digital services in India and across the globe.


Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers

Abstract

Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology, whose dialects are often very different from those of low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to providing diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study where we collected labelled speech data in the Marathi language from three different user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work) and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.

Here is a link to the paper:
Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers

Download


Marathi Dataset
(click to download)
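
As a quick sanity check after downloading, the labelled speech data can be inspected with a few lines of Python. This is a minimal sketch only: the extraction directory, file names, and tab-separated transcript format below are assumptions made for illustration and may not match the actual layout of the released archive.

import csv
import wave
from pathlib import Path

DATA_DIR = Path("marathi_speech_data")       # hypothetical extraction directory
TRANSCRIPTS = DATA_DIR / "transcripts.tsv"   # hypothetical file: <utterance_id>\t<Marathi transcript>

def total_hours(audio_dir):
    """Sum the durations of all WAV files under audio_dir, in hours."""
    seconds = 0.0
    for wav_path in audio_dir.rglob("*.wav"):
        with wave.open(str(wav_path), "rb") as w:
            seconds += w.getnframes() / float(w.getframerate())
    return seconds / 3600.0

with open(TRANSCRIPTS, encoding="utf-8") as f:
    utterances = list(csv.reader(f, delimiter="\t"))

print(f"{len(utterances)} labelled utterances")
print(f"{total_hours(DATA_DIR):.1f} hours of audio")

If the full collection is released, the totals computed this way should roughly correspond to the 109 hours from 36 participants described in the paper.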


Reference

If you make use of this resource, please use the following citation:
@inproceedings{marathidata,
    Author = {Basil Abraham and Danish Goel and Divya Siddarth and Kalika Bali and Manu Chopra and Monojit Choudhury and Pratik Joshi and Preethi Jyothi and Sunayana Sitaram and Vivek Seshadri},
    Title = {Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers},
    Year = {2020},
    Booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC)},
    Pages = {2819--2826},
}


Natural Language Inference (NLI) for Code-mixed Conversations

Abstract

Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and a hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-sourced hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis of the linguistic phenomena commonly observed in the resulting dataset. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.
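
The abstract above mentions a standard mBERT-based pipeline for NLI. The following is a minimal sketch of such a pipeline using the Hugging Face transformers library; the model name, the two-way label set, and the code-mixed example pair are illustrative assumptions, not the exact configuration used in the paper.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"   # mBERT
LABELS = ["entailment", "contradiction"]      # two-way NLI, as in the dataset

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

def predict(premise, hypothesis):
    """Encode a (premise, hypothesis) pair and return the predicted label."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Hypothetical Romanized Hindi-English example; the released premises are
# conversation snippets drawn from Bollywood scripts.
premise = "A: Kal party mein aa rahe ho? B: Haan, main zaroor aaunga."
hypothesis = "B ne party mein aane se mana kar diya."
print(predict(premise, hypothesis))   # arbitrary until the classification head is fine-tuned

In practice, the model would first be fine-tuned on the premise-hypothesis pairs with their gold labels before being evaluated on the dataset.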

Here is a page with more details about the dataset:
Bollywood Movie Dataset

Here are links to the papers:
GLUECoS: An Evaluation Benchmark for Code-Switched NLP
A New Dataset for Natural Language Inference from Code-mixed Conversations

Reference

If you make use of this resource, please use the following citations:
@inproceedings{2004.12376,
    Author = {Simran Khanuja and Sandipan Dandapat and Anirudh Srinivasan and Sunayana Sitaram and Monojit Choudhury},
    Title = {GLUECoS: An Evaluation Benchmark for Code-Switched NLP},
    Year = {2020},
    Booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)},
}

@inproceedings{2004.05051,
    Author = {Simran Khanuja and Sandipan Dandapat and Sunayana Sitaram and Monojit Choudhury},
    Title = {A New Dataset for Natural Language Inference from Code-mixed Conversations},
    Year = {2020},
    Booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC)},
}