Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enables the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation via a human study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained from crowd workers who are native Hindi speakers.
The All-CS data is partitioned into two subsets: MovieCS and TreebankCS.
MovieCS consists of conversational Hindi-English CS text extracted from 30 publicly available contemporary Bollywood scripts. We employed a professional annotation company to back-transliterate the Romanized Hindi words in the dataset into Devanagari, and further asked them to provide monolingual translations for all these sentences. The code-switched variations of these sentences were collected via Amazon's Mechanical Turk platform.
TreebankCS consists of monolingual sentences from the publicly available Hindi Dependency Treebank, which provides dependency parses for each sentence. The code-switched annotations for these sentences were also collected via the Mechanical Turk platform, with the additional instruction to the Turkers to switch at least one dependency-parsed chunk to English. This led to longer spans of English segments within each sentence, as shown in the following figure.
Fig.1 - Distribution of overall sentence lengths and of the lengths of contiguous English spans in MovieCS and TreebankCS.
Dataset Collection
Fig.2 depicts the portal used to collect data via Amazon’s Mechanical Turk platform. The collection was done in two rounds, first for MovieCS and then for TreebankCS. For TreebankCS, each sentence was first divided into dependency-parsed chunks and presented to the Turkers in this chunked form, as shown in Fig.2. They were required to switch at least one chunk in the sentence entirely to English, ensuring a longer span of English words in the resulting CS sentence. A suggestion box converted transliterated Hindi words into Devanagari and also provided English suggestions to aid the workers in completing their task. For MovieCS, since no chunk labels were associated with the sentences, they were simply tokenized into words.
Fig.2 - A snapshot of the web interface used to collect MovieCS and TreebankCS data via Amazon Mechanical Turk.
The All-CS.json file contains the All-CS data described in our work. Each entry in the file contains the following fields (a short loading sketch follows the list):
dataset - Either "moviecs" or "treebankcs", indicating whether the entry comes from the MovieCS or the TreebankCS subset.
id - The identifier of the entry; ids are assigned separately within the moviecs and treebankcs subsets.
split - The data split the entry belongs to: "train", "valid", or "test".
mono_raw - The original monolingual Hindi sentence.
mono - The monolingual sentence labelled with named-entity tags; a named entity is marked by a "/NE/" label appended to the end of the word.
mturk - The code-switched responses obtained from Amazon Mechanical Turk, labelled with named-entity tags.
gold - The gold code-switched response obtained from the professional annotation company; only available for the moviecs subset.
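As a quick illustration of the schema, here is a minimal loading sketch. It assumes All-CS.json is a single top-level JSON list of entry objects; adjust the loading step if the file is instead stored as one JSON object per line.

import json

# Load the dataset (assumes a top-level JSON list of entries).
with open("All-CS.json", encoding="utf-8") as f:
    entries = json.load(f)

# Collect the MovieCS training split.
moviecs_train = [e for e in entries
                 if e["dataset"] == "moviecs" and e["split"] == "train"]

def strip_ne_tags(sentence):
    # Remove the "/NE/" named-entity markers appended to tagged words.
    return " ".join(tok.replace("/NE/", "") for tok in sentence.split())

for entry in moviecs_train[:3]:
    print(entry["id"], strip_ne_tags(entry["mono"]))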
We follow the architecture from the UnsupervisedMT repository. The model comprises three stacked Transformer encoder layers and three decoder layers; two of each are shared across languages, and the remaining layer is private to each language. Monolingual Hindi (i.e. the source language) has its own private encoder and decoder layers, while English and Hindi-English CS text jointly use the remaining private encoder and decoder layers. In our model, the target language is either English or CS text.
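The sketch below illustrates the shared/private encoder layout described above. It is not the released code: the layer sizes and the ordering of private versus shared layers are illustrative assumptions, and the decoder (which mirrors this structure) is omitted.

import torch
import torch.nn as nn

D_MODEL, N_HEADS = 512, 8  # illustrative hyperparameters

def make_layer():
    return nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS)

# Two encoder layers shared by all languages.
shared = nn.ModuleList([make_layer(), make_layer()])
# One private encoder layer per language group.
private = nn.ModuleDict({
    "hi": make_layer(),     # private to monolingual Hindi (source)
    "en_cs": make_layer(),  # jointly used by English and Hindi-English CS
})

def encode(x, lang):
    # x has shape (seq_len, batch, D_MODEL); lang is "hi" or "en_cs".
    h = private[lang](x)    # language-specific layer first (an assumption)
    for layer in shared:
        h = layer(h)        # then the two shared layers
    return h

# Example: encode a dummy Hindi batch of 10 tokens.
out = encode(torch.randn(10, 2, D_MODEL), "hi")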
Fig.3 shows the overall architecture of our model.
Reference
If you use this resource, please cite:
@inproceedings{tarunesh-etal-2021-acl,
    author = {Tarunesh, Ishan and Kumar, Syamantak and Jyothi, Preethi},
    title = {From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text},
    year = {2021},
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)},
}