From Machine Translation to Code-Switching

Generating High-Quality Code-Switched Text


Abstract

Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enable the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.

Paper


Please find the paper at the following link: From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

Dataset Description


All-CS data is partitioned into two subsets: MovieCS and TreebankCS.

MovieCS consists of conversational Hindi-English CS text extracted from 30 publicly available contemporary Bollywood scripts. We employed a professional annotation company to convert the Romanized Hindi words in the dataset into their back-transliterated Devanagari forms, and further asked them to provide monolingual translations for all of these sentences. The code-switched variations of these sentences were collected via Amazon's Mechanical Turk platform.

TreebankCS consists of monolingual sentences from the publicly available Hindi Dependency Treebank, which contains dependency parses. The code-switched annotations for these sentences were also collected via the Mechanical Turk platform, with additional instructions to the Turkers to switch at least one dependency-parsed chunk to English. This led to longer spans of English segments within each sentence, as is visible in the following figure.

Fig.1 - Distribution across overall sentence lengths and distribution across lengths of continuous English spans in MovieCS and TreebankCS.


Dataset Collection


Fig.2 depicts the portal used to collect data using Amazon's Mechanical Turk platform. The collection was done in two rounds, first for MovieCS and then for TreebankCS. For TreebankCS, the sentences were first divided into chunks, and the Turkers were shown each sentence grouped into these chunks, as in Fig 2. They were required to switch at least one chunk in the sentence entirely to English, so as to ensure a longer span of English words in the resulting CS sentence. A suggestion box converted transliterated Hindi words into Devanagari and also provided English suggestions to aid the workers in completing their task. For MovieCS, since there were no chunk labels associated with the sentences, they were tokenized into words.
Fig.2 - A snapshot of the web interface used to collect MovieCS and TreebankCS data via Amazon Mechanical Turk.


Download


All-CS Dataset Download

The All-CS.json file contains the All-CS data as described in our work.
Each entry in the file contains the following fields:
  1. dataset - Either "moviecs" or "treebankcs", signifying which subset the entry comes from.
  2. id - The identifier of the entry. Ids are assigned separately for the moviecs and treebankcs subsets.
  3. split - The data split the entry belongs to: "train", "valid" or "test".
  4. mono_raw - The original monolingual sentence.
  5. mono - The monolingual sentence labelled with named-entity tags. Named entities are identified by a "/NE/" label appended to the end of the word.
  6. mturk - The responses obtained from Amazon Mechanical Turk, labelled with named-entity tags.
  7. gold - The gold response obtained from the professional annotation company. It is only available for the moviecs subset.
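As a sketch of how the file might be consumed, the snippet below builds one hypothetical entry with these fields and shows how to select a split and strip the "/NE/" tags. The entry values, the id format, and the assumption that All-CS.json holds a list of such entries are illustrative, not taken from the release.

```python
import json

# A hypothetical All-CS entry, mirroring the field description above.
# The values and id format are illustrative; real entries come from All-CS.json.
entry = {
    "dataset": "moviecs",
    "id": "mcs_0001",
    "split": "train",
    "mono_raw": "राम कल दिल्ली जा रहा है",
    "mono": "राम/NE/ कल दिल्ली/NE/ जा रहा है",
    "mturk": ["राम/NE/ कल Delhi/NE/ जा रहा है"],
    "gold": "राम/NE/ कल Delhi/NE/ जा रहा है",
}

def strip_ne_tags(sentence):
    """Remove the "/NE/" labels appended to named-entity words."""
    return sentence.replace("/NE/", "")

def load_split(path, split):
    """Read All-CS.json (assumed to be a list of entries) and keep one split."""
    with open(path, encoding="utf-8") as f:
        return [e for e in json.load(f) if e["split"] == split]

print(strip_ne_tags(entry["mono"]))  # राम कल दिल्ली जा रहा है
```

Stripping the "/NE/" tags recovers plain text for language modeling, while keeping the tagged version lets you exclude named entities when counting switch points.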


Code


Link to repository

We follow the architecture from the UnsupervisedMT repository. The model comprises three stacked Transformer encoder layers and three decoder layers, two of which are shared and the remaining one of which is private to each language. Monolingual Hindi (i.e., the source language) has its own private encoder and decoder layers, while English and Hindi-English CS text jointly use the remaining private encoder and decoder layers. In our model, the target language is either English or CS text.
Fig.3 shows the overall architecture of our model.
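As a rough illustration (not the authors' code), the sharing of encoder layers between languages could be sketched as follows; the layer names, the private-then-shared ordering, and the routing function are all assumptions for exposition.

```python
# Illustrative sketch of the shared/private encoder layout described above.
# Layer names and the private-first ordering are assumptions, not the
# actual UnsupervisedMT implementation.
SHARED = ["enc_shared_1", "enc_shared_2"]  # two layers shared by all languages

PRIVATE = {
    "hi": "enc_private_hi",        # private to monolingual Hindi (the source)
    "en_cs": "enc_private_en_cs",  # jointly used by English and Hindi-English CS
}

def encoder_stack(lang):
    """Return the ordered encoder layers used to encode a given language."""
    key = "hi" if lang == "hi" else "en_cs"
    return [PRIVATE[key]] + SHARED

print(encoder_stack("cs"))  # ['enc_private_en_cs', 'enc_shared_1', 'enc_shared_2']
```

The decoder side mirrors this routing: sharing the English and CS private layers is what lets supervision from Hindi-English translation transfer to CS generation.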


Reference

If you use this resource, please cite:
@inproceedings{tarunesh-etal-2021-acl,
    author = {Ishan Tarunesh and Syamantak Kumar and Preethi Jyothi},
    title = {From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text},
    year = {2021},
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)},
}