CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation


Abstract

Code-switching has seen growing interest in recent years as an important multilingual NLP phenomenon. Generating code-switched (CS) text as a means of data augmentation has been extensively explored. However, there is no prior work on generating code-switched text with fine-grained control over the degree of code-switching and the lexical choices used to convey formality. We present CoCoa, an encoder-decoder translation model that converts monolingual Hindi text to Hindi-English code-switched text, with both encoder-side and decoder-side interventions enabling fine-grained controllable generation. CoCoa can be invoked at test time to synthesize code-switched text that is simultaneously faithful to the syntactic and lexical attributes relevant to code-switching. CoCoa outputs were subjected to rigorous subjective and objective evaluations. Human evaluations establish that our outputs are of superior quality while being faithful to the desired attributes. Against human-generated CS references, our outputs achieve significantly improved BLEU scores. Compared to competitive baselines, we show a 10% reduction in perplexity on a language modeling task and also demonstrate clear improvements on a downstream code-switched sentiment analysis task.
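The "degree of code-switching" that CoCoa controls is commonly quantified in the literature with the Code-Mixing Index (CMI) of Gambäck and Das. The metric is not defined on this page, so the sketch below is an illustrative assumption about how such a degree can be measured from per-token language tags, not necessarily CoCoa's exact control signal.

```python
from collections import Counter

def cmi(token_langs):
    """Code-Mixing Index (Gambäck & Das, 2014) for one sentence.

    token_langs: per-token language tags, e.g. "hi", "en", or "other"
    for language-independent tokens (named entities, punctuation).
    Returns 0 for a monolingual sentence and approaches 100 as the
    two languages are mixed more evenly.
    """
    # Drop language-independent tokens (the "u" term in the CMI formula).
    tagged = [t for t in token_langs if t != "other"]
    if not tagged:
        return 0.0
    # Count of the dominant (matrix) language's tokens: max_i t_i.
    dominant = Counter(tagged).most_common(1)[0][1]
    return 100.0 * (1.0 - dominant / len(tagged))

# A lightly mixed Hindi-English sentence vs. a monolingual one:
print(cmi(["hi", "hi", "en", "hi", "en"]))  # 40.0
print(cmi(["hi", "hi", "hi"]))              # 0.0
```

A controllable generator can then condition on a target CMI bucket (e.g. low/medium/high mixing) rather than on the raw score.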

Paper


Dataset Description


Please click on the following link to download the Diverse-CS dataset: Download dataset

Reference

If you use this resource, please cite:
@inproceedings{mondal-etal-2022-emnlp,
    author = {Mondal, Sneha and Ritika and Pathak, Shreya and Jyothi, Preethi and Raghuveer, Aravindan},
    title = {CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation},
    year = {2022},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}