Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique, AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with the state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.
@inproceedings{gupta-etal-2025-amps,title={{AMPS}: {ASR} with Multimodal Paraphrase Supervision},author={Gupta, Abhishek and Parulekar, Amruta and Chattopadhyay, Sameep and Jyothi, Preethi},booktitle={Proceedings of NAACL},year={2025},pages={404--413},}
COLING
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving
Bhavani Shankar P S V N, Preethi Jyothi, and Pushpak Bhattacharyya
Code-switching is a widely prevalent linguistic phenomenon in multilingual societies like India. Building speech-to-text models for code-switched speech is challenging due to the limited availability of datasets. In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture CoSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules (which are more widely available for many languages). Speech and ASR text representations are fused using an aligned interleaving scheme and are fed further as input to a pretrained MT module; the whole pipeline is then trained end-to-end for spoken translation using synthetically created ST data. We also release a new evaluation benchmark for code-switched Bengali-English, Hindi-English, Marathi-English and Telugu-English speech to English text. CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
2024
NeurIPS
WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models
*Pavan Kalyan Tankala, *Piyush Pasi, Sahil Dharod, and 4 more authors
In Proceedings of NeurIPS (Datasets and Benchmarks Track), 2024
Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). Several state-of-the-art VLMs (e.g., CLIP, BLIP-2) have achieved near-perfect performance on widely used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. As a measure of out-of-distribution (OOD) generalization, prior works rely on zero-shot performance evaluated on one dataset (Flickr) using a VLM finetuned on another one (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models due to the high visual and linguistic similarity between the evaluation and finetuning datasets. To address this gap, we introduce WikiDO (drawn from Wikipedia Diversity Observatory), a novel cross-modal retrieval benchmark to assess the OOD generalization capabilities of pretrained VLMs. It consists of 380K newly scraped image-text pairs from Wikipedia with domain labels, along with a carefully curated, human-verified (a) in-distribution (ID) test set (3K) and (b) OOD test set (3K). The image-text pairs are very diverse in topics and geographical locations. We evaluate different VLMs of varying capacity on the WikiDO benchmark; BLIP-2 achieves zero-shot performance of R@1 ≈66% on the OOD test set, compared to ≈81% on COCO and ≈95% on Flickr. When fine-tuned on WikiDO, the R@1 improvement is at most ≈5% on OOD instances compared to ≈12% on ID instances. We probe the VLMs with varying finetuning objectives and datasets of varying sizes to identify what aids OOD generalization the most. Our results confirm that WikiDO offers a strong cross-modal benchmark for current VLMs, specifically for evaluating OOD generalization. Our benchmark is hosted as a competition at https://kaggle.com/competitions/wikido24 with public access to the dataset and code.
@inproceedings{tankala2024wikido,title={WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models},author={Tankala, Pavan Kalyan and Pasi, Piyush and Dharod, Sahil and Motiwala, Azeem and Jyothi, Preethi and Chaudhary, Aditi and Srinivasan, Krishna},booktitle={Proceedings of NeurIPS (Datasets and Benchmarks Track)},pages={140812--140827},year={2024}}
Interspeech
SALSA: Speedy ASR-LLM Synchronous Aggregation
Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, and 1 more author
In Proceedings of Interspeech, 2024
This work was nominated for a Best Student Paper Award
Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR model to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training-efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
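As a rough, hedged sketch of the kind of coupling described in the abstract (not the exact SALSA architecture), the snippet below projects the last hidden state of an ASR decoder into a hypothetical LLM's hidden space and adds it to the LLM decoder state; the dimensions, module names and fusion point are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SynchronousCoupling(nn.Module):
    """Toy sketch: project the last ASR decoder state and add it to the
    LLM decoder's hidden state before its LM head (illustrative only)."""

    def __init__(self, asr_dim=512, llm_dim=4096):
        super().__init__()
        # the lightweight, trainable part: a single projection layer
        self.proj = nn.Linear(asr_dim, llm_dim)

    def forward(self, asr_last_state, llm_hidden):
        # asr_last_state: (batch, asr_dim) from the ASR decoder's final layer
        # llm_hidden:     (batch, llm_dim) from the (frozen) LLM decoder
        return llm_hidden + self.proj(asr_last_state)

# usage with random tensors standing in for real decoder states
fuse = SynchronousCoupling()
fused = fuse(torch.randn(2, 512), torch.randn(2, 4096))
print(fused.shape)  # torch.Size([2, 4096])
```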
@inproceedings{mittal2024salsa,title={SALSA: Speedy ASR-LLM Synchronous Aggregation},author={Mittal, Ashish and Prabhu, Darshan and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={3485--3489},year={2024},}
Interspeech
Emotion arithmetic: Emotional speech synthesis via weight space interpolation
Pavan Kalyan, Preeti Rao, Preethi Jyothi, and 1 more author
While the idea of task arithmetic has been shown to be useful for steering the behaviour of neural models on NLP and vision tasks, it has not yet been used for speech. Moreover, the tasks studied have been restricted to text classification and generation, and image classification. We extend the idea of task vectors to emotional speech synthesis in this work. We build emotion vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning for a given emotion. These emotion vectors can be modified or combined through arithmetic operations such as negation and addition, with the hope of steering the behaviour of the resulting model accordingly in the generation of emotional speech. We also show that the emotion vector can achieve the desired transfer of emotion to a speaker not seen during training.
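The weight-space arithmetic described above can be illustrated with a short sketch; the toy two-parameter "model" and the 0.5 scaling factor are hypothetical placeholders standing in for a full TTS model's state dict.

```python
import torch

def emotion_vector(pretrained_state, finetuned_state):
    """Emotion vector = finetuned weights - pretrained weights (per parameter)."""
    return {k: finetuned_state[k] - pretrained_state[k] for k in pretrained_state}

def apply_vectors(pretrained_state, vectors, scales):
    """Add scaled emotion vectors (a negative scale acts as negation) to the base weights."""
    new_state = {k: v.clone() for k, v in pretrained_state.items()}
    for vec, scale in zip(vectors, scales):
        for k in new_state:
            new_state[k] += scale * vec[k]
    return new_state

# toy example with a two-parameter "model"
base = {"w": torch.zeros(3), "b": torch.zeros(1)}
happy = {"w": torch.ones(3), "b": torch.ones(1)}    # pretend: finetuned on "happy" speech
vec = emotion_vector(base, happy)
steered = apply_vectors(base, [vec], scales=[0.5])  # half-strength "happy" model
print(steered["w"])                                  # all entries are 0.5
```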
@inproceedings{kalyan2024emotion,title={Emotion arithmetic: Emotional speech synthesis via weight space interpolation},author={Kalyan, Pavan and Rao, Preeti and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proc. Interspeech 2024},pages={1805--1809},year={2024},}
Interspeech
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
Darshan Prabhu, Yifan Peng, Preethi Jyothi, and 1 more author
Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been re-examined, altering the convolution module itself has been far less explored. Towards this, we introduce MULTI-CONVFORMER, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter-efficient. We empirically compare our approach with the Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
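As a hedged sketch of the general idea of multiple convolution kernels combined with gating (not the exact Multi-Convformer module), the code below runs parallel depthwise convolutions of different kernel sizes and mixes them with a learned sigmoid gate; the dimensions and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelConv(nn.Module):
    """Toy sketch: depthwise convolutions with several kernel sizes in
    parallel, combined with a learned per-frame sigmoid gate."""

    def __init__(self, dim=256, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.gate = nn.Linear(dim * len(kernel_sizes), len(kernel_sizes))

    def forward(self, x):                                   # x: (batch, time, dim)
        x = x.transpose(1, 2)                               # -> (batch, dim, time)
        outs = [conv(x) for conv in self.convs]
        stacked = torch.stack(outs, dim=-1)                 # (batch, dim, time, K)
        gate_in = torch.cat(outs, dim=1).transpose(1, 2)    # (batch, time, dim*K)
        gates = torch.sigmoid(self.gate(gate_in))           # (batch, time, K)
        gates = gates.unsqueeze(1)                          # (batch, 1, time, K)
        return (stacked * gates).sum(-1).transpose(1, 2)    # back to (batch, time, dim)

y = MultiKernelConv()(torch.randn(2, 100, 256))
print(y.shape)  # torch.Size([2, 100, 256])
```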
@inproceedings{prabhu2024multi,title={{M}ulti-{C}onvformer: Extending Conformer with Multiple Convolution Kernels},author={Prabhu, Darshan and Peng, Yifan and Jyothi, Preethi and Watanabe, Shinji},booktitle={Proceedings of Interspeech 2024},pages={232--236},year={2024},}
Interspeech
Improving Self-supervised Pre-training using Accent-Specific Codebooks
Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, and 2 more authors
Speech accents present a serious challenge to the performance of state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems. Even with self-supervised learning and pre-training of ASR models, accent invariance is seldom achieved. In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. These learnable codebooks enable the model to capture accent-specific information during pre-training, which is further refined during ASR finetuning. On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches on both seen and unseen English accents, with up to 9% relative reduction in word error rate (WER).
@inproceedings{prabhu2024improving,title={Improving Self-supervised Pre-training using Accent-Specific Codebooks},author={Prabhu, Darshan and Gupta, Abhishek and Nitsure, Omkar and Jyothi, Preethi and Ganapathy, Sriram},booktitle={Proc. Interspeech 2024},pages={2310--2314},year={2024}}
ACL
In-context mixing (ICM): Code-mixed prompts for multilingual LLMs
Bhavani Shankar, Preethi Jyothi, and Pushpak Bhattacharyya
We introduce a simple and effective prompting technique called in-context mixing (ICM) for effective in-context learning (ICL) with multilingual large language models (MLLMs). With ICM, we modify the few-shot examples within ICL prompts to be intra-sententially code-mixed by randomly swapping content words in the target languages with their English translations. We observe that ICM prompts yield superior performance in NLP tasks such as disfluency correction, grammar error correction and text simplification that demand a close correspondence between the input and output sequences. Significant improvements are observed mainly for low-resource languages that are under-represented during the pretraining and finetuning of MLLMs. We present an extensive set of experiments to analyze when ICM is effective and what design choices contribute towards its effectiveness. ICM works consistently and significantly better than other prompting techniques across models of varying capacity such as mT0-XXL, BloomZ and GPT4.
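A minimal sketch of how an ICM-style code-mixed few-shot example could be produced is shown below; the tiny Hindi-English lexicon and the swap probability are hypothetical placeholders, not the paper's actual setup.

```python
import random

# hypothetical mini-lexicon mapping Hindi content words to English translations;
# in practice this could come from a bilingual dictionary or an MT system
HI_TO_EN = {"किताब": "book", "खरीदी": "bought", "आज": "today"}

def code_mix(sentence, swap_prob=0.5, lexicon=HI_TO_EN):
    """Randomly replace content words found in the lexicon with their English translations."""
    out = []
    for tok in sentence.split():
        if tok in lexicon and random.random() < swap_prob:
            out.append(lexicon[tok])
        else:
            out.append(tok)
    return " ".join(out)

demo = "मैंने आज किताब खरीदी"
print(code_mix(demo))  # one possible output: "मैंने today किताब bought" (depends on the random draws)
```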
@inproceedings{shankar2024context,title={In-context mixing ({ICM}): Code-mixed prompts for multilingual {LLM}s},author={Shankar, Bhavani and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of ACL},pages={4162--4176},year={2024},}
ACL
Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection
*Barah Fazili, *Ashish Agrawal, and Preethi Jyothi
Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote crosslingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using this teacher to label LLM generations and employ a set of simple data selection strategies that use the teacher’s label probabilities. Our data selection strategies help us identify a representative subset of diverse generations that help boost zero-shot accuracies while being efficient, in comparison to using all the LLM generations (without any subset selection). We also highlight other important design choices that affect cross-lingual performance, such as the use of translations of source data and which labels are best to use for the LLM generations. We observe significant performance gains across sentiment analysis and natural language inference tasks (of up to 7.13 absolute points, and 1.5 absolute points on average) across a number of target languages (Hindi, Marathi, Urdu, Swahili) and domains.
@inproceedings{fazili2024boosting,title={Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection},author={Fazili, Barah and Agrawal, Ashish and Jyothi, Preethi},booktitle={Proceedings of ACL (Findings)},pages={13406--13422},year={2024}}
ACL
Part-of-speech Tagging for Extremely Low-resource Indian Languages
Sanjeev Kumar, Preethi Jyothi, and Pushpak Bhattacharyya
Modern natural language processing (NLP) systems thrive when given access to large datasets. However, a large fraction of the world’s languages are not privy to such benefits due to sparse documentation and inadequate digital representation. This is especially true for Indian regional languages. As a first step towards expanding the reach of NLP technologies to extremely low-resource Indian languages, we present a new parallel part-of-speech (POS) evaluation dataset for Angika, Magahi, Bhojpuri and Hindi. Angika, Magahi and Bhojpuri, along with the more well-known Hindi, are all languages spoken in the Indian states of Bihar, Jharkhand and West Bengal. Ours is notably the first NLP resource, even for a shallow NLP task like POS-tagging, for Angika. We establish POS-tagging baselines using state-of-the-art multilingual pretrained language models (PLMs) finetuned on Hindi data, and show zero-shot evaluations on the other three languages. While all four languages use the same Devanagari script, pretrained tokenizers underperform in zero-shot settings on the three low-resource languages. We propose a simple look-back fix to address the tokenization challenge, yielding F1-score improvements of up to 8% on Angika, and show how it comes very close to an oracle setting when the underlying Hindi word is known (and can be accurately tokenized).
@inproceedings{kumar2024part,title={Part-of-speech Tagging for Extremely Low-resource Indian Languages},author={Kumar, Sanjeev and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of ACL (Findings)},pages={14422--14431},year={2024}}
ACL
DIMSIM: Distilled Multilingual Critics for Indic Text Simplification
Sneha Mondal, Ritika Ritika, Ashish Agrawal, and 2 more authors
Self-correction techniques have recently emerged as a promising framework to improve the quality of responses generated by large language models (LLMs). Few-shot prompted LLMs act as critics to produce feedback for an input, which is further fed to a refiner (also an LLM) to produce an output. However, these critique-refine steps require multiple expensive LLM calls. To circumvent this large inference cost, we borrow inspiration from prior work on knowledge distillation and propose the use of critique distillation to train critic models. These are smaller sequence-to-sequence models that are trained on input-critique pairs generated by an LLM. We focus on the problem of text simplification for three Indian languages: Hindi, Bengali and Marathi. This task is a good fit for self-correction style techniques; it also has not been systematically explored for Indian languages before. We train two separate critics that focus on lexical and structural complexity, and show that this is surprisingly more effective than using an LLM directly as a critic in both 0-shot and few-shot settings. We also show the benefits of training multilingual critics, as opposed to monolingual critics. Extensive human evaluations show that on average, raters find 80% of DIMSIM’s output to be simple and easy to read.
@inproceedings{mondal2024dimsim,title={DIMSIM: Distilled Multilingual Critics for Indic Text Simplification},author={Mondal, Sneha and Ritika, Ritika and Agrawal, Ashish and Jyothi, Preethi and Raghuveer, Aravindan},booktitle={Proceedings of ACL (Findings)},pages={16093--16109},year={2024}}
EACL
Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning
Ashish* Agrawal, Barah* Fazili, and Preethi Jyothi
Popular benchmarks (e.g., XNLI) used to evaluate cross-lingual language understanding consist of parallel versions of English evaluation sets in multiple target languages created with the help of professional translators. When creating such parallel data, it is critical to ensure high-quality translations for all target languages for an accurate characterization of cross-lingual transfer. In this work, we find that translation inconsistencies do exist and, interestingly, they disproportionately impact low-resource languages in XNLI. To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text across multiple target languages; relatively large gaps are indicative of translation errors. We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances in these two languages and finding poor agreement with the original English labels these instances were supposed to inherit.
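The gap-based diagnostic described above can be sketched in a few lines; the model, dataset and metric names in the commented usage are hypothetical placeholders.

```python
def translation_gap(model, human_test, machine_test, metric):
    """Zero-shot performance gap between human- and machine-translated test sets;
    a relatively large gap for a language flags possible translation errors."""
    return metric(model, human_test) - metric(model, machine_test)

# hypothetical usage: `score` returns the accuracy of an English-finetuned model
# evaluated zero-shot on a target-language test set
# gaps = {lang: translation_gap(nli_model, xnli_human[lang], xnli_mt[lang], score)
#         for lang in ["hi", "ur", "sw", "fr"]}
# languages with unusually large gaps are candidates for manual re-annotation
```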
@inproceedings{agrawal2024translation,title={Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning},author={Agrawal, Ashish and Fazili, Barah and Jyothi, Preethi},booktitle={Proceedings of EACL},pages={319--329},year={2024}}
EACL
STORiCo: Storytelling TTS for Hindi with Character Voice Modulation
Pavan Tankala, Preethi Jyothi, Preeti Rao, and 1 more author
We present a new Hindi text-to-speech (TTS) dataset and demonstrate its utility for the expressive synthesis of children’s audio stories. The dataset comprises narration by a single female speaker who modifies her voice to produce different story characters. Annotations for dialogue identification, character labelling, and character attribution are provided, all of which are expected to facilitate the learning of character voices and speaking styles. Experiments are conducted using different versions of the annotated dataset that enable training a multi-speaker TTS model on the single-speaker data. Subjective tests show that the multi-speaker model improves expressiveness and character voice consistency compared to the baseline single-speaker TTS. With the multi-speaker model, objective evaluations show comparable word error rates, better speaker voice consistency, and higher correlations with ground-truth emotion attributes. We release a new 16.8-hour storytelling speech dataset in Hindi and propose effective solutions for expressive TTS with narrator voice modulation and character voice consistency.
@inproceedings{tankala2024storico,title={STORiCo: Storytelling TTS for Hindi with Character Voice Modulation},author={Tankala, Pavan and Jyothi, Preethi and Rao, Preeti and Bhattacharyya, Pushpak},booktitle={Proceedings of EACL},pages={426--431},year={2024}}
2023
ICLR
In-situ text-only adaptation of speech models with low-overhead speech imputations
Ashish Mittal, Sunita Sarawagi, and Preethi Jyothi
Fast and accurate adaptation of automatic speech recognition (ASR) systems using only text data in the target domain is a problem of long-standing practical relevance. Text-only adaptation was easy in traditional cascaded ASR systems with completely decoupled acoustic and language models. Recently, the RNN-Transducer (RNN-T) has emerged as a default ASR model because of its high accuracy, low latency, and capability of supporting streaming input. However, text-only adaptation of the RNN-T model is significantly more challenging due to its tight integration of acoustic and language models and end-to-end training. Existing recent approaches for text-only adaptation of RNN-Ts either entail significant modification to the network or introduce high latency during decoding. We propose a new approach (TOLSTOI) that imputes speech representations internal to a baseline RNN-T, starting from text-only inputs, and performs in-situ adaptation that results in higher adaptation accuracy without any runtime overheads during decoding. Our imputation model is a function of the labeled data and trained parameters of the ASR model, and, as we show, is more effective in controlling catastrophic forgetting compared to existing methods. We establish the effectiveness of TOLSTOI using three target domains and two ASR models of varying complexity. We yield up to 35% relative reduction in word error rate with text-only adaptation while forgetting the least compared to existing adaptation approaches. Our method is easy to implement and can be harnessed on existing RNN-T models without requiring ASR model training from scratch.
@inproceedings{mittal2023situ,title={In-situ text-only adaptation of speech models with low-overhead speech imputations},author={Mittal, Ashish and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of ICLR},year={2023}}
ICASSP
Towards zero-shot code-switched speech recognition
Brian Yan, Matthew Wiesner, Ondřej Klejch, and 2 more authors
In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation. That is, each monolingual module has to simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages – not a trivial task. We propose to simplify each monolingual module by allowing them to transcribe all speech segments indiscriminately with a monolingual script (i.e. transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on the Mandarin-English SEAME test sets.
@inproceedings{yan2023towards,title={Towards zero-shot code-switched speech recognition},author={Yan, Brian and Wiesner, Matthew and Klejch, Ond{\v{r}}ej and Jyothi, Preethi and Watanabe, Shinji},booktitle={Proceedings of ICASSP},pages={1--5},year={2023},}
IJCAI
Temporally aligning long audio interviews with questions: a case study in multimodal data integration
Piyush Singh Pasi, Karthikeya Battepati, Preethi Jyothi, and 3 more authors
The problem of audio-to-text alignment has seen a significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from a questionnaire that is used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on the temporal ordering of sentences to learn speech embeddings that capture the semantics of the underlying spoken text. These learnt embeddings are used to retrieve the corresponding audio segment based on text queries at inference time. We empirically demonstrate the significant effectiveness (improvement in R-avg of about 3%) of our model over those obtained using text-based heuristics. We also show how noisy ASR, generated using state-of-the-art ASR models for Indian languages, yields better results when used in place of speech. INDENT, trained only on Hindi data, is able to cater to all languages supported by the (semantically) shared text space. We illustrate this empirically on 11 Indic languages.
@inproceedings{pasi2023temporally,title={Temporally aligning long audio interviews with questions: a case study in multimodal data integration},author={Pasi, Piyush Singh and Battepati, Karthikeya and Jyothi, Preethi and Ramakrishnan, Ganesh and Mahapatra, Tanmay and Singh, Manoj},booktitle={Proceedings of IJCAI},pages={6156--6164},year={2023},}
ACL
Improving pretraining techniques for code-switched NLP
*Richeek Das, *Sahasra Ranjan, Shreya Pathak, and 1 more author
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
This work received an Outstanding Paper Award
Pretrained models are a mainstay in modern NLP applications. Pretraining requires access to large volumes of unlabeled text. While monolingual text is readily available for many of the world’s languages, access to large quantities of code-switched text (i.e., text with tokens of multiple languages interspersed within a sentence) is much more scarce. Given this resource constraint, the question of how pretraining using limited amounts of code-switched text could be altered to improve performance for code-switched NLP becomes important to tackle. In this paper, we explore different masked language modeling (MLM) pretraining techniques for code-switched text that are cognizant of language boundaries prior to masking. The language identity of the tokens can either come from human annotators, trained language classifiers, or simple relative frequency-based estimates. We also present an MLM variant by introducing a residual connection from an earlier layer in the pretrained model that uniformly boosts performance on downstream tasks. Experiments on two downstream tasks, Question Answering (QA) and Sentiment Analysis (SA), involving four code-switched language pairs (Hindi-English, Spanish-English, Tamil-English, Malayalam-English) yield relative improvements of up to 5.8 and 2.7 F1 scores on QA (Hindi-English) and SA (Tamil-English), respectively, compared to standard pretraining techniques. To understand our task improvements better, we use a series of probes to study what additional information is encoded by our pretraining techniques and also introduce an auxiliary loss function that explicitly models language identification to further aid the residual MLM variants.
@inproceedings{das2023improving,title={Improving pretraining techniques for code-switched {NLP}},author={Das, Richeek and Ranjan, Sahasra and Pathak, Shreya and Jyothi, Preethi},booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},pages={1176--1191},year={2023}}
ACL
Zero-shot cross-lingual transfer with learned projections using unlabeled target-language data
Ujan Deb, Ridayesh Parab, and Preethi Jyothi
Adapters have emerged as a parameter-efficient Transformer-based framework for cross-lingual transfer by inserting lightweight language-specific modules (language adapters) and task-specific modules (task adapters) within pretrained multilingual models. Zero-shot transfer is enabled by pairing the language adapter in the target language with an appropriate task adapter in a source language. If our target languages are known a priori, we explore how zero-shot transfer can be further improved within the adapter framework by utilizing unlabeled text during task-specific finetuning. We construct language-specific subspaces using standard linear algebra constructs and selectively project source-language representations into the target language subspace during task-specific finetuning using two schemes. Our experiments on three cross-lingual tasks, Named Entity Recognition (NER), Question Answering (QA) and Natural Language Inference (NLI), yield consistent benefits compared to adapter baselines over a wide variety of target languages, with up to 11% relative improvement in NER, 2% relative improvement in QA and 5% relative improvement in NLI.
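A minimal sketch of one way to build a language-specific subspace from unlabeled target-language representations and project source-language representations into it is given below; the SVD-based construction, rank and dimensions are illustrative assumptions, not necessarily the paper's exact schemes.

```python
import torch

def language_subspace(target_reprs, rank=64):
    """Top singular directions of centered, unlabeled target-language
    representations define an (assumed) target-language subspace."""
    centered = target_reprs - target_reprs.mean(0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:rank]                          # (rank, hidden_dim)

def project(source_reprs, basis):
    """Project source-language representations onto the target subspace."""
    return (source_reprs @ basis.T) @ basis

# toy usage with random vectors standing in for encoder outputs
tgt = torch.randn(1000, 768)                  # unlabeled target-language sentences
src = torch.randn(32, 768)                    # a source-language training batch
basis = language_subspace(tgt)
src_in_tgt = project(src, basis)
print(src_in_tgt.shape)                       # torch.Size([32, 768])
```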
@inproceedings{deb2023zero,title={Zero-shot cross-lingual transfer with learned projections using unlabeled target-language data},author={Deb, Ujan and Parab, Ridayesh and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={449--457},year={2023}}
ACL
DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation
*Suraj Kothawade, *Anmol Mekala, D Chandra Sekhara Hetha Havya, and 4 more authors
State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and Fair Targeted Subset Selection) that uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e. it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that compared to other speech selection methods, DITTO is 3-5 times as label-efficient for its improvements on the Indic-TTS and L2 datasets.
@inproceedings{kothawade2023ditto,title={{DITTO}: Data-efficient and Fair Targeted Subset Selection for {ASR} Accent Adaptation},author={Kothawade, Suraj and Mekala, Anmol and Havya, D Chandra Sekhara Hetha and Kothyari, Mayank and Iyer, Rishabh and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={5810--5822},year={2023}}
Interspeech
Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS
Tankala Pavan Kalyan, Preeti Rao, Preethi Jyothi, and 1 more author
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice across the narrator and story characters. To address these challenges, we present a new TTS corpus of English audio stories for children with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of the salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as being spoken by the narrator or a character, which is subsequently used to condition the TTS output. Experiments show our new TTS system is superior in expressiveness in both A-B preference and MOS testing compared to reading-style TTS and single-speaker TTS.
@inproceedings{pavan2023narrator,title={Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS},author={Pavan Kalyan, Tankala and Rao, Preeti and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of Interspeech},pages={4808--4812},year={2023},}
Interspeech
Improving RNN-Transducers with Acoustic LookAhead
Vinit S Unni, Ashish Mittal, Preethi Jyothi, and 1 more author
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech-to-text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to strong LM biasing, which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LOOKAHEAD, which makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
@inproceedings{unni2023improving,title={Improving RNN-Transducers with Acoustic LookAhead},author={Unni, Vinit S and Mittal, Ashish and Jyothi, Preethi and Sarawagi, Sunita},booktitle={Proceedings of Interspeech},pages={4419--4423},year={2023},}
Interspeech
Unsupervised Code-switched Text Generation from Parallel Text
Jie Chi, Brian Lu, Jason Eisner, and 3 more authors
There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. It is expensive and difficult to collect real transcribed spoken CS data due to the challenges of finding and identifying CS data in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require the existence of CS data during training, or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR.
@inproceedings{chi2023unsupervised,title={Unsupervised Code-switched Text Generation from Parallel Text},author={Chi, Jie and Lu, Brian and Eisner, Jason and Bell, Peter and Jyothi, Preethi and Ali, Ahmed M},booktitle={Proc. Interspeech 2023},pages={1419--1423},year={2023},}
EMNLP
Accented Speech Recognition With Accent-specific Codebooks
Darshan Prabhu, Preethi Jyothi, Sriram Ganapathy, and 1 more author
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems. Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR. In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. These learnable codebooks capture accent-specific information and are integrated within the ASR encoder layers. The model is trained on accented English speech, while the test data also contains accents that were not seen during training. On the Mozilla Common Voice multi-accented dataset, we show that our proposed approach yields significant performance gains not only on the seen English accents (up to 37% relative improvement in word error rate) but also on the unseen accents (up to 5% relative improvement in WER). Further, we illustrate benefits for a zero-shot transfer setup on the L2-Arctic dataset. We also compare the performance with other approaches based on accent adversarial training.
@inproceedings{prabhu2023accented,title={Accented Speech Recognition With Accent-specific Codebooks},author={Prabhu, Darshan and Jyothi, Preethi and Ganapathy, Sriram and Unni, Vinit},booktitle={Proceedings of EMNLP},pages={7175--7188},year={2023}}
EMNLP
Speech-enriched memory for inference-time adaptation of ASR models to word dictionaries
Ashish Mittal, Sunita Sarawagi, Preethi Jyothi, and 2 more authors
Despite the impressive performance of ASR models on mainstream benchmarks, their performance on rare words is unsatisfactory. In enterprise settings, often a focused list of entities (such as locations, names, etc.) is available, which can be used to adapt the model to the terminology of specific domains. In this paper, we present a novel inference algorithm that improves the prediction of state-of-the-art ASR models using nearest-neighbor-based matching on an inference-time word list. We consider both the Transducer architecture that is useful in the streaming setting, and state-of-the-art encoder-decoder models such as Whisper. In our approach, a list of rare entities is indexed in a memory by synthesizing speech for each entry, and then storing the internal acoustic and language model states obtained from the best possible alignment on the ASR model. The memory is organized as a trie which we harness to perform a stateful lookup during inference. A key property of our extension is that we prevent spurious matches by restricting to only word-level matches. In our experiments on publicly available datasets and private benchmarks, we show that our method is effective in significantly improving rare word recognition.
@inproceedings{mittal2023speech,title={Speech-enriched memory for inference-time adaptation of asr models to word dictionaries},author={Mittal, Ashish and Sarawagi, Sunita and Jyothi, Preethi and Saon, George and Kurata, Gakuto},booktitle={Proceedings of EMNLP},pages={14820--14835},year={2023}}
EMNLP
DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages
Vineet Bhat, Preethi Jyothi, and Pushpak Bhattacharyya
Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs, before subsequent processing by downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French. We provide extensive analysis of the results of state-of-the-art DC models across all four languages, obtaining F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to a 5.65-point increase in BLEU scores on average when used in conjunction with a state-of-the-art Machine Translation (MT) system.
@inproceedings{bhat2023disco,title={{DISCO}: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages},author={Bhat, Vineet and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of EMNLP (Findings)},pages={12833--12857},year={2023}}
ICLR (Workshop)
Surprisingly Simple Adapter Ensembling for Zero-shot Cross-lingual Sequence Tagging
Rohan Shah and Preethi Jyothi
Adapters are parameter-efficient modules added to pretrained Transformer models that facilitate cross-lingual transfer. Language adapters and task adapters can be separately trained and zero-shot transfer is enabled by pairing the language adapter in the target language with a task adapter trained on a high-resource language. However, there are many languages and dialects for which training language adapters would be difficult. In this work, we present a simple and efficient ensembling technique to transfer task knowledge to unseen target languages for which no language adapters exist. We compute a uniformly-weighted ensemble model over the top language adapters based on how well they perform on the test set of a high-resource language. We outperform the state-of-the-art model for this specific setting on named entity recognition (NER) and part-of-speech tagging (POS), across nine typologically diverse languages with relative performance improvements of up to 29% and 9% on NER and POS, respectively, on select target languages.
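A minimal sketch of uniformly averaging the parameters of the top-scoring language adapters (with hypothetical adapter names, scores and parameter shapes) is shown below.

```python
import torch

def ensemble_adapters(adapter_states, scores, top_k=3):
    """Uniformly average the parameters of the top-k language adapters,
    ranked by their score on a high-resource test set."""
    ranked = sorted(adapter_states, key=lambda name: scores[name], reverse=True)[:top_k]
    keys = adapter_states[ranked[0]].keys()
    return {k: torch.stack([adapter_states[n][k] for n in ranked]).mean(0) for k in keys}

# toy usage: three "adapters", each with one weight matrix, scored on English dev data
adapters = {name: {"down.weight": torch.randn(48, 768)} for name in ["en", "hi", "ar"]}
scores = {"en": 0.91, "hi": 0.78, "ar": 0.74}
merged = ensemble_adapters(adapters, scores, top_k=2)
print(merged["down.weight"].shape)  # torch.Size([48, 768])
```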
@inproceedings{shahsurprisingly,title={Surprisingly Simple Adapter Ensembling for Zero-shot Cross-lingual Sequence Tagging},author={Shah, Rohan and Jyothi, Preethi},booktitle={Practical ML for Developing Countries Workshop, ICLR 2023},year={2023},}
2022
ACL
Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding
Soumya Chatterjee, Sunita Sarawagi, and Preethi Jyothi
Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation, where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve consistent drops in alignment error rates. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU, specifically around the constrained positions.
@inproceedings{chatterjee2022accurate,title={Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding},author={Chatterjee, Soumya and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={6675--6689},year={2022}}
COLING
Aligning multilingual embeddings for improved code-switched natural language understanding
Barah Fazili and Preethi Jyothi
Multilingual pretrained models, while effective on monolingual data, need additional training to work well with code-switched text. In this work, we present a novel idea of training multilingual models with alignment objectives using parallel text so as to explicitly align word representations with the same underlying semantics across languages. Such an explicit alignment step has a positive downstream effect and improves performance on multiple code-switched NLP tasks. We explore two alignment strategies and report improvements of up to 7.32%, 0.76% and 1.9% on Hindi-English Sentiment Analysis, Named Entity Recognition and Question Answering tasks compared to a competitive baseline model.
@inproceedings{fazili2022aligning,title={Aligning multilingual embeddings for improved code-switched natural language understanding},author={Fazili, Barah and Jyothi, Preethi},booktitle={Proceedings of COLING},pages={4268--4273},year={2022},}
COLING
Zero-shot disfluency detection for Indian languages
Rohit Kundu, Preethi Jyothi, and Pushpak Bhattacharyya
Disfluencies that appear in the transcriptions from automatic speech recognition systems tend to impair the performance of downstream NLP tasks. Disfluency correction models can help alleviate this problem. However, the unavailability of labeled data in low-resource languages impairs progress. We propose using a pretrained multilingual model, finetuned only on English disfluencies, for zero-shot disfluency detection in Indian languages. We present a detailed pipeline to synthetically generate disfluent text and create evaluation datasets for four Indian languages: Bengali, Hindi, Malayalam, and Marathi. Even in the zero-shot setting, we obtain F1 scores of 75 and higher on five disfluency types across all four languages. We also show the utility of synthetically generated disfluencies by evaluating on real disfluent text in Bengali, Hindi, and Marathi. Finetuning the multilingual model on additional synthetic Hindi disfluent text nearly doubles the number of exact matches and yields a 20-point boost in F1 scores when evaluated on real Hindi disfluent text, compared to training with only English disfluent text.
@inproceedings{kundu2022zero,title={Zero-shot disfluency detection for Indian languages},author={Kundu, Rohit and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of COLING},pages={4442--4454},year={2022},}
EMNLP
CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation
Sneha Mondal, Shreya Pathak, Preethi Jyothi, and 2 more authors
Code-switching has seen growing interest in recent years as an important multilingual NLP phenomenon. Generating code-switched text for data augmentation has been sufficiently well-explored. However, there is no prior work on generating code-switched text with fine-grained control on the degree of code-switching and the lexical choices used to convey formality. We present CoCoa, an encoder-decoder translation model that converts monolingual Hindi text to Hindi-English code-switched text with both encoder-side and decoder-side interventions to achieve fine-grained controllable generation. CoCoa can be invoked at test-time to synthesize code-switched text that is simultaneously faithful to syntactic and lexical attributes relevant to code-switching. CoCoa outputs were subjected to rigorous subjective and objective evaluations. Human evaluations establish that our outputs are of superior quality while being faithful to desired attributes. We show significantly improved BLEU scores when compared with human-generated code-switched references. Compared to competitive baselines, we show 10% reduction in perplexity on a language modeling task and also demonstrate clear improvements on a downstream code-switched sentiment analysis task.
@inproceedings{mondal2022cocoa,title={CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation},author={Mondal, Sneha and Pathak, Shreya and Jyothi, Preethi and Raghuveer, Aravindan and others},booktitle={Proceedings of EMNLP},pages={2466--2479},year={2022},}
EMNLP
Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training
Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, and 2 more authors
Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of training data could mitigate this problem if the subset selected could achieve on-par performance with training with the entire dataset. Although there are many data subset selection (DSS) algorithms, direct application to the RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, as RNN-Ts tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-Ts. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves between 3x and 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
@inproceedings{mittal2022partitioned,title={Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training},author={Mittal, Ashish and Sivasubramanian, Durga and Iyer, Rishabh and Jyothi, Preethi and Ramakrishnan, Ganesh},booktitle={Proceedings of EMNLP (Findings)},pages={5999--6010},year={2022}}
ICASSP
Adaptive discounting of implicit language models in RNN-Transducers
Vinit Unni, Shreya Khare, Ashish Mittal, and 3 more authors
RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique, ADAPTLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. ADAPTLMD uses a two-pronged approach: 1. Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on its outputs. 2. Dynamically choose when to discount the implicit LM (ILM) based on the rarity of recently predicted tokens and the divergence between ILM and implicit acoustic model (IAM) scores. Comparing ADAPTLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
@inproceedings{unni2022adaptive,title={Adaptive discounting of implicit language models in rnn-transducers},author={Unni, Vinit and Khare, Shreya and Mittal, Ashish and Jyothi, Preethi and Sarawagi, Sunita and Bharadwaj, Samarth},booktitle={Proceedings of ICASSP},pages={8122--8126},year={2022}}
Interspeech
SPLICEOUT: A Simple and Efficient Audio Augmentation Method
Arjit Jain, Pranay Reddy Samala, Deepak Mittal, and 2 more authors
In Proceedings of Interspeech, 2022
Pseudocode in the arXiv version
Time masking has become a de facto augmentation technique for speech and audio tasks, including automatic speech recognition (ASR) and audio classification, most notably as a part of SpecAugment. In this work, we propose SpliceOut, a simple modification to time masking which makes it computationally more efficient. SpliceOut performs comparably to (and sometimes outperforms) SpecAugment on a wide variety of speech and audio tasks, including ASR for seven different languages using varying amounts of training data, as well as on speech translation, sound and music classification, thus establishing itself as a broadly applicable audio augmentation method. SpliceOut also provides additional gains when used in conjunction with other augmentation techniques. Apart from the fully-supervised setting, we also demonstrate that SpliceOut can complement unsupervised representation learning with performance gains in the semi-supervised and self-supervised settings.
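A hedged sketch of a SpliceOut-style augmentation, which removes random time spans from a spectrogram rather than zeroing them out, is shown below; the segment count and maximum width are illustrative assumptions, not the paper's exact settings.

```python
import torch

def splice_out(spec, num_segments=2, max_width=20):
    """Remove (rather than mask) a few random time spans from a
    (time, freq) spectrogram and concatenate what remains."""
    time = spec.size(0)
    keep = torch.ones(time, dtype=torch.bool)
    for _ in range(num_segments):
        width = torch.randint(1, max_width + 1, (1,)).item()
        start = torch.randint(0, max(1, time - width), (1,)).item()
        keep[start:start + width] = False
    return spec[keep]

spec = torch.randn(400, 80)        # e.g. 400 frames of 80-dim log-mel features
aug = splice_out(spec)
print(spec.shape, aug.shape)       # the augmented output is shorter along the time axis
```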
@inproceedings{jain2022spliceout,title={{SPLICEOUT}: A Simple and Efficient Audio Augmentation Method},author={Jain, Arjit and Samala, Pranay Reddy and Mittal, Deepak and Jyothi, Preethi and Singh, Maneesh},booktitle={Proceedings of Interspeech},pages={2678--2682},year={2022}}
Interspeech
Linguistically Informed Post-processing for ASR Error correction in Sanskrit.
Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, and 4 more authors
We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies and search space enrichment with linguistic information. More specifically, to address the challenges due to the high degree of out-of-vocabulary entries present in the language, we first use a subword-based language model and acoustic model to generate a search space. The search space, so obtained, is converted into a word-based search space and is further enriched with morphological and lexical information based on a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our proposed approach reports state-of-the-art results in Sanskrit ASR, with a 7.18 absolute point reduction in WER compared to the previous state of the art.
@inproceedings{kumar2022linguistically,title={Linguistically Informed Post-processing for ASR Error correction in Sanskrit.},author={Kumar, Rishabh and Adiga, Devaraja and Ranjan, Rishav and Krishna, Amrith and Ramakrishnan, Ganesh and Goyal, Pawan and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={2293--2297},year={2022}}
2021
EMNLP (Workshop)
The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding
Archiki Prasad, Mohammad Ali Rehan, Shreya Pathak, and 1 more author
In Proceedings of the 1st Workshop on Multilingual Representation Learning (MRL), 2021
This work received an Honorable Mention Award
While recent benchmarks have spurred a lot of new work on improving the generalization of pretrained multilingual language models on multilingual tasks, techniques to improve code-switched natural language understanding tasks have been far less explored. In this work, we propose the use of bilingual intermediate pretraining as a reliable technique to derive large and consistent performance gains on three different NLP tasks using code-switched text. We achieve substantial absolute improvements of 7.87%, 20.15%, and 10.99% on the mean accuracies and F1 scores over previous state-of-the-art systems for Hindi-English Natural Language Inference (NLI), Question Answering (QA) tasks, and Spanish-English Sentiment Analysis (SA), respectively. We show consistent performance gains on four different code-switched language pairs (Hindi-English, Spanish-English, Tamil-English and Malayalam-English) for SA. We also present a code-switched masked language modelling (MLM) pretraining technique that consistently benefits SA compared to standard MLM pretraining using real code-switched text.
@inproceedings{prasad2021effectiveness,title={The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding},author={Prasad, Archiki and Rehan, Mohammad Ali and Pathak, Shreya and Jyothi, Preethi},booktitle={Proceedings of the 1st Workshop on Multilingual Representation Learning ({MRL})},pages={176--190},year={2021}}
Interspeech
Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages
Anuj Diwan and Preethi Jyothi
In Proceedings of Interspeech, 2021
This work was nominated for a Best Student Paper Award
This work presents a seemingly simple but effective technique to improve low-resource ASR systems for phonetic languages. By identifying sets of acoustically similar graphemes in these languages, we first reduce the output alphabet of the ASR system using linguistically meaningful reductions and then reconstruct the original alphabet using a standalone module. We demonstrate that this lessens the burden and improves the performance of low-resource end-to-end ASR systems (because only reduced-alphabet predictions are needed) and that it is possible to design a very simple but effective reconstruction module that recovers sequences in the original alphabet from sequences in the reduced alphabet. We present a finite state transducer-based reconstruction module that operates on the 1-best ASR hypothesis in the reduced alphabet. We demonstrate the efficacy of our proposed technique using ASR systems for two Indian languages, Gujarati and Telugu. With access to only 10 hrs of speech data, we obtain relative WER reductions of up to 7% compared to systems that do not use any reduction.
@inproceedings{diwan2021reduce,title={Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages},author={Diwan, Anuj and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={3445--3449},year={2021}}
Interspeech
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration.
Shreya Khare, Ashish R Mittal, Anuj Diwan, and 3 more authors
Cross-lingual transfer of knowledge from high-resource languages to low-resource languages is an important research problem in automatic speech recognition (ASR). We propose a new strategy of transfer learning by pretraining using large amounts of speech in the high-resource language but with its text transliterated to the target low-resource language. This simple mapping of scripts explicitly encourages increased sharing between the output spaces of both languages and is surprisingly effective even when the high-resource and low-resource languages are from unrelated language families. The utility of our proposed technique is more evident in very low-resource scenarios, where better initializations are more beneficial. We evaluate our technique on a transformer ASR architecture and the state-of-the-art wav2vec 2.0 ASR architecture, with English as the high-resource language and six languages as low-resource targets. With access to 1 hour of target speech, we obtain relative WER reductions of up to 8.2% compared to existing transfer-learning approaches.
@inproceedings{khare2021low,title={Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration.},author={Khare, Shreya and Mittal, Ashish R and Diwan, Anuj and Sarawagi, Sunita and Jyothi, Preethi and Bharadwaj, Samarth},booktitle={Proceedings of Interspeech},pages={1529--1533},year={2021}}
Interspeech
Cross-Modal Learning for Audio-Visual Video Parsing
Jatin
Lamba, Jayaprakash
Akula, Rishabh
Dabral, and
3 more authors
In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries in terms of start and end times of such events. We show how AVVP can benefit from the following techniques geared towards effective cross-modal learning: (i) adversarial training and skip connections, (ii) global context-aware attention, and (iii) self-supervised pretraining using an audio-video grounding objective to obtain cross-modal audio-video representations. We present extensive experimental evaluations on the Look, Listen, and Parse (LLP) dataset and show that we outperform the state-of-the-art Hybrid Attention Network (HAN) on all five metrics proposed for AVVP. We also present several ablations to validate the effect of pretraining, global attention and adversarial training.
@inproceedings{lamba2021cross,title={Cross-Modal Learning for Audio-Visual Video Parsing},author={Lamba, Jatin and Akula, Jayaprakash and Dabral, Rishabh and Jyothi, Preethi and Ramakrishnan, Ganesh and others},booktitle={Proceedings of Interspeech},pages={1937--1941},year={2021}}
Interspeech
MUCS 2021: Multilingual and code-switching ASR challenges for low resource Indian languages
Anuj
Diwan, Rakesh
Vaideeswaran, Sanket
Shah, and
5 more authors
Recently, there has been increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of small amounts of labeled corpora in multiple languages. With multilingualism becoming common in today’s world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics, linguistic characteristics as well as the amount of data available and how these are carefully considered in building the ASR system. In this challenge, we would like to focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English. We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
@inproceedings{diwan2021mucs,title={{MUCS} 2021: Multilingual and code-switching {ASR} challenges for low resource Indian languages},author={Diwan, Anuj and Vaideeswaran, Rakesh and Shah, Sanket and Singh, Ankita and Raghavan, Srinivasa and Khare, Shreya and Unni, Vinit and others},booktitle={Proceedings of Interspeech},pages={2446--2450},year={2021}}
ACL
From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text
Ishan
Tarunesh, Syamantak
Kumar, and Preethi
Jyothi
Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enable the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.
@inproceedings{tarunesh2021machine,title={From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text},author={Tarunesh, Ishan and Kumar, Syamantak and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={3154--3169},year={2021}}
ACL
Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights
Devaraja
Adiga, Rishabh
Kumar, Amrith
Krishna, and
3 more authors
Automatic speech recognition (ASR) in Sanskrit is interesting, owing to the various linguistic peculiarities present in the language. The Sanskrit language is lexically productive, undergoes euphonic assimilation of phones at the word boundaries and exhibits variations in spelling conventions and in pronunciations. In this work, we propose the first large-scale study of automatic speech recognition (ASR) in Sanskrit, with an emphasis on the impact of unit selection in Sanskrit ASR. We release a 78-hour ASR dataset for Sanskrit, which faithfully captures several of the linguistic characteristics expressed by the language. We investigate the role of different acoustic model and language model units in ASR systems for Sanskrit. We also propose a new modelling unit, inspired by syllable-level unit selection, that captures character sequences from one vowel in the word to the next vowel. We also highlight the importance of choosing graphemic representations for Sanskrit and show the impact of this choice on word error rates (WER). Finally, we extend these insights from Sanskrit ASR for building ASR systems in two other Indic languages, Gujarati and Telugu. For both these languages, our experimental results show that the use of phonetic-based graphemic representations in ASR results in performance improvements as compared to ASR systems that use native scripts.
@inproceedings{adiga2021automatic,title={Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights},author={Adiga, Devaraja and Kumar, Rishabh and Krishna, Amrith and Jyothi, Preethi and Ramakrishnan, Ganesh and Goyal, Pawan},booktitle={Proceedings of ACL (Findings)},pages={5039--5050},year={2021}}
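The vowel-to-vowel modelling unit mentioned above can be illustrated on romanized text with the sketch below; this is not the paper's exact unit definition (which operates on Sanskrit orthography), only a rough approximation of units that run from one vowel in a word to the next.

```python
# Illustrative sketch of a vowel-to-vowel unit segmentation on romanized text.
# This only conveys the idea of units spanning from one vowel to the next vowel.
VOWELS = set("aeiou")

def vowel_to_vowel_units(word: str):
    idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not idx:
        return [word]
    units = [word[: idx[0] + 1]]                              # onset consonants + first vowel
    units += [word[i: j + 1] for i, j in zip(idx, idx[1:])]   # one vowel ... next vowel
    if idx[-1] < len(word) - 1:
        units[-1] += word[idx[-1] + 1:]                       # attach any trailing consonants
    return units

print(vowel_to_vowel_units("bharata"))  # ['bha', 'ara', 'ata']
```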
IJCAI
Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning.
Arjit
Jain, Pranay Reddy
Samala, Preethi
Jyothi, and
1 more author
Recent semi-supervised learning (SSL) methods are predominantly focused on multi-class classification tasks. Classification tasks allow for easy mixing of class labels during augmentation which does not trivially extend to structured outputs such as word sequences that appear in tasks like image captioning. Noisy Student Training is a recent SSL paradigm proposed for image classification that is an extension of self-training and teacher-student learning. In this work, we provide an in-depth analysis of the noisy student SSL framework for the task of image captioning and derive state-of-the-art results. The original algorithm relies on computationally expensive data augmentation steps that involve perturbing the raw images and computing features for each perturbed image. We show that, even in the absence of raw image augmentation, the use of simple model and feature perturbations to the input images for the student model is beneficial to SSL training. We also show how a paraphrase generator could be effectively used for label augmentation to improve the quality of pseudo labels and significantly improve performance. Our final results in the limited labeled data setting (1% of the MS-COCO labeled data) outperform previous state-of-the-art approaches by 2.5 BLEU-4 points and 11.5 CIDEr points.
@inproceedings{jain2021perturb,title={Perturb, Predict \& Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning.},author={Jain, Arjit and Samala, Pranay Reddy and Jyothi, Preethi and Mittal, Deepak},booktitle={Proceedings of IJCAI},pages={758--764},year={2021}}
NAACL Workshop
The effect of pretraining on extractive summarization for scientific documents
Yash
Gupta, Pawan Sasanka
Ammanamanchi, Shikha
Bordia, and
7 more authors
In Proceedings of the Second Workshop on Scholarly Document Processing, 2021
Large pretrained models have seen enormous success in extractive summarization tasks. In this work, we investigate the influence of pretraining on a BERT-based extractive summarization system for scientific documents. We derive significant performance improvements using an intermediate pretraining step that leverages existing summarization datasets and report state-of-the-art results on a recently released scientific summarization dataset, SciTLDR. We systematically analyze the intermediate pretraining step by varying the size and domain of the pretraining corpus, changing the length of the input sequence in the target task and varying target tasks. We also investigate how intermediate pretraining interacts with contextualized word embeddings trained on different domains.
@inproceedings{gupta2021effect,title={The effect of pretraining on extractive summarization for scientific documents},author={Gupta, Yash and Ammanamanchi, Pawan Sasanka and Bordia, Shikha and Manoharan, Arjun and Mittal, Deepak and Pasunuru, Ramakanth and Shrivastava, Manish and Singh, Maneesh and Bansal, Mohit and Jyothi, Preethi},booktitle={Proceedings of the Second Workshop on Scholarly Document Processing},pages={73--82},year={2021}}
SIGIR
Select, substitute, search: A new benchmark for knowledge-augmented visual question answering
Aman
Jain, Mayank
Kothyari, Vishwajeet
Kumar, and
3 more authors
Multimodal IR, spanning text corpus, knowledge graph and images, called outside knowledge visual question answering (OKVQA), is of much recent interest. However, the popular data set has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, some require OCR or are otherwise answerable from the image alone. To add to the above limitations, frequency-based guessing is very effective because of (unintended) widespread answer overlaps between the train and test folds. Overall, it is hard to determine when state-of-the-art systems exploit these weaknesses rather than really infer the answers, because they are opaque and their 'reasoning' process is uninterpretable. An equally important limitation is that the dataset is designed for the quantitative assessment only of the end-to-end answer retrieval task, with no provision for assessing the correct (semantic) interpretation of the input query. In response, we identify a key structural idiom in OKVQA, viz., S3 (select, substitute and search), and build a new data set and challenge around it. Specifically, the questioner identifies an entity in the image and asks a question involving that entity which can be answered only by consulting a knowledge graph or corpus passage mentioning the entity. Our challenge consists of (i) OKVQA_S3, a subset of OKVQA annotated based on the structural idiom and (ii) S3VQA, a new dataset built from scratch. We also present a neural but structurally transparent OKVQA system, S3, that explicitly addresses our challenge dataset, and outperforms recent competitive baselines.
@inproceedings{jain2021select,title={Select, substitute, search: A new benchmark for knowledge-augmented visual question answering},author={Jain, Aman and Kothyari, Mayank and Kumar, Vishwajeet and Jyothi, Preethi and Ramakrishnan, Ganesh and Chakrabarti, Soumen},booktitle={Proceedings of SIGIR},pages={2491--2498},year={2021},}
ICASSP
An investigation of end-to-end models for robust speech recognition
Archiki
Prasad, Preethi
Jyothi, and Rajbabu
Velmurugan
End-to-end models for robust automatic speech recognition (ASR) have not been sufficiently well-explored in prior work. With end-to-end models, one could choose to preprocess the input speech using speech enhancement techniques and train the model using enhanced speech. Another alternative is to pass the noisy speech as input and modify the model architecture to adapt to noisy speech. A systematic comparison of these two approaches for end-to-end robust ASR has not been attempted before. We address this gap and present a detailed comparison of speech enhancement-based techniques and three different model-based adaptation techniques covering data augmentation, multi-task learning, and adversarial learning for robust ASR. While adversarial learning is the best-performing technique on certain noise types, it comes at the cost of degrading clean speech WER. On other relatively stationary noise types, a new speech enhancement technique outperformed all the model-based adaptation techniques. This suggests that knowledge of the underlying noise type can meaningfully inform the choice of adaptation technique.
@inproceedings{prasad2021investigation,title={An investigation of end-to-end models for robust speech recognition},author={Prasad, Archiki and Jyothi, Preethi and Velmurugan, Rajbabu},booktitle={Proceedings of ICASSP},pages={6893--6897},year={2021}}
ICASSP
Error-driven fixed-budget asr personalization for accented speakers
Abhijeet
Awasthi, Aman
Kansal, Sunita
Sarawagi, and
1 more author
We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker-specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker’s utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speaker-specific data to learn phoneme-level error models which help us select such sentences. We show that the speaker’s utterances on the sentences selected using our error model indeed have larger error rates when compared to the speaker’s utterances on randomly selected sentences. We find that fine-tuning the ASR model on the sentence utterances selected with the help of error models yields higher WER improvements in comparison to fine-tuning on an equal number of randomly selected sentence utterances. Thus, our method provides an efficient way of collecting speaker utterances under budget constraints for personalizing ASR models.
@inproceedings{awasthi2021error,title={Error-driven fixed-budget asr personalization for accented speakers},author={Awasthi, Abhijeet and Kansal, Aman and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of ICASSP},pages={7033--7037},year={2021},}
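The sentence-selection step can be sketched as follows: score each candidate sentence by its expected number of phoneme errors under a per-phoneme error model estimated from the speaker's small recorded sample, and keep the highest-scoring sentences within the budget. The error probabilities and phoneme sequences below are hypothetical.

```python
# Sketch of error-driven sentence selection under a fixed recording budget.
# Hypothetical per-phoneme error probabilities learned from a tiny speaker sample.
phoneme_error_prob = {"TH": 0.42, "V": 0.31, "AA": 0.08, "K": 0.05, "AH": 0.04}

# Hypothetical candidate sentences with their phoneme sequences.
candidates = {
    "the vase is on the table": ["DH", "AH", "V", "AA", "Z", "TH"],
    "call me back tomorrow":    ["K", "AO", "L", "M", "IY", "K"],
}

def expected_errors(phones):
    """Expected number of misrecognized phonemes (0.2 default for unseen phonemes)."""
    return sum(phoneme_error_prob.get(p, 0.2) for p in phones)

budget = 1  # number of sentences the speaker will record
selected = sorted(candidates, key=lambda s: expected_errors(candidates[s]), reverse=True)[:budget]
print(selected)  # sentences predicted to be hardest for the current ASR model
```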
ICASSP
Collaborative learning to generate audio-video jointly
Vinod K
Kurmi, Vipul
Bajaj, Badri N
Patro, and
3 more authors
There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for both audio and video, has not been sufficiently well-explored. Towards this, we propose a method that demonstrates that we are able to generate naturalistic samples of video and audio data by the joint correlated generation of audio and video modalities. The proposed method uses multiple discriminators to ensure that the audio, video, and the joint output are also indistinguishable from real-world samples. We present a dataset for this task and show that we are able to generate realistic samples. This method is validated using various standard metrics such as Inception Score, Fréchet Inception Distance (FID) and through human evaluation.
@inproceedings{kurmi2021collaborative,title={Collaborative learning to generate audio-video jointly},author={Kurmi, Vinod K and Bajaj, Vipul and Patro, Badri N and Venkatesh, KS and Namboodiri, Vinay P and Jyothi, Preethi},booktitle={Proceedings of ICASSP},pages={4180--4184},year={2021}}
EACL
Disfluency correction using unsupervised and semi-supervised learning
Nikhil
Saini, Drumil
Trivedi, Shreya
Khare, and
4 more authors
Spoken language is different from the written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This paper introduces a disfluency correction model that translates disfluent to fluent text by drawing inspiration from recent encoder-decoder unsupervised style-transfer models for text. We also show considerable benefits in performance when utilizing a small sample of 500 parallel disfluent-fluent sentences in a semi-supervised way. Our unsupervised approach achieves a BLEU score of 79.39 on the Switchboard corpus test set, with further improvement to a BLEU score of 85.28 with semi-supervision. Both are comparable to two competitive fully-supervised models.
@inproceedings{saini2021disfluency,title={Disfluency correction using unsupervised and semi-supervised learning},author={Saini, Nikhil and Trivedi, Drumil and Khare, Shreya and Dhamecha, Tejas and Jyothi, Preethi and Bharadwaj, Samarth and Bhattacharyya, Pushpak},booktitle={Proceedings of EACL},pages={3421--3427},year={2021}}
EACL
Meta-Learning for Effective Multi-task and Multilingual Modelling
Ishan
Tarunesh, Sushil
Khyalia, Vishwajeet
Kumar, and
2 more authors
Natural language processing (NLP) tasks (e.g. question-answering in English) benefit from knowledge of other tasks (e.g. named entity recognition in English) and knowledge of other languages (e.g. question-answering in Spanish). Such shared representations are typically learned in isolation, either across tasks or across languages. In this work, we propose a meta-learning approach to learn the interactions between both tasks and languages. We also investigate the role of different sampling strategies used during meta-learning. We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset. Our meta-learned model clearly improves in performance compared to competitive baseline models that also include multi-task baselines. We also present zero-shot evaluations on unseen target languages to demonstrate the utility of our proposed model.
@inproceedings{tarunesh2021meta,title={Meta-Learning for Effective Multi-task and Multilingual Modelling},author={Tarunesh, Ishan and Khyalia, Sushil and Kumar, Vishwajeet and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of EACL},pages={3600--3612},year={2021}}
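As a rough illustration of meta-learning over (task, language) pairs, the sketch below uses a first-order Reptile-style update with a toy linear model and a uniform episode sampler; the paper's actual meta-learning algorithm, model, and sampling strategies are not reproduced here.

```python
# First-order meta-learning sketch over (task, language) episodes.
# This is a Reptile-style stand-in, not the paper's exact algorithm or sampler.
import random
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)                      # tiny shared model
episodes = [("QA", "hi"), ("NER", "es"), ("QA", "es"), ("NLI", "de")]

def sample_batch(task, lang, n=8):
    # Hypothetical data loader: returns random features/labels for illustration.
    return torch.randn(n, 16), torch.randint(0, 4, (n,))

meta_lr, inner_lr, inner_steps = 0.1, 0.01, 3
for meta_step in range(100):
    task, lang = random.choice(episodes)            # uniform sampling strategy
    fast = torch.nn.Linear(16, 4)
    fast.load_state_dict(model.state_dict())        # clone the current initialization
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                    # inner-loop adaptation on the episode
        x, y = sample_batch(task, lang)
        loss = torch.nn.functional.cross_entropy(fast(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                           # Reptile-style outer update
        for p, q in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (q - p)
```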
2020
Interspeech
Black-Box Adaptation of ASR for Accented Speech
Kartik
Khandelwal, Preethi
Jyothi, Abhijeet
Awasthi, and
1 more author
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent. While leading online ASR services obtain impressive performance on mainstream accents, they perform poorly on sub-populations: we observed that the word error rate (WER) achieved by Google’s ASR API on Indian accents is almost twice the WER on US accents. Existing adaptation methods either require access to model parameters or overlay an error-correcting module on output transcripts. We highlight the need for correlating outputs with the original speech to fix accent errors. Accordingly, we propose a novel coupling of an open-source accent-tuned local model with the black-box service where the output from the service guides frame-level inference in the local model. Our fine-grained merging algorithm is better at fixing accent errors than existing word-level combination strategies. Experiments on Indian and Australian accents with three leading ASR models as the service show that we achieve as much as 28% relative reduction in WER over both the local and service models.
@inproceedings{khandelwal2020black,title={Black-Box Adaptation of ASR for Accented Speech},author={Khandelwal, Kartik and Jyothi, Preethi and Awasthi, Abhijeet and Sarawagi, Sunita},booktitle={Proceedings of Interspeech},pages={1281--1285},year={2020}}
Interspeech
Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS
Yash
Sharma, Basil
Abraham, Karan
Taneja, and
1 more author
Building Automatic Speech Recognition (ASR) systems for code-switched speech has recently gained renewed attention due to the widespread use of speech technologies in multilingual communities worldwide. End-to-end ASR systems are a natural modeling choice due to their ease of use and superior performance in monolingual settings. However, it is well known that end-to-end systems require large amounts of labeled speech. In this work, we investigate improving code-switched ASR in low resource settings via data augmentation using code-switched text-to-speech (TTS) synthesis. We propose two targeted techniques to effectively leverage TTS speech samples: 1) Mixup, an existing technique to create new training samples via linear interpolation of existing samples, applied to TTS and real speech samples, and 2) a new loss function, used in conjunction with TTS samples, to encourage code-switched predictions. We report significant improvements in ASR performance achieving absolute word error rate (WER) reductions of up to 5%, and measurable improvement in code switching using our proposed techniques on a Hindi-English code-switched ASR task.
@inproceedings{sharma2020improving,title={Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS},author={Sharma, Yash and Abraham, Basil and Taneja, Karan and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={4771--4775},year={2020}}
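The mixup component can be sketched at the acoustic-feature level as below, with the mixing weight drawn from a Beta distribution as in standard mixup; the feature matrices are random placeholders and asr_loss stands in for the model's actual sequence loss, so this shows only the shape of the idea.

```python
# Sketch of mixup between a real code-switched utterance and a TTS utterance.
import numpy as np

rng = np.random.default_rng(0)

def asr_loss(features, transcript):
    # Placeholder for a real CTC/attention loss over (features, transcript).
    return float(np.mean(features ** 2)) + 0.01 * len(transcript)

def mixup_loss(real_feats, real_text, tts_feats, tts_text, alpha=0.2):
    lam = rng.beta(alpha, alpha)
    T = min(len(real_feats), len(tts_feats))                  # align lengths crudely
    mixed = lam * real_feats[:T] + (1.0 - lam) * tts_feats[:T]
    # Convex combination of the losses w.r.t. both reference transcripts.
    return lam * asr_loss(mixed, real_text) + (1.0 - lam) * asr_loss(mixed, tts_text)

real = rng.standard_normal((120, 80))   # 120 frames of 80-dim filterbank features
tts = rng.standard_normal((110, 80))
print(mixup_loss(real, "mera phone kho gaya", tts, "call me later"))
```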
Interspeech
Caption alignment for low resource audio-visual data
Vighnesh Reddy
Konda, Mayur
Warialani, Rakesh Prasanth
Achari, and
6 more authors
Understanding videos via captioning has gained a lot of traction recently. While captions are provided alongside videos, the information about where a caption aligns within a video is missing, which could be particularly useful for indexing and retrieval. Existing work on learning to infer alignments has mostly exploited visual features and ignored the audio signal. Video understanding applications often underestimate the importance of the audio modality. We focus on how to make effective use of the audio modality for temporal localization of captions within videos. We release a new audio-visual dataset that has captions time-aligned by (i) carefully listening to the audio and watching the video, and (ii) watching only the video. Our dataset is audio-rich and contains captions in two languages, English and Marathi (a low-resource language). We further propose an attention-driven multimodal model for effective utilization of both audio and video for temporal localization. We then investigate (i) the effects of audio in both data preparation and model design, and (ii) effective pretraining strategies (AudioSet, ASR bottleneck features, PASE, etc.) for handling the low-resource setting to help extract rich audio representations.
@inproceedings{konda2020caption,title={Caption alignment for low resource audio-visual data},author={Konda, Vighnesh Reddy and Warialani, Mayur and Achari, Rakesh Prasanth and Bhatnagar, Varad and Akula, Jayaprakash and Jyothi, Preethi and Ramakrishnan, Ganesh and Haffari, Gholamreza and Singh, Pankaj},booktitle={Proceedings of Interspeech},pages={3525--3529},year={2020},}
ACL
How accents confound: Probing for accent information in end-to-end speech recognition systems
In this work, we present a detailed analysis of how accent information is reflected in the internal representation of speech in an end-to-end automatic speech recognition (ASR) system. We use a state-of-the-art end-to-end ASR system, comprising convolutional and recurrent layers, that is trained on a large amount of US-accented English speech and evaluate the model on speech samples from seven different English accents. We examine the effects of accent on the internal representation using three main probing techniques: a) Gradient-based explanation methods, b) Information-theoretic measures, and c) Outputs of accent and phone classifiers. We find different accents exhibiting similar trends irrespective of the probing technique used. We also find that most accent information is encoded within the first recurrent layer, which is suggestive of how one could adapt such an end-to-end model to learn representations that are invariant to accents.
@inproceedings{prasad2020accents,title={How accents confound: Probing for accent information in end-to-end speech recognition systems},author={Prasad, Archiki and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={3739--3753},year={2020},}
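One of the three probing techniques (accent classifiers on internal representations) can be sketched as below. The layer activations here are synthetic placeholders with a weak injected accent signal; in practice they would be pooled hidden states extracted from each layer of the trained ASR model.

```python
# Probing sketch: train a linear accent classifier on frozen layer activations and
# compare accuracy across layers. Activations here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utts, n_accents = 600, 7
accents = rng.integers(0, n_accents, size=n_utts)

for layer in ["conv2", "rnn1", "rnn2", "rnn5"]:
    # Synthetic stand-in for mean-pooled activations of this layer
    # (a weak accent signal is injected so the probe has something to find).
    feats = rng.standard_normal((n_utts, 256)) \
        + 0.3 * np.eye(n_accents)[accents] @ rng.standard_normal((n_accents, 256))
    Xtr, Xte, ytr, yte = train_test_split(feats, accents, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(layer, "probe accuracy:", round(probe.score(Xte, yte), 3))
```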
ICASSP
Coupled training of sequence-to-sequence models for accented speech recognition
Accented speech poses significant challenges for state-of-the-art automatic speech recognition (ASR) systems. Accent is a property of speech that lasts throughout an utterance in varying degrees of strength. This makes it hard to isolate the influence of accent on individual speech sounds. We propose coupled training for encoder-decoder ASR models that acts on pairs of utterances corresponding to the same text spoken by speakers with different accents. This training regime introduces an L2 loss between the attention-weighted representations corresponding to pairs of utterances with the same text, thus acting as a regularizer and encouraging representations from the encoder to be more accent-invariant. We focus on recognizing accented English samples from the Mozilla Common Voice corpus. We obtain significant error rate reductions on accented samples from a large set of diverse accents using coupled training. We also show consistent improvements in performance on heavily accented samples (as determined by a standalone accent classifier).
@inproceedings{unni2020coupled,title={Coupled training of sequence-to-sequence models for accented speech recognition},author={Unni, Vinit and Joshi, Nitish and Jyothi, Preethi},booktitle={Proceedings of ICASSP},pages={8254--8258},year={2020}}
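The coupling regularizer can be sketched as an L2 penalty between attention-weighted encoder summaries of two utterances sharing the same text, added to the usual per-utterance ASR losses. Encoder states and attention weights below are random placeholders.

```python
# Sketch of the coupling regularizer for paired utterances with identical text.
import torch

torch.manual_seed(0)

def attention_summary(encoder_states, attn_weights):
    """Attention-weighted sum of encoder states: (T, D), (T,) -> (D,)."""
    return (attn_weights.unsqueeze(1) * encoder_states).sum(dim=0)

# Placeholders for encoder outputs of two accented readings of the same sentence,
# plus their decoder attention weights.
enc_a, enc_b = torch.randn(90, 256), torch.randn(104, 256)
attn_a = torch.softmax(torch.randn(90), dim=0)
attn_b = torch.softmax(torch.randn(104), dim=0)

coupling_loss = torch.nn.functional.mse_loss(
    attention_summary(enc_a, attn_a), attention_summary(enc_b, attn_b)
)
# total_loss = asr_loss_a + asr_loss_b + lambda_couple * coupling_loss
print(float(coupling_loss))
```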
LREC
Crowdsourcing speech data for low-resource languages from low-income workers
Basil
Abraham, Danish
Goel, Divya
Siddarth, and
7 more authors
Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology whose dialects are often very different from low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to providing diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study where we collected labelled speech data in the Marathi language from three different user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work) and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.
@inproceedings{abraham2020crowdsourcing,title={Crowdsourcing speech data for low-resource languages from low-income workers},author={Abraham, Basil and Goel, Danish and Siddarth, Divya and Bali, Kalika and Chopra, Manu and Choudhury, Monojit and Joshi, Pratik and Jyothi, Preethi and Sitaram, Sunayana and Seshadri, Vivek},booktitle={Proceedings of LREC},pages={2819--2826},year={2020},}
IWSLT
Generating fluent translations from disfluent text without access to fluent references: IIT Bombay@ IWSLT2020
Nikhil
Saini, Jyotsana
Khatri, Preethi
Jyothi, and
1 more author
Machine translation systems perform reasonably well when the input is well-formed speech or text. Conversational speech is spontaneous and inherently consists of many disfluencies. Producing fluent translations of disfluent source text would typically require parallel disfluent to fluent training data. However, fluent translations of spontaneous speech are an additional resource that is tedious to obtain. This work describes the submission of IIT Bombay to the Conversational Speech Translation challenge at IWSLT 2020. We specifically tackle the problem of disfluency removal in disfluent-to-fluent text-to-text translation assuming no access to fluent references during training. Common patterns of disfluency are extracted from disfluent references and a noise induction model is used to simulate them starting from a clean monolingual corpus. This synthetically constructed dataset is then considered as a proxy for labeled data during training. We also make use of additional fluent text in the target language to help generate fluent translations. This work uses no fluent references during training and beats a baseline model by a margin of 4.21 and 3.11 BLEU points where the baseline uses disfluent and fluent references, respectively.
@inproceedings{saini2020generating,title={Generating fluent translations from disfluent text without access to fluent references: IIT Bombay@ IWSLT2020},author={Saini, Nikhil and Khatri, Jyotsana and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of IWSLT},pages={178--186},year={2020},}
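The noise-induction step (simulating disfluencies on clean monolingual text) can be sketched as below; the filler inventory and insertion probabilities are hypothetical, not the disfluency patterns actually extracted from the disfluent references.

```python
# Sketch of a noise-induction model: turn fluent text into synthetic disfluent text
# so that (disfluent, fluent) pairs can be created without fluent references.
import random

random.seed(0)
FILLERS = ["uh", "um", "you know"]          # hypothetical filler inventory

def add_disfluencies(sentence, p_filler=0.15, p_repeat=0.10):
    noisy = []
    for word in sentence.split():
        if random.random() < p_filler:
            noisy.append(random.choice(FILLERS))
        noisy.append(word)
        if random.random() < p_repeat:      # simulate a repetition-type disfluency
            noisy.append(word)
    return " ".join(noisy)

fluent = "i want to book a ticket to mumbai tomorrow"
print(add_disfluencies(fluent))             # synthetic disfluent source for training
```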
2019
ACL
Cross-Lingual Training for Automatic Question Generation
Vishwajeet
Kumar, Nitish
Joshi, Arijit
Mukherjee, and
2 more authors
Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) Unsupervised pretraining of language models in both primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. We also create and release a new question answering dataset for Hindi consisting of 6555 sentences.
@inproceedings{kumar2019cross,title={Cross-Lingual Training for Automatic Question Generation},author={Kumar, Vishwajeet and Joshi, Nitish and Mukherjee, Arijit and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={4863--4872},year={2019},}
Interspeech
Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition.
Karan
Taneja, Satarupa
Guha, Preethi
Jyothi, and
1 more author
One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Often, however, monolingual speech corpora are available in abundance for the languages in the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the span lengths corresponding to a particular language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
@inproceedings{taneja2019exploiting,title={Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition.},author={Taneja, Karan and Guha, Satarupa and Jyothi, Preethi and Abraham, Basil},booktitle={Proceedings of Interspeech},pages={2150--2154},year={2019}}
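At the word level, the construction of synthetic code-mixed utterances can be sketched as below: alternate between languages, sampling each span length from an (assumed) empirical distribution and drawing that many words from the corresponding monolingual utterance. The paper additionally constrains phone transitions at switch boundaries and concatenates the matching speech segments, which this sketch omits.

```python
# Sketch of synthetic code-mixed utterance construction from monolingual data.
# Span-length samples below are hypothetical stand-ins for the distributions
# estimated from real code-mixed text.
import random

random.seed(1)
span_lengths = {"hi": [1, 2, 2, 3], "en": [1, 1, 2]}   # assumed empirical samples

hindi_words = "mujhe kal subah jaldi uthna hai kyunki meeting hai".split()
english_words = "please remind me about the morning meeting today".split()
streams = {"hi": iter(hindi_words), "en": iter(english_words)}

mixed, lang = [], "hi"
while True:
    span = random.choice(span_lengths[lang])
    chunk = [w for _, w in zip(range(span), streams[lang])]
    if not chunk:                                      # one stream exhausted: stop
        break
    mixed.extend(chunk)
    lang = "en" if lang == "hi" else "hi"              # switch language
print(" ".join(mixed))
```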
2018
EMNLP
Revisiting the Importance of Encoding Logic Rules in Sentiment Classification
We analyze the performance of different sentiment classification models on syntactically complex inputs like A-but-B sentences. The first contribution of this analysis addresses reproducible research: to meaningfully compare different models, their accuracies must be averaged over far more random seeds than what has traditionally been reported. With proper averaging in place, we notice that the distillation model, which incorporates explicit logic rules for sentiment classification, is ineffective. In contrast, using contextualized ELMo embeddings instead of logic rules yields significantly better performance. Additionally, we provide analysis and visualizations that demonstrate ELMo’s ability to implicitly learn logic rules. Finally, a crowdsourced analysis reveals how ELMo outperforms baseline models even on sentences with ambiguous sentiment labels.
@inproceedings{krishna2018revisiting,title={Revisiting the Importance of Encoding Logic Rules in Sentiment Classification},author={Krishna, Kalpesh and Jyothi, Preethi and Iyyer, Mohit},booktitle={Proceedings of EMNLP},pages={4743--4751},year={2018},}
EMNLP
Code-switched Language Models Using Dual RNNs and Same-Source Pretraining
This work focuses on building language models (LMs) for code-switched text. We propose two techniques that significantly improve these LMs: 1) A novel recurrent neural network unit with dual components that focus on each language in the code-switched text separately 2) Pretraining the LM using synthetic text from a generative model estimated using the training data. We demonstrate the effectiveness of our proposed techniques by reporting perplexities on a Mandarin-English task and derive significant reductions in perplexity.
@inproceedings{garg2018code,title={Code-switched Language Models Using Dual RNNs and Same-Source Pretraining},author={Garg, Saurabh and Parekh, Tanmay and Jyothi, Preethi},booktitle={Proceedings of EMNLP},pages={3078--3083},year={2018},}
Interspeech
Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning.
One of the major remaining challenges in modern automatic speech recognition (ASR) systems for English is to be able to handle speech from users with a diverse set of accents. ASR systems that are trained on speech from multiple English accents still underperform when confronted with a new speech accent. In this work, we explore how to use accent embeddings and multi-task learning to improve speech recognition for accented speech. We propose a multi-task architecture that jointly learns an accent classifier and a multi-accent acoustic model. We also consider augmenting the speech input with accent information in the form of embeddings extracted by a separate network. These techniques together give significant relative performance improvements of 15% and 10% over a multi-accent baseline system on test sets containing seen and unseen accents, respectively.
@inproceedings{jain2018improved,title={Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning.},author={Jain, Abhinav and Upreti, Minali and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={2454--2458},year={2018}}
Interspeech
Dual Language Models for Code Switched Speech Recognition
In this work, we present a simple and elegant approach to language modeling for bilingual code-switched text. Since code-switching is a blend of two or more different languages, a standard bilingual language model can be improved upon by using structures of the monolingual language models. We propose a novel technique called dual language models, which involves building two complementary monolingual language models and combining them using a probabilistic model for switching between the two. We evaluate the efficacy of our approach using a conversational Mandarin-English speech corpus. We prove the robustness of our model by showing significant improvements in perplexity measures over the standard bilingual language model without the use of any external information. Similar consistent improvements are also reflected in automatic speech recognition error rates.
@inproceedings{garg2018dual,title={Dual Language Models for Code Switched Speech Recognition},author={Garg, Saurabh and Parekh, Tanmay and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={2598--2602},year={2018}}
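The dual language model idea can be illustrated with unigram component LMs and a single switch probability, as below; the actual work combines far stronger monolingual LMs with a richer switching model, so treat this purely as a shape-of-the-idea sketch with made-up probabilities.

```python
# Toy dual language model: two monolingual LMs plus a probabilistic switch.
import math

# Hypothetical unigram probabilities for each monolingual LM.
LM = {
    "en": {"call": 0.02, "me": 0.05, "later": 0.01},
    "hi": {"mujhe": 0.03, "baad": 0.02, "mein": 0.06},
}
P_SWITCH = 0.3          # probability of switching language at a word boundary

def word_lang(w):
    return "en" if w in LM["en"] else "hi"

def log_prob(sentence, oov=1e-5):
    lp, prev = 0.0, None
    for w in sentence.split():
        lang = word_lang(w)
        if prev is not None:
            lp += math.log(P_SWITCH if lang != prev else 1.0 - P_SWITCH)
        lp += math.log(LM[lang].get(w, oov))
        prev = lang
    return lp

print(log_prob("call me baad mein"))   # code-switched sentence scored by the dual LM
```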
Interspeech
Time Aggregation Operators for Multi-label Audio Event Detection.
Pankaj
Joshi, Digvijaysingh
Gautam, Ganesh
Ramakrishnan, and
1 more author
In this paper, we present a state-of-the-art system for audio event detection. The labels on the training (and evaluation) data specify the set of events occurring in each audio clip, but neither the time spans nor the order in which they occur. Specifically, our task of weakly supervised learning is the “Detection and Classification of Acoustic Scenes and Events (DCASE) 2017” challenge [5]. We use the winning entry in this challenge given by Xu et al. [10] as our starting point and identify several important modifications that allow us to improve on their results significantly. Our techniques pertain to aggregation and consolidation over time and frequency signals over a (temporal) sequence before decoding the labels. In general, our work is also relevant to other tasks involving learning from weak labeling of sequential data.
@inproceedings{joshi2018time,title={Time Aggregation Operators for Multi-label Audio Event Detection.},author={Joshi, Pankaj and Gautam, Digvijaysingh and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={3309--3313},year={2018},}
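The notion of a time-aggregation operator can be illustrated as below: per-frame event probabilities are pooled into clip-level scores, and the choice of pooling (mean, max, or attention-style weighted pooling) is the kind of operator the paper's modifications concern. The probabilities and attention scores here are random placeholders.

```python
# Sketch of time-aggregation operators for weakly supervised audio event detection:
# pool per-frame event probabilities (T x C) into clip-level scores (C,).
import numpy as np

rng = np.random.default_rng(0)
frame_probs = rng.uniform(size=(240, 17))          # 240 frames, 17 event classes

def mean_pool(p):
    return p.mean(axis=0)

def max_pool(p):
    return p.max(axis=0)

def attention_pool(p, attn_logits):
    w = np.exp(attn_logits) / np.exp(attn_logits).sum(axis=0, keepdims=True)
    return (w * p).sum(axis=0)                     # per-class weighted average over time

attn_logits = rng.standard_normal(frame_probs.shape)   # placeholder attention scores
for name, clip_scores in [("mean", mean_pool(frame_probs)),
                          ("max", max_pool(frame_probs)),
                          ("attn", attention_pool(frame_probs, attn_logits))]:
    print(name, (clip_scores > 0.5).astype(int)[:5])   # thresholded clip-level labels
```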
ICLR
Generalizing Across Domains via Cross-Gradient Training
Shiv
Shankar, Vihari
Piratla, Soumen
Chakrabarti, and
3 more authors
We present CROSSGRAD, a method to use multi-domain training data to learn a classifier that generalizes to new domains. CROSSGRAD does not need an adaptation phase via labeled or unlabeled data, or domain features in the new domain. Most existing domain adaptation methods attempt to erase domain signals using techniques like domain adversarial training. In contrast, CROSSGRAD is free to use domain signals for predicting labels, if it can prevent overfitting on training domains. We conceptualize the task in a Bayesian setting, in which a sampling step is implemented as data augmentation, based on domain-guided perturbations of input instances. CROSSGRAD parallelly trains a label and a domain classifier on examples perturbed by loss gradients of each other’s objectives. This enables us to directly perturb inputs, without separating and re-mixing domain signals while making various distributional assumptions. Empirical evaluation on three different applications where this setting is natural establishes that (1) domain-guided perturbation provides consistently better generalization to unseen domains, compared to generic instance perturbation methods, and that (2) data augmentation is a more stable and accurate method than domain adversarial training.
@inproceedings{shankar2018generalizing,title={Generalizing Across Domains via Cross-Gradient Training},author={Shankar, Shiv and Piratla, Vihari and Chakrabarti, Soumen and Chaudhuri, Siddhartha and Jyothi, Preethi and Sarawagi, Sunita},booktitle={Proceedings of ICLR},year={2018},}
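The core CROSSGRAD step can be sketched as below with toy models and random data: each input is perturbed along the gradient of the other classifier's loss before being used to train its own classifier. The hyperparameters and the absence of gradient clipping are simplifications relative to the paper.

```python
# Sketch of CROSSGRAD's cross-gradient perturbation (toy models and data).
import torch

torch.manual_seed(0)
label_net = torch.nn.Linear(32, 5)     # label classifier
domain_net = torch.nn.Linear(32, 3)    # domain classifier
params = list(label_net.parameters()) + list(domain_net.parameters())
opt = torch.optim.SGD(params, lr=0.01)
ce = torch.nn.functional.cross_entropy
eps, alpha = 0.5, 0.75                 # perturbation scale and loss mixing weight

for step in range(200):
    x = torch.randn(16, 32)            # placeholder features
    y = torch.randint(0, 5, (16,))     # class labels
    d = torch.randint(0, 3, (16,))     # domain labels

    x.requires_grad_(True)
    grad_domain = torch.autograd.grad(ce(domain_net(x), d), x)[0]
    grad_label = torch.autograd.grad(ce(label_net(x), y), x)[0]
    x = x.detach()
    x_for_label = x + eps * grad_domain    # domain-guided perturbation for the label net
    x_for_domain = x + eps * grad_label    # label-guided perturbation for the domain net

    loss = (alpha * ce(label_net(x), y) + (1 - alpha) * ce(label_net(x_for_label), y)
            + alpha * ce(domain_net(x), d) + (1 - alpha) * ce(domain_net(x_for_domain), d))
    opt.zero_grad(); loss.backward(); opt.step()
```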
2017
ASRU
Leveraging native language speech for accent identification using deep siamese networks
Aditya
Siddhant, Preethi
Jyothi, and Sriram
Ganapathy
The problem of automatic accent identification is important for several applications like speaker profiling and recognition as well as for improving speech recognition systems. The accented nature of speech can be primarily attributed to the influence of the speaker’s native language on the given speech recording. In this paper, we propose a novel accent identification system whose training exploits speech in native languages along with the accented speech. Specifically, we develop a deep Siamese network based model which learns the association between accented speech recordings and the native language speech recordings. The Siamese networks are trained with i-vector features extracted from the speech recordings using either an unsupervised Gaussian mixture model (GMM) or a supervised deep neural network (DNN) model. We perform several accent identification experiments using the CSLU Foreign Accented English (FAE) corpus. In these experiments, our proposed approach using deep Siamese networks yields significant relative performance improvements of 15.4% on a 10-class accent identification task, over a baseline DNN-based classification system that uses GMM i-vectors. Furthermore, we present a detailed error analysis of the proposed accent identification system.
@inproceedings{siddhant2017leveraging,title={Leveraging native language speech for accent identification using deep siamese networks},author={Siddhant, Aditya and Jyothi, Preethi and Ganapathy, Sriram},booktitle={Proceedings of ASRU},pages={621--628},year={2017},}
Asilomar
Mismatched crowdsourcing: Mining latent skills to acquire speech transcriptions
Mark
Hasegawa-Johnson, Preethi
Jyothi, Wenda
Chen, and
1 more author
Automatic speech recognition (ASR) converts audio to text. ASR is usually trained using a large quantity of labeled data, i.e., audio with text transcription. In many languages, however, text transcription is hard to find, e.g., in both Hokkien and Dinka, we found native speakers who had received all their primary education in some other language, and who therefore had difficulty writing in their own language. Fortunately, speech in every language is produced by human mouths, and designed to be interpreted by human ears. Speakers of a majority language (English, say, or Mandarin Chinese) are therefore able to make some sense of even the strangest language (Zulu, say, or Cantonese): language-unique distinctions are mostly lost, but universal distinctions such as consonant versus vowel are, for the most part, correctly transmitted. We can decode such mismatched transcripts using an information-theoretic decoder, resulting in a low-entropy probability distribution over the possible native-language transcriptions. Mismatched transcripts can be used to train ASR. Combining ten hours of mismatched transcripts with 12–48 minutes of native transcripts, if available, results in lower phone error rate. On the other hand, if we don’t even know the native phoneme inventory, mismatched transcripts in two or more annotation languages can be used to infer the native phoneme inventory (with entropy depending on the distinctive feature inventory of the annotation languages).
@inproceedings{hasegawa2017mismatched,title={Mismatched crowdsourcing: Mining latent skills to acquire speech transcriptions},author={Hasegawa-Johnson, Mark and Jyothi, Preethi and Chen, Wenda and Do, Van Hai},booktitle={Proceedings of Asilomar},pages={1277--1281},year={2017},}
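The decoding of mismatched transcripts can be illustrated with a tiny noisy-channel example: given assumed per-phone misperception probabilities and several independent nonsense transcripts of the same word, choose the native-language candidate maximizing the summed channel log-likelihood. The channel, lexicon, and equal-length assumption are all simplifications.

```python
# Toy noisy-channel decode of mismatched transcripts.
import math

# Hypothetical channel: P(annotator writes Latin letter | native phone).
CHANNEL = {
    "dh": {"d": 0.6, "t": 0.3, "th": 0.1},
    "a":  {"a": 0.8, "u": 0.2},
    "l":  {"l": 0.9, "r": 0.1},
    "r":  {"r": 0.7, "l": 0.3},
}
# Candidate native words as phone sequences (hypothetical lexicon).
CANDIDATES = {"dhal": ["dh", "a", "l"], "dhar": ["dh", "a", "r"]}

def log_likelihood(phones, transcript):
    """Channel log-likelihood of one annotator's letter sequence."""
    if len(transcript) != len(phones):
        return float("-inf")                       # toy model: require equal length
    return sum(math.log(CHANNEL[p].get(c, 1e-4)) for p, c in zip(phones, transcript))

def decode(transcripts):
    """Combine several independent mismatched transcripts of the same word."""
    score = lambda w: sum(log_likelihood(CANDIDATES[w], t) for t in transcripts)
    return max(CANDIDATES, key=score)

print(decode([["d", "a", "l"], ["t", "u", "l"], ["d", "a", "r"]]))  # decoded native word
```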
ICASSP
Low-resource grapheme-to-phoneme conversion using recurrent neural networks
Grapheme-to-phoneme (G2P) conversion is an important problem for many speech and language processing applications. G2P models are particularly useful for low-resource languages that do not have well-developed pronunciation lexicons. Prominent G2P paradigms are based on initial alignments between grapheme and phoneme sequences. In this work, we devise new alignment strategies that work effectively with recurrent neural network based models when only a small number of pronunciations are available to train the models. In a small data setting, we build G2P models for Pashto, Tagalog and Lithuanian that significantly outperform a joint sequence model and a baseline recurrent neural network based model, giving up to 14% and 9% relative reductions in phone and word error rates when trained on a dataset of 250 words.
@inproceedings{jyothi2017low,title={Low-resource grapheme-to-phoneme conversion using recurrent neural networks},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of ICASSP},pages={5030--5034},year={2017},}
2016
TASL
ASR for under-resourced languages from probabilistic transcription
Mark A
Hasegawa-Johnson, Preethi
Jyothi, Daniel
McCloy, and
8 more authors
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016
In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: A probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semi-supervised learning technique, in which a cross-lingual ASR first labels unlabeled speech, and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which nonspeakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which nonspeakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR was trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. Both EEG distribution coding and text-derived phone language models were shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing.
@article{hasegawa2016asr,title={ASR for under-resourced languages from probabilistic transcription},author={Hasegawa-Johnson, Mark A and Jyothi, Preethi and McCloy, Daniel and Mirbagheri, Majid and Di Liberto, Giovanni M and Das, Amit and Ekin, Bradley and Liu, Chunxi and Manohar, Vimal and Tang, Hao and others},journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},volume={25},number={1},pages={50--63},year={2016},}
COLING Workshop
Clustering-based phonetic projection in mismatched crowdsourcing channels for low-resourced ASR
Wenda
Chen, Mark
Hasegawa-Johnson, Nancy
Chen, and
2 more authors
In Proceedings of the Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), COLING, 2016
Acquiring labeled speech for low-resource languages is a difficult task in the absence of native speakers of the language. One solution to this problem involves collecting speech transcriptions from crowd workers who are foreign or non-native speakers of a given target language. From these mismatched transcriptions, one can derive probabilistic phone transcriptions that are defined over the set of all target language phones using a noisy channel model. This paper extends prior work on deriving probabilistic transcriptions (PTs) from mismatched transcriptions by 1) modelling multilingual channels and 2) introducing a clustering-based phonetic mapping technique to improve the quality of PTs. Mismatched crowdsourcing for multilingual channels has certain properties of projection mapping, e.g., it can be interpreted as a clustering based on singular value decomposition of the segment alignments. To this end, we explore the use of distinctive feature weights, lexical tone confusions, and a two-step clustering algorithm to learn projections of phoneme segments from mismatched multilingual transcriber languages to the target language. We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers. We observe a 5–9% relative reduction in phone error rate for the predicted Cantonese phone transcriptions using our proposed techniques compared with the previous PT method.
@inproceedings{chen2016clustering,title={Clustering-based phonetic projection in mismatched crowdsourcing channels for low-resourced ASR},author={Chen, Wenda and Hasegawa-Johnson, Mark and Chen, Nancy and Jyothi, Preethi and Varshney, Lav},booktitle={Proceedings of the Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), COLING},pages={133--141},year={2016},}
Interspeech
Automatic Speech Recognition Using Probabilistic Transcriptions in Swahili, Amharic, and Dinka.
Amit
Das, Preethi
Jyothi, and Mark
Hasegawa-Johnson
In this study, we develop automatic speech recognition systems for three sub-Saharan African languages using probabilistic transcriptions collected from crowd workers who neither speak nor have any familiarity with the African languages. The three African languages in consideration are Swahili, Amharic, and Dinka. There is a language mismatch in this scenario. More specifically, utterances spoken in African languages were transcribed by crowd workers who were mostly native speakers of English. Due to this, such transcriptions are highly prone to label inaccuracies. First, we use a recently introduced technique called mismatched crowdsourcing which processes the raw crowd transcriptions to confusion networks. Next, we adapt both multilingual hidden Markov models (HMM) and deep neural network (DNN) models using the probabilistic transcriptions of the African languages. Finally, we report the results using both deterministic and probabilistic phone error rates (PER). Automatic speech recognition systems developed using this recipe are particularly useful for low resource languages where there is limited access to linguistic resources and/or transcribers in the native language.
@inproceedings{das2016automatic,title={Automatic Speech Recognition Using Probabilistic Transcriptions in Swahili, Amharic, and Dinka.},author={Das, Amit and Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of Interspeech},pages={3524--3528},year={2016},}
SLTU
Performance improvement of probabilistic transcriptions with language-specific constraints
Xiang
Kong, Preethi
Jyothi, and Mark
Hasegawa-Johnson
This article describes a method for reducing the error rate of probabilistic phone-based transcriptions resulting from mismatched crowdsourcing by using language-specific constraints to post-process the phone sequence. In the scenario under consideration, there are no native-language transcriptions or pronunciation dictionary available in the test language; instead, available resources include non-native transcriptions, a rudimentary rule-based G2P, and a list of orthographic word forms mined from the internet. The proposed solution post-processes non-native transcriptions by converting them to test-language orthography, composing with test-language word forms, then converting back to a phone string. Experiments demonstrate that the phone error rate of the transcription is reduced, using this method, by 22% on an independent evaluation-test dataset.
@article{kong2016performance,title={Performance improvement of probabilistic transcriptions with language-specific constraints},author={Kong, Xiang and Jyothi, Preethi and Hasegawa-Johnson, Mark},journal={Proceedings of SLTU Workshop},volume={81},pages={30--36},year={2016},}
CSL
Articulatory feature-based pronunciation modeling
Karen
Livescu, Preethi
Jyothi, and Eric
Fosler-Lussier
Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phone-based approaches to speech recognition. An alternative approach, which has been studied by ourselves and others, is one based on sub-phonetic features rather than phones. In such an approach, a word’s pronunciation is represented as multiple streams of phonological features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or may be more abstract categories such as manner and place.
This article reviews our work on a particular type of articulatory feature-based pronunciation model. The model allows for asynchrony between features, as well as per-feature substitutions, making it more natural to account for many pronunciation changes that are difficult to handle with phone-based models. Such models can be efficiently represented as dynamic Bayesian networks. The feature-based models improve significantly over phone-based counterparts in terms of frame perplexity and lexical access accuracy. The remainder of the article discusses related work and future directions.
@article{livescu2016articulatory,title={Articulatory feature-based pronunciation modeling},author={Livescu, Karen and Jyothi, Preethi and Fosler-Lussier, Eric},journal={Computer Speech \& Language},volume={36},pages={212--232},year={2016},}
ICASSP
Adapting ASR for under-resourced languages using mismatched transcriptions
*Chunxi
Liu, *Preethi
Jyothi, Hao
Tang, and
5 more authors
In Proceedings of ICASSP, 2016. This work received a Speech and Language Processing Student Paper Award.
Mismatched transcriptions of speech in a target language refers to transcriptions provided by people unfamiliar with the language, using English letter sequences. In this work, we demonstrate the value of such transcriptions in building an ASR system for the target language. For different languages, we use less than an hour of mismatched transcriptions to successfully adapt baseline multilingual models built with no access to native transcriptions in the target language. The adapted models provide up to 25% relative improvement in phone error rates on an unseen evaluation set.
@inproceedings{liu2016adapting,title={Adapting ASR for under-resourced languages using mismatched transcriptions},author={Liu, Chunxi and Jyothi, Preethi and Tang, Hao and Manohar, Vimal and Sloan, Rose and Kekona, Tyler and Hasegawa-Johnson, Mark and Khudanpur, Sanjeev},booktitle={Proceedings of ICASSP},pages={5840--5844},year={2016},}
ITA
Language coverage for mismatched crowdsourcing
Lav R
Varshney, Preethi
Jyothi, and Mark
Hasegawa-Johnson
In 2016 Information Theory and Applications Workshop (ITA), 2016
Developing automatic speech recognition technologies requires transcribed speech so as to learn the mapping from sound to text. It is traditionally assumed that transcribers need to be native speakers of the language being transcribed. Mismatched crowdsourcing is the transcription of speech by crowd workers who do not speak the language. Given there are phonological similarities among different human languages, mismatched crowdsourcing does provide noisy data that can be aggregated to yield reliable labels. Here we discuss phonological properties of different languages in a coding-theoretic framework, and how nonnative phoneme misperception can be modeled as a noisy communication channel. We show the results of experiments demonstrating the efficacy of this information theory inspired modeling approach, having native English speakers and native Mandarin speakers transcribe Cantonese speech. Finally we discuss how crowd workers whose native language background gives them the highest probability of faithful transcription can be found by solving a weighted set cover problem.
@inproceedings{varshney2016language,title={Language coverage for mismatched crowdsourcing},author={Varshney, Lav R and Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={2016 Information Theory and Applications Workshop (ITA)},pages={1--9},year={2016},}
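The closing observation about selecting crowd workers by native language can be illustrated with the standard greedy approximation to weighted set cover shown below; the phone inventories and annotation costs are hypothetical.

```python
# Greedy weighted set cover: choose annotator native languages whose phone
# inventories jointly cover the target language's phones at low total cost.
# Inventories and costs are hypothetical, for illustration only.
target_phones = {"p", "t", "k", "ts", "ng", "aa", "ii", "tone_high", "tone_low"}
inventories = {
    "english":  ({"p", "t", "k", "aa", "ii", "ng"}, 1.0),
    "mandarin": ({"p", "t", "k", "ts", "tone_high", "tone_low"}, 1.2),
    "hindi":    ({"p", "t", "k", "aa", "ii"}, 0.8),
}

def greedy_cover(universe, sets_with_costs):
    uncovered, chosen = set(universe), []
    while uncovered:
        # Pick the language with the best cost per newly covered phone.
        lang, (phones, cost) = min(
            ((l, sc) for l, sc in sets_with_costs.items() if sc[0] & uncovered),
            key=lambda item: item[1][1] / len(item[1][0] & uncovered),
        )
        chosen.append(lang)
        uncovered -= phones
    return chosen

print(greedy_cover(target_phones, inventories))   # chosen annotator native languages
```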
2015
Interspeech
Transcribing continuous speech using mismatched crowdsourcing.
Mismatched crowdsourcing derives speech transcriptions using crowd workers unfamiliar with the language being spoken. This approach has been demonstrated for isolated word transcription tasks, but never yet for continuous speech. In this work, we demonstrate mismatched crowdsourcing of continuous speech with a word error rate of under 45% in a large-vocabulary transcription task of short speech segments. In order to scale mismatched crowdsourcing to continuous speech, we propose a number of new WFST pruning techniques based on explicitly low-entropy models of the acoustic similarities among orthographic symbols as understood within a transcriber community. We also provide an information-theoretic analysis and estimate the amount of information lost in transcription by the mismatched crowd workers to be under 5 bits.
@inproceedings{jyothi2015transcribing,title={Transcribing continuous speech using mismatched crowdsourcing.},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of Interspeech},pages={2774--2778},year={2015},}
Interspeech
Improved Hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge.
In this work, we present a new large-vocabulary, broadcast news ASR system for Hindi. Since Hindi has a largely phonemic orthography, the pronunciation model was automatically generated from text. We experiment with several variants of this model and study the effect of incorporating word boundary information with these models. We also experiment with knowledge-based adaptations to the language model in Hindi, derived in an unsupervised manner, that lead to small improvements in word error rate (WER). Our experiments were conducted on a new corpus assembled from publicly-available Hindi news broadcasts. We evaluate our techniques on an open-vocabulary task and obtain competitive WERs on an unseen test set.
@inproceedings{jyothi2015improved,title={Improved hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge.},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of Interspeech},pages={3164--3168},year={2015},}
AAAI
Acquiring speech transcriptions using mismatched crowdsourcing
Transcribed speech is a critical resource for building statistical speech recognition systems. Recent work has looked towards soliciting transcriptions for large speech corpora from native speakers of the language using crowdsourcing techniques. However, native speakers of the target language may not be readily available for crowdsourcing. We examine the following question: can humans unfamiliar with the target language help transcribe? We follow an information-theoretic approach to this problem: (1) We learn the characteristics of a noisy channel that models the transcribers’ systematic perception biases. (2) We use an error-correcting code, specifically a repetition code, to encode the inputs to this channel, in conjunction with a maximum-likelihood decoding rule. To demonstrate the feasibility of this approach, we transcribe isolated Hindi words with the help of Mechanical Turk workers unfamiliar with Hindi. We successfully recover Hindi words with an accuracy of over 85% (and 94% in a 4-best list) using a 15-fold repetition code. We also estimate the conditional entropy of the input to this channel (Hindi words) given the channel output (transcripts from crowdsourced workers) to be less than 2 bits; this serves as a theoretical estimate of the average number of bits of auxiliary information required for errorless recovery.
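A minimal sketch of the decoding rule described above: treat each worker's transcript as an independent observation of the same word through a noisy channel and pick the maximum-likelihood word. The channel probabilities and the two-word candidate lexicon are invented for illustration; the paper's channel model is learned from data.

import math

# Hypothetical channel model: P(crowd transcript | Hindi word).
CHANNEL = {
    "kitaab": {"kitab": 0.5, "keytab": 0.3, "kitap": 0.2},
    "kadam":  {"kadam": 0.6, "cuddum": 0.3, "kitap": 0.1},
}

def ml_decode(transcripts, channel, floor=1e-6):
    """Pick the word maximizing the product of per-transcript likelihoods."""
    def loglik(word):
        return sum(math.log(channel[word].get(t, floor)) for t in transcripts)
    return max(channel, key=loglik)

# Five independent transcriptions of one spoken word (a 5-fold repetition code).
observed = ["kitab", "keytab", "kitab", "kitap", "kitab"]
print(ml_decode(observed, CHANNEL))   # -> 'kitaab'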
@inproceedings{jyothi2015acquiring,title={Acquiring speech transcriptions using mismatched crowdsourcing},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of AAAI},volume={29},number={1},year={2015},}
LabPhon
Models of dataset size, question design, and cross-language speech perception for speech crowdsourcing applications
Mark
Hasegawa-Johnson, Jennifer
Cole, Preethi
Jyothi, and
1 more author
Transcribers make mistakes. Workers recruited in a crowdsourcing marketplace, because of their varying levels of commitment and education, make more mistakes than workers in a controlled laboratory setting. Methods for compensating transcriber mistakes are desirable because, with such methods available, crowdsourcing has the potential to significantly increase the scale of experiments in laboratory phonology. This paper provides a brief tutorial on statistical learning theory, introducing the relationship between dataset size and estimation error, then presents a theoretical description and preliminary results for two new methods that control labeler error in laboratory phonology experiments. First, we discuss the method of crowdsourcing over error-correcting codes. In the error-correcting-code method, each difficult labeling task is first factored, by the experimenter, into the product of several easy labeling tasks (typically binary). Factoring increases the total number of tasks; nevertheless, it results in faster completion and higher accuracy, because workers unable to perform the difficult task may be able to meaningfully contribute to the solution of each easy task. Second, we discuss the use of explicit mathematical models of the errors made by a worker in the crowd. In particular, we introduce the method of mismatched crowdsourcing, in which workers transcribe a language they do not understand, and an explicit mathematical model of second-language phoneme perception is used to learn and then compensate their transcription errors. Though introduced as technologies that increase the scale of phonology experiments, both methods have implications beyond increased scale. The method of easy questions permits us to probe the perception, by untrained listeners, of complicated phonological models; examples are provided from the prosody of English and Hindi. The method of mismatched crowdsourcing permits us to probe, in more detail than ever before, the perception of phonetic categories by listeners with a different phonological system.
@article{hasegawa2015models,title={Models of dataset size, question design, and cross-language speech perception for speech crowdsourcing applications},author={Hasegawa-Johnson, Mark and Cole, Jennifer and Jyothi, Preethi and Varshney, Lav R},journal={Laboratory Phonology},volume={6},number={3-4},pages={381--431},year={2015},}
ICPhS
Prosodic and structural correlates of perceived prominence in Russian and Hindi.
Tatiana
Luchkina, Jennifer S
Cole, Preethi
Jyothi, and
1 more author
Perceived prominence in Russian and Hindi, free word order languages, can be communicated prosodically and structurally, via word order. Paired production and perception experiments with native speakers show that discourse-prominent constituents are marked acoustically, via a perceptible increase in vowel intensity and f0, and structurally, via a change in word order that places a word in a designated position in a sentence or clause.
@inproceedings{luchkina2015prosodic,title={Prosodic and structural correlates of perceived prominence in Russian and Hindi.},author={Luchkina, Tatiana and Cole, Jennifer S and Jyothi, Preethi and Puri, Vandana},booktitle={Proceedings of ICPhS},year={2015}}
2014
SIGMORPHON
Revisiting word neighborhoods for speech recognition
Preethi
Jyothi and Karen
Livescu
In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, 2014
Word neighborhoods have been suggested but not thoroughly explored as an explanatory variable for errors in automatic speech recognition (ASR). We revisit the definition of word neighborhoods, propose new measures using a fine-grained articulatory representation of word pronunciations, and consider new neighbor weighting functions. We analyze the significance of our measures as predictors of errors in an isolated-word ASR system and a continuous-word ASR system. We find that our measures are significantly better predictors of ASR errors than previously used neighborhood density measures.
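As a concrete (and simplified) instance of a weighted neighborhood measure, the sketch below scores each neighbor by Levenshtein distance over phone strings, exponentially downweights distant neighbors, and scales by frequency. The mini-lexicon and the weighting function are assumptions; the paper's measures use a finer-grained articulatory representation of pronunciations.

import math

# Hypothetical mini-lexicon: word -> (phone string, corpus frequency).
LEXICON = {
    "cat":  (["k", "ae", "t"], 120),
    "bat":  (["b", "ae", "t"], 80),
    "cast": (["k", "ae", "s", "t"], 30),
    "dog":  (["d", "ao", "g"], 150),
}

def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def weighted_density(word, lexicon, scale=1.0):
    """Sum of exp(-scale * distance) * frequency over all other words."""
    phones, _ = lexicon[word]
    return sum(
        math.exp(-scale * edit_distance(phones, p)) * freq
        for w, (p, freq) in lexicon.items() if w != word
    )

print(round(weighted_density("cat", LEXICON), 2))   # closer, frequent neighbors dominate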
@inproceedings{jyothi2014revisiting,title={Revisiting word neighborhoods for speech recognition},author={Jyothi, Preethi and Livescu, Karen},booktitle={Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM},pages={1--9},year={2014},}
SpeechProsody
An investigation of prosody in Hindi narrative speech
Preethi
Jyothi, Jennifer
Cole, Mark
Hasegawa-Johnson, and
1 more author
This paper investigates how prosodic elements such as prominences and prosodic boundaries in Hindi are perceived. We approach this using data from three sources: (i) native speakers of Hindi without any linguistic expertise, (ii) a linguistically trained expert in Hindi prosody, and finally (iii) classifiers trained on English for automatic prominence and boundary detection. We use speech from a corpus of Hindi narrative speech for our experiments. Our results indicate that non-expert transcribers do not have a consistent notion of prosodic prominences. However, they show considerable agreement regarding the placement of prosodic boundaries. Also, relative to the non-expert transcribers, there is higher agreement between the expert transcriber and the automatically derived labels for prominence (and prosodic boundaries); this suggests the possibility of using classifiers for automatic prediction of these prosodic events in Hindi.
@inproceedings{jyothi2014investigation,title={An investigation of prosody in Hindi narrative speech},author={Jyothi, Preethi and Cole, Jennifer and Hasegawa-Johnson, Mark and Puri, Vandana},booktitle={Proceedings of Speech Prosody},volume={7},pages={623--627},year={2014}}
2013
Interspeech
Discriminative training of WFST factors with application to pronunciation modeling.
Preethi
Jyothi, Eric
Fosler-Lussier, and Karen
Livescu
One of the most popular speech recognition architectures consists of multiple components (like the acoustic, pronunciation and language models) that are modeled as weighted finite state transducer (WFST) factors in a cascade. These factor WFSTs are typically trained in isolation and combined efficiently for decoding. Recent work has explored jointly estimating parameters for these models using considerable amounts of training data. We propose an alternative approach to selectively train factor WFSTs in such an architecture, while still leveraging information from the entire cascade. This technique allows us to effectively estimate parameters of a factor WFST using relatively small amounts of data, if the factor is small. Our approach involves an online training paradigm for linear models adapted for discriminatively training one or more WFSTs in a cascade. We apply this method to train a pronunciation model for recognition on conversational speech, resulting in significant improvements in recognition performance over the baseline model.
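The training loop can be caricatured with a plain averaged perceptron over arc-level features: decode with the current weights, and when the best path disagrees with the reference, add the reference features and subtract the hypothesis features. The decode and featurize stand-ins below replace the actual WFST composition and shortest-path machinery, so treat this as a sketch of the learning rule only.

from collections import defaultdict

def averaged_perceptron(data, decode, featurize, epochs=3):
    w, w_sum, t = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for x, y_ref in data:
            y_hat = decode(x, w)                   # best output under current weights
            if y_hat != y_ref:                     # standard perceptron update
                for f, v in featurize(x, y_ref).items():
                    w[f] += v
                for f, v in featurize(x, y_hat).items():
                    w[f] -= v
            for f, v in w.items():                 # accumulate for averaging
                w_sum[f] += v
            t += 1
    return {f: v / t for f, v in w_sum.items()}    # averaged weights

# Toy usage: two candidate outputs per input, one indicator feature per pair.
CANDS = {"x1": ["bad", "good"]}
def featurize(x, y):
    return {(x, y): 1.0}
def decode(x, w):
    return max(CANDS[x], key=lambda y: sum(w[f] * v for f, v in featurize(x, y).items()))

print(averaged_perceptron([("x1", "good")], decode, featurize))
# ('x1', 'good') ends up with positive weight, ('x1', 'bad') negative.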
@inproceedings{jyothi2013discriminative,title={Discriminative training of WFST factors with application to pronunciation modeling.},author={Jyothi, Preethi and Fosler-Lussier, Eric and Livescu, Karen},booktitle={Proceedings of Interspeech},pages={1961--1965},year={2013},}
IEEE
Conditional random fields in speech, audio, and language processing
Eric
Fosler-Lussier, Yanzhang
He, Preethi
Jyothi, and
1 more author
Conditional random fields (CRFs) are probabilistic sequence models that have been applied in the last decade to a number of applications in audio, speech, and language processing. In this paper, we provide a tutorial overview of CRF technologies, pointing to other resources for more in-depth discussion; in particular, we describe the common linear-chain model as well as a number of common extensions within the CRF family of models. An overview of the mathematical techniques used in training and evaluating these models is also provided, as well as a discussion of the relationships with other probabilistic models. Finally, we survey recent work in speech, audio, and language processing to show how the same CRF technology can be deployed in different scenarios.
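For readers wanting the core object behind the tutorial: a linear-chain CRF assigns a label sequence a probability proportional to the exponentiated sum of weighted emission and transition features. The tiny scorer below computes such an unnormalized path score; the feature templates and weights are placeholders, not anything from the paper.

# Unnormalized linear-chain CRF path score (toy features and weights).
WEIGHTS = {("obs=loud", "lab=stressed"): 1.2, ("prev=stressed", "lab=unstressed"): 0.7}

def path_score(observations, labels, weights):
    score, prev = 0.0, None
    for obs, lab in zip(observations, labels):
        score += weights.get((f"obs={obs}", f"lab={lab}"), 0.0)        # emission feature
        if prev is not None:
            score += weights.get((f"prev={prev}", f"lab={lab}"), 0.0)  # transition feature
        prev = lab
    return score    # P(labels | obs) = exp(score) / Z(obs), Z omitted here

print(round(path_score(["loud", "soft"], ["stressed", "unstressed"], WEIGHTS), 2))  # 1.9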
@article{fosler2013conditional,title={Conditional random fields in speech, audio, and language processing},author={Fosler-Lussier, Eric and He, Yanzhang and Jyothi, Preethi and Prabhavalkar, Rohit},journal={Proceedings of the IEEE},volume={101},number={5},pages={1054--1075},year={2013},}
2012
Interspeech
Discriminatively learning factorized finite state pronunciation models from dynamic Bayesian networks.
Preethi
Jyothi, Eric
Fosler-Lussier, and Karen
Livescu
In Proceedings of Interspeech, 2012. This work received a Best Student Paper Award.
This paper describes an approach to efficiently construct, and discriminatively train, a weighted finite state transducer (WFST) representation for an articulatory feature-based model of pronunciation. This model is originally implemented as a dynamic Bayesian network (DBN). The work is motivated by a desire to (1) incorporate such a pronunciation model in WFST-based recognizers, and to (2) learn discriminative models that are more general than the DBNs. The approach is quite general, though here we show how it applies to a specific model. We use the conditional independence assumptions imposed by the DBN to efficiently convert it into a sequence of WFSTs (factor FSTs) which, when composed, yield the same model as the DBN. We then introduce a linear model of the arc weights of the factor FSTs and discriminatively learn its weights using the averaged perceptron algorithm. We demonstrate the approach using a lexical access task in which we recognize a word given its surface realization. Our experimental results using a phonetically transcribed subset of the Switchboard corpus show that the discriminatively learned model performs significantly better than the original DBN.
@inproceedings{jyothi2012discriminatively,title={Discriminatively learning factorized finite state pronunciation models from dynamic Bayesian networks.},author={Jyothi, Preethi and Fosler-Lussier, Eric and Livescu, Karen},booktitle={Proceedings of Interspeech},pages={1063--1066},year={2012}}
NAACL Workshop
Large-scale discriminative language model reranking for voice-search
Preethi
Jyothi, Leif
Johnson, Ciprian
Chelba, and
1 more author
In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model?, 2012
We present a distributed framework for large-scale discriminative language models that can be integrated within a large vocabulary continuous speech recognition (LVCSR) system using lattice rescoring. We intentionally use a weakened acoustic model in a baseline LVCSR system to generate candidate hypotheses for voice-search data; this allows us to utilize large amounts of unsupervised data to train our models. We propose an efficient and scalable MapReduce framework that uses a perceptron-style distributed training strategy to handle these large amounts of data. We report small but significant improvements in recognition accuracies on a standard voice-search data set using our discriminative reranking model. We also provide an analysis of the various parameters of our models, including model size, types of features, and size of partitions in the MapReduce framework, with the help of supporting experiments.
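To illustrate the perceptron-style distributed training strategy, here is an in-process caricature: each "mapper" runs perceptron updates for reranking on its shard of N-best lists, and the "reducer" averages the per-shard weight vectors. This mirrors one common variant of the distributed perceptron, not the actual MapReduce implementation; the shards and feature dictionaries are toy data.

from collections import defaultdict

def shard_perceptron(shard):
    """shard: list of (nbest, oracle_index); each hypothesis is a feature dict."""
    w = defaultdict(float)
    for nbest, oracle in shard:
        score = lambda feats: sum(w[f] * v for f, v in feats.items())
        best = max(range(len(nbest)), key=lambda i: score(nbest[i]))
        if best != oracle:                      # perceptron update toward the oracle
            for f, v in nbest[oracle].items():
                w[f] += v
            for f, v in nbest[best].items():
                w[f] -= v
    return w

def reduce_average(shard_weights):
    total = defaultdict(float)
    for w in shard_weights:
        for f, v in w.items():
            total[f] += v
    return {f: v / len(shard_weights) for f, v in total.items()}

# Two tiny shards of 2-best lists with unigram-count features.
shards = [
    [([{"play": 1}, {"pray": 1}], 0)],
    [([{"pray": 1}, {"play": 1}], 1)],
]
print(reduce_average([shard_perceptron(s) for s in shards]))
# 'play' receives positive weight, 'pray' negative.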
@inproceedings{jyothi2012large,title={Large-scale discriminative language model reranking for voice-search},author={Jyothi, Preethi and Johnson, Leif and Chelba, Ciprian and Strope, Brian},booktitle={Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model?},pages={41--49},year={2012},}
ICASSP
Distributed discriminative language models for Google voice-search
Preethi
Jyothi, Leif
Johnson, Ciprian
Chelba, and
1 more author
This paper considers large-scale linear discriminative language models trained using a distributed perceptron algorithm. The algorithm is implemented efficiently using a MapReduce/SSTable framework. This work also introduces the use of large amounts of unsupervised data (confidence filtered Google voice-search logs) in conjunction with a novel training procedure that regenerates word lattices for the given data with a weaker acoustic model than the one used to generate the unsupervised transcriptions for the logged data. We observe small but statistically significant improvements in recognition performance after reranking N-best lists of a standard Google voice-search data set.
@inproceedings{jyothi2012distributed,title={Distributed discriminative language models for Google voice-search},author={Jyothi, Preethi and Johnson, Leif and Chelba, Ciprian and Strope, Brian},booktitle={Proceedings of ICASSP},pages={5017--5020},year={2012}}
2011
ICASSP
Lexical access experiments with context-dependent articulatory feature-based models
Preethi
Jyothi, Karen
Livescu, and Eric
Fosler-Lussier
We address the problem of pronunciation variation in conversational speech with a context-dependent articulatory feature-based model. The model is an extension of previous work using dynamic Bayesian networks, which allow for easy factorization of a state into multiple variables representing the articulatory features. We build context-dependent decision trees for the articulatory feature distributions, which are incorporated into the dynamic Bayesian networks, and experiment with different sets of context variables. We evaluate our models on a lexical access task using a phonetically transcribed subset of the Switchboard corpus. We find that our models outperform a context-dependent phonetic baseline.
@inproceedings{jyothi2011lexical,title={Lexical access experiments with context-dependent articulatory feature-based models},author={Jyothi, Preethi and Livescu, Karen and Fosler-Lussier, Eric},booktitle={Proceedings of ICASSP},pages={4900--4903},year={2011},}
2010
Interspeech
Discriminative language modeling using simulated ASR errors.
In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.
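A toy version of the confusion-generation step: use a phone confusion table to expand a word's pronunciation into acoustically confusable lexicon entries, the kind of confusable material the discriminative language model is trained against. The confusion table and lexicon are invented, and the real system builds such confusable word graphs with WFST composition rather than explicit enumeration.

import itertools

# Hypothetical phone confusions and a tiny pronunciation lexicon.
CONFUSIONS = {"p": {"p", "b"}, "ih": {"ih", "iy"}, "t": {"t", "d"}}
LEXICON = {("p", "ih", "t"): "pit", ("b", "ih", "t"): "bit",
           ("b", "iy", "d"): "bead", ("p", "ih", "d"): "pid"}

def confusable_words(phones, confusions, lexicon):
    """All lexicon words reachable by per-phone confusions of `phones`."""
    options = [sorted(confusions.get(p, {p})) for p in phones]
    return {lexicon[c] for c in itertools.product(*options) if c in lexicon}

print(confusable_words(("p", "ih", "t"), CONFUSIONS, LEXICON))
# {'pit', 'bit', 'bead', 'pid'} (set order may vary)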
@inproceedings{jyothi2010discriminative,title={Discriminative language modeling using simulated ASR errors.},author={Jyothi, Preethi and Fosler-Lussier, Eric},booktitle={Proceedings of Interspeech},pages={1049--1052},year={2010},}
NAACL
Investigations into the Crandem approach to word recognition
Rohit
Prabhavalkar, Preethi
Jyothi, William
Hartmann, and
2 more authors
We suggest improvements to a previously proposed framework for integrating Conditional Random Fields and Hidden Markov Models, dubbed a Crandem system (2009). The previous authors’ work suggested that local label posteriors derived from the CRF were too low-entropy for use in word-level automatic speech recognition. As an alternative to the log posterior representation used in their system, we explore frame-level representations derived from the CRF feature functions. We also describe a weight normalization transformation that leads to increased entropy of the CRF posteriors. We report significant gains over the previous Crandem system on the Wall Street Journal word recognition task.
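One simple way to picture a "weight normalization transformation that leads to increased entropy of the CRF posteriors" is temperature-style scaling of the log-posteriors before renormalizing; the sketch below shows that flattening effect. This is an illustrative stand-in, and the exact transformation used in the paper may differ.

import math

def renormalize(logps, scale):
    scaled = [lp * scale for lp in logps]
    z = math.log(sum(math.exp(s) for s in scaled))
    return [math.exp(s - z) for s in scaled]

def entropy(ps):
    return -sum(p * math.log(p, 2) for p in ps if p > 0)

posteriors = [0.97, 0.02, 0.01]                 # an over-confident posterior
logps = [math.log(p) for p in posteriors]
for scale in (1.0, 0.5, 0.25):
    print(scale, round(entropy(renormalize(logps, scale)), 3))
# Smaller scales yield flatter posteriors with higher entropy.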
@inproceedings{prabhavalkar2010investigations,title={Investigations into the Crandem approach to word recognition},author={Prabhavalkar, Rohit and Jyothi, Preethi and Hartmann, William and Morris, Jeremy and Fosler-Lussier, Eric},booktitle={Proceedings of NAACL},pages={725--728},year={2010},}
2009
Interspeech
A comparison of audio-free speech recognition error prediction methods.
Predicting possible speech recognition errors can be invaluable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to facilitate a comparison between two approaches to predicting confusable words: examining recognition errors on the training set to learn phone confusions and utilizing distances between the phonetic acoustic models for the prediction task. We also expand the framework to deal with continuous word recognition; we can accurately predict 60% of the misrecognized sentences (with an average words-per-sentence count of 15) and a little over 70% of the total number of errors on unseen test data, where no acoustic information related to the test data is utilized.
@inproceedings{jyothi2009comparison,title={A comparison of audio-free speech recognition error prediction methods.},author={Jyothi, Preethi and Fosler-Lussier, Eric},booktitle={Proceedings of Interspeech},pages={1211--1214},year={2009}}