Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique, AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with the state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.
@inproceedings{gupta-etal-2025-amps,title={{AMPS}: {ASR} with Multimodal Paraphrase Supervision},author={Gupta, Abhishek and Parulekar, Amruta and Chattopadhyay, Sameep and Jyothi, Preethi},booktitle={Proceedings of NAACL},year={2025},pages={404--413},}
COLING
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving
Bhavani Shankar P S V N, Preethi Jyothi, and Pushpak Bhattacharyya
Code-switching is a widely prevalent linguistic phenomenon in multilingual societies like India. Building speech-to-text models for code-switched speech is challenging due to the limited availability of datasets. In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture CoSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules (which are more widely available for many languages). Speech and ASR text representations are fused using an aligned interleaving scheme and are fed further as input to a pretrained MT module; the whole pipeline is then trained end-to-end for spoken translation using synthetically created ST data. We also release a new evaluation benchmark for code-switched Bengali-English, Hindi-English, Marathi-English and Telugu-English speech to English text. CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
2024
NeurIPS
WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models
*Pavan Kalyan Tankala, *Piyush Pasi, Sahil Dharod, and 4 more authors
In Proceedings of NeurIPS (Datasets and Benchmarks Track), 2024
Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). Several state-of-the-art VLMs (e.g., CLIP, BLIP-2) have achieved near-perfect performance on widely used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. As a measure of out-of-distribution (OOD) generalization, prior works rely on zero-shot performance evaluated on one dataset (Flickr) using a VLM finetuned on another one (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models due to the high visual and linguistic similarity between the evaluation and finetuning datasets. To address this gap, we introduce WikiDO (drawn from Wikipedia Diversity Observatory), a novel cross-modal retrieval benchmark to assess the OOD generalization capabilities of pretrained VLMs. It consists of 380K newly scraped image-text pairs from Wikipedia with domain labels, along with a carefully curated, human-verified (a) in-distribution (ID) test set (3K) and (b) OOD test set (3K). The image-text pairs are very diverse in topics and geographical locations. We evaluate different VLMs of varying capacity on the WikiDO benchmark; BLIP-2 achieves zero-shot performance of R@1 ≈66% on the OOD test set, compared to ≈81% on COCO and ≈95% on Flickr. When fine-tuned on WikiDO, the R@1 improvement is at most ≈5% on OOD instances compared to ≈12% on ID instances. We probe the VLMs with varying finetuning objectives and datasets of varying sizes to identify what aids OOD generalization the most. Our results confirm that WikiDO offers a strong cross-modal benchmark for current VLMs, specifically for evaluating OOD generalization. Our benchmark is hosted as a competition at https://kaggle.com/competitions/wikido24 with public access to the dataset and code.
@inproceedings{tankala2024wikido,title={WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models},author={Tankala, Pavan Kalyan and Pasi, Piyush and Dharod, Sahil and Motiwala, Azeem and Jyothi, Preethi and Chaudhary, Aditi and Srinivasan, Krishna},booktitle={Proceedings of NeurIPS (Datasets and Benchmarks Track)},pages={140812--140827},year={2024}}
Interspeech
SALSA: Speedy ASR-LLM Synchronous Aggregation
Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, and 1 more author
In Proceedings of Interspeech, 2024
This work was nominated for a Best Student Paper Award
Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR model to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training-efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
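As a rough, hedged sketch of the kind of coupling described in the abstract (not the exact SALSA architecture), the snippet below projects the last hidden state of an ASR decoder into a hypothetical LLM's hidden space and adds it to the LLM decoder state; the dimensions, module names and fusion point are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SynchronousCoupling(nn.Module):
    """Toy sketch: project the last ASR decoder state and add it to the
    LLM decoder's hidden state before its LM head (illustrative only)."""

    def __init__(self, asr_dim=512, llm_dim=4096):
        super().__init__()
        # the lightweight, trainable part: a single projection layer
        self.proj = nn.Linear(asr_dim, llm_dim)

    def forward(self, asr_last_state, llm_hidden):
        # asr_last_state: (batch, asr_dim) from the ASR decoder's final layer
        # llm_hidden:     (batch, llm_dim) from the (frozen) LLM decoder
        return llm_hidden + self.proj(asr_last_state)

# usage with random tensors standing in for real decoder states
fuse = SynchronousCoupling()
fused = fuse(torch.randn(2, 512), torch.randn(2, 4096))
print(fused.shape)  # torch.Size([2, 4096])
```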
@inproceedings{mittal2024salsa,title={SALSA: Speedy ASR-LLM Synchronous Aggregation},author={Mittal, Ashish and Prabhu, Darshan and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={3485--3489},year={2024},}
Interspeech
Emotion arithmetic: Emotional speech synthesis via weight space interpolation
Pavan Kalyan, Preeti Rao, Preethi Jyothi, and 1 more author
While the idea of task arithmetic has been shown to be useful for steering the behaviour of neural models on NLP and vision tasks, it has not yet been used for speech. Moreover, the tasks studied have been restricted to text classification and generation, and image classification. We extend the idea of task vectors to emotional speech synthesis in this work. We build emotion vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning for a given emotion. These emotion vectors can be modified or combined through arithmetic operations such as negation and addition, with the hope of steering the behaviour of the resulting model accordingly in the generation of emotional speech. We also show that the emotion vector can achieve the desired transfer of emotion to a speaker not seen during training.
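The weight-space arithmetic described above can be illustrated with a short sketch; the toy two-parameter "model" and the 0.5 scaling factor are hypothetical placeholders standing in for a full TTS model's state dict.

```python
import torch

def emotion_vector(pretrained_state, finetuned_state):
    """Emotion vector = finetuned weights - pretrained weights (per parameter)."""
    return {k: finetuned_state[k] - pretrained_state[k] for k in pretrained_state}

def apply_vectors(pretrained_state, vectors, scales):
    """Add scaled emotion vectors (a negative scale acts as negation) to the base weights."""
    new_state = {k: v.clone() for k, v in pretrained_state.items()}
    for vec, scale in zip(vectors, scales):
        for k in new_state:
            new_state[k] += scale * vec[k]
    return new_state

# toy example with a two-parameter "model"
base = {"w": torch.zeros(3), "b": torch.zeros(1)}
happy = {"w": torch.ones(3), "b": torch.ones(1)}    # pretend: finetuned on "happy" speech
vec = emotion_vector(base, happy)
steered = apply_vectors(base, [vec], scales=[0.5])  # half-strength "happy" model
print(steered["w"])                                  # all entries are 0.5
```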
@inproceedings{kalyan2024emotion,title={Emotion arithmetic: Emotional speech synthesis via weight space interpolation},author={Kalyan, Pavan and Rao, Preeti and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proc. Interspeech 2024},pages={1805--1809},year={2024},}
Interspeech
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
Darshan Prabhu, Yifan Peng, Preethi Jyothi, and 1 more author
Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been re-examined, altering the convolution module itself has been far less explored. Towards this, we introduce MULTI-CONVFORMER, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter-efficient. We empirically compare our approach with the Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
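As a hedged sketch of the general idea of multiple convolution kernels combined with gating (not the exact Multi-Convformer module), the code below runs parallel depthwise convolutions of different kernel sizes and mixes them with a learned sigmoid gate; the dimensions and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelConv(nn.Module):
    """Toy sketch: depthwise convolutions with several kernel sizes in
    parallel, combined with a learned per-frame sigmoid gate."""

    def __init__(self, dim=256, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.gate = nn.Linear(dim * len(kernel_sizes), len(kernel_sizes))

    def forward(self, x):                                   # x: (batch, time, dim)
        x = x.transpose(1, 2)                               # -> (batch, dim, time)
        outs = [conv(x) for conv in self.convs]
        stacked = torch.stack(outs, dim=-1)                 # (batch, dim, time, K)
        gate_in = torch.cat(outs, dim=1).transpose(1, 2)    # (batch, time, dim*K)
        gates = torch.sigmoid(self.gate(gate_in))           # (batch, time, K)
        gates = gates.unsqueeze(1)                          # (batch, 1, time, K)
        return (stacked * gates).sum(-1).transpose(1, 2)    # back to (batch, time, dim)

y = MultiKernelConv()(torch.randn(2, 100, 256))
print(y.shape)  # torch.Size([2, 100, 256])
```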
@inproceedings{prabhu2024multi,title={{M}ulti-{C}onvformer: Extending Conformer with Multiple Convolution Kernels},author={Prabhu, Darshan and Peng, Yifan and Jyothi, Preethi and Watanabe, Shinji},booktitle={Proceedings of Interspeech 2024},pages={232--236},year={2024},}
Interspeech
Improving Self-supervised Pre-training using Accent-Specific Codebooks
Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, and 2 more authors
Speech accents present a serious challenge to the performance of state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems. Even with self-supervised learning and pre-training of ASR models, accent invariance is seldom achieved. In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. These learnable codebooks enable the model to capture accent-specific information during pre-training, which is further refined during ASR finetuning. On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches on both seen and unseen English accents, with up to 9% relative reduction in word error rate (WER).
@inproceedings{prabhu2024improving,title={Improving Self-supervised Pre-training using Accent-Specific Codebooks},author={Prabhu, Darshan and Gupta, Abhishek and Nitsure, Omkar and Jyothi, Preethi and Ganapathy, Sriram},booktitle={Proc. Interspeech 2024},pages={2310--2314},year={2024}}
ACL
In-context mixing (ICM): Code-mixed prompts for multilingual LLMs
Bhavani Shankar, Preethi Jyothi, and Pushpak Bhattacharyya
We introduce a simple and effective prompting technique called in-context mixing (ICM) for effective in-context learning (ICL) with multilingual large language models (MLLMs). With ICM, we modify the few-shot examples within ICL prompts to be intra-sententially code-mixed by randomly swapping content words in the target languages with their English translations. We observe that ICM prompts yield superior performance in NLP tasks such as disfluency correction, grammar error correction and text simplification that demand a close correspondence between the input and output sequences. Significant improvements are observed mainly for low-resource languages that are under-represented during the pretraining and finetuning of MLLMs. We present an extensive set of experiments to analyze when ICM is effective and what design choices contribute towards its effectiveness. ICM works consistently and significantly better than other prompting techniques across models of varying capacity such as mT0-XXL, BloomZ and GPT4.
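A minimal sketch of how an ICM-style code-mixed few-shot example could be produced is shown below; the tiny Hindi-English lexicon and the swap probability are hypothetical placeholders, not the paper's actual setup.

```python
import random

# hypothetical mini-lexicon mapping Hindi content words to English translations;
# in practice this could come from a bilingual dictionary or an MT system
HI_TO_EN = {"किताब": "book", "खरीदी": "bought", "आज": "today"}

def code_mix(sentence, swap_prob=0.5, lexicon=HI_TO_EN):
    """Randomly replace content words found in the lexicon with their English translations."""
    out = []
    for tok in sentence.split():
        if tok in lexicon and random.random() < swap_prob:
            out.append(lexicon[tok])
        else:
            out.append(tok)
    return " ".join(out)

demo = "मैंने आज किताब खरीदी"
print(code_mix(demo))  # one possible output: "मैंने today किताब bought" (depends on the random draws)
```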
@inproceedings{shankar2024context,title={In-context mixing ({ICM}): Code-mixed prompts for multilingual {LLM}s},author={Shankar, Bhavani and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of ACL},pages={4162--4176},year={2024},}
ACL
Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection
*Barah Fazili, *Ashish Agrawal, and Preethi Jyothi
Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote crosslingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using this teacher to label LLM generations and employ a set of simple data selection strategies that use the teacher’s label probabilities. Our data selection strategies help us identify a representative subset of diverse generations that help boost zero-shot accuracies while being efficient, in comparison to using all the LLM generations (without any subset selection). We also highlight other important design choices that affect cross-lingual performance, such as the use of translations of source data and which labels are best to use for the LLM generations. We observe significant performance gains across sentiment analysis and natural language inference tasks (of up to 7.13 absolute points, and 1.5 absolute points on average) across a number of target languages (Hindi, Marathi, Urdu, Swahili) and domains.
@inproceedings{fazili2024boosting,title={Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection},author={Fazili, Barah and Agrawal, Ashish and Jyothi, Preethi},booktitle={Proceedings of ACL (Findings)},pages={13406--13422},year={2024}}
ACL
Part-of-speech Tagging for Extremely Low-resource Indian Languages
Sanjeev Kumar, Preethi Jyothi, and Pushpak Bhattacharyya
Modern natural language processing (NLP) systems thrive when given access to large datasets. However, a large fraction of the world’s languages are not privy to such benefits due to sparse documentation and inadequate digital representation. This is especially true for Indian regional languages. As a first step towards expanding the reach of NLP technologies to extremely low-resource Indian languages, we present a new parallel part-of-speech (POS) evaluation dataset for Angika, Magahi, Bhojpuri and Hindi. Angika, Magahi and Bhojpuri, along with the more well-known Hindi, are all languages spoken in the Indian states of Bihar, Jharkhand and West Bengal. Ours is notably the first NLP resource, even for a shallow NLP task like POS-tagging, for Angika. We establish POS-tagging baselines using state-of-the-art multilingual pretrained language models (PLMs) finetuned on Hindi data, and show zero-shot evaluations on the other three languages. While all four languages use the same Devanagari script, pretrained tokenizers underperform in zero-shot settings on the three low-resource languages. We propose a simple look-back fix to address the tokenization challenge, yielding F1-score improvements of up to 8% on Angika, and show how it comes very close to an oracle setting when the underlying Hindi word is known (and can be accurately tokenized).
@inproceedings{kumar2024part,title={Part-of-speech Tagging for Extremely Low-resource Indian Languages},author={Kumar, Sanjeev and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of ACL (Findings)},pages={14422--14431},year={2024}}
ACL
DIMSIM: Distilled Multilingual Critics for Indic Text Simplification
Sneha Mondal, Ritika Ritika, Ashish Agrawal, and 2 more authors
Self-correction techniques have recently emerged as a promising framework to improve the quality of responses generated by large language models (LLMs). Few-shot prompted LLMs act as critics to produce feedback for an input, which is further fed to a refiner (also an LLM) to produce an output. However, these critique-refine steps require multiple expensive LLM calls. To circumvent this large inference cost, we borrow inspiration from prior work on knowledge distillation and propose the use of critique distillation to train critic models. These are smaller sequence-to-sequence models that are trained on input-critique pairs generated by an LLM. We focus on the problem of text simplification for three Indian languages: Hindi, Bengali and Marathi. This task is a good fit for self-correction style techniques; it also has not been systematically explored for Indian languages before. We train two separate critics that focus on lexical and structural complexity, and show that this is surprisingly more effective than using an LLM directly as a critic in both 0-shot and few-shot settings. We also show the benefits of training multilingual critics, as opposed to monolingual critics. Extensive human evaluations show that on average, raters find 80% of DIMSIM’s output to be simple and easy to read.
@inproceedings{mondal2024dimsim,title={DIMSIM: Distilled Multilingual Critics for Indic Text Simplification},author={Mondal, Sneha and Ritika, Ritika and Agrawal, Ashish and Jyothi, Preethi and Raghuveer, Aravindan},booktitle={Proceedings of ACL (Findings)},pages={16093--16109},year={2024}}
EACL
Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning
Ashish* Agrawal, Barah* Fazili, and Preethi Jyothi
Popular benchmarks (e.g., XNLI) used to evaluate cross-lingual language understanding consist of parallel versions of English evaluation sets in multiple target languages created with the help of professional translators. When creating such parallel data, it is critical to ensure high-quality translations for all target languages for an accurate characterization of cross-lingual transfer. In this work, we find that translation inconsistencies do exist and, interestingly, they disproportionately impact low-resource languages in XNLI. To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text across multiple target languages; relatively large gaps are indicative of translation errors. We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances in these two languages and finding poor agreement with the original English labels these instances were supposed to inherit.
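The gap-based diagnostic described above can be sketched in a few lines; the model, dataset and metric names in the commented usage are hypothetical placeholders.

```python
def translation_gap(model, human_test, machine_test, metric):
    """Zero-shot performance gap between human- and machine-translated test sets;
    a relatively large gap for a language flags possible translation errors."""
    return metric(model, human_test) - metric(model, machine_test)

# hypothetical usage: `score` returns the accuracy of an English-finetuned model
# evaluated zero-shot on a target-language test set
# gaps = {lang: translation_gap(nli_model, xnli_human[lang], xnli_mt[lang], score)
#         for lang in ["hi", "ur", "sw", "fr"]}
# languages with unusually large gaps are candidates for manual re-annotation
```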
@inproceedings{agrawal2024translation,title={Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning},author={Agrawal, Ashish and Fazili, Barah and Jyothi, Preethi},booktitle={Proceedings of EACL},pages={319--329},year={2024}}
EACL
STORiCo: Storytelling TTS for Hindi with Character Voice Modulation
Pavan Tankala, Preethi Jyothi, Preeti Rao, and 1 more author
We present a new Hindi text-to-speech (TTS) dataset and demonstrate its utility for the expressive synthesis of children’s audio stories. The dataset comprises narration by a single female speaker who modifies her voice to produce different story characters. Annotations for dialogue identification, character labelling, and character attribution are provided, all of which are expected to facilitate the learning of character voices and speaking styles. Experiments are conducted using different versions of the annotated dataset that enable training a multi-speaker TTS model on the single-speaker data. Subjective tests show that the multi-speaker model improves expressiveness and character voice consistency compared to the baseline single-speaker TTS. With the multi-speaker model, objective evaluations show comparable word error rates, better speaker voice consistency, and higher correlations with ground-truth emotion attributes. We release a new 16.8-hour storytelling speech dataset in Hindi and propose effective solutions for expressive TTS with narrator voice modulation and character voice consistency.
@inproceedings{tankala2024storico,title={STORiCo: Storytelling TTS for Hindi with Character Voice Modulation},author={Tankala, Pavan and Jyothi, Preethi and Rao, Preeti and Bhattacharyya, Pushpak},booktitle={Proceedings of EACL},pages={426--431},year={2024}}
2023
ICLR
In-situ text-only adaptation of speech models with low-overhead speech imputations
Ashish Mittal, Sunita Sarawagi, and Preethi Jyothi
Fast and accurate adaptation of automatic speech recognition (ASR) systems using only text data in the target domain is a problem of long-standing practical relevance. Text-only adaptation was easy in traditional cascaded ASR systems with completely decoupled acoustic and language models. Recently, the RNN-Transducer (RNN-T) has emerged as a default ASR model because of its high accuracy, low latency, and capability of supporting streaming input. However, text-only adaptation of the RNN-T model is significantly more challenging due to its tight integration of acoustic and language models and end-to-end training. Existing recent approaches for text-only adaptation of RNN-Ts either entail significant modification to the network or introduce high latency during decoding. We propose a new approach (TOLSTOI) that imputes speech representations internal to a baseline RNN-T, starting from text-only inputs, and performs in-situ adaptation that results in higher adaptation accuracy without any runtime overheads during decoding. Our imputation model is a function of the labeled data and trained parameters of the ASR model, and, as we show, is more effective in controlling catastrophic forgetting compared to existing methods. We establish the effectiveness of TOLSTOI using three target domains and two ASR models of varying complexity. We yield up to 35% relative reduction in word error rate with text-only adaptation while forgetting the least compared to existing adaptation approaches. Our method is easy to implement and can be harnessed on existing RNN-T models without requiring ASR model training from scratch.
@inproceedings{mittal2023situ,title={In-situ text-only adaptation of speech models with low-overhead speech imputations},author={Mittal, Ashish and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of ICLR},year={2023}}
ICASSP
Towards zero-shot code-switched speech recognition
Brian Yan, Matthew Wiesner, Ondřej Klejch, and 2 more authors
In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation. That is, each monolingual module has to simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages – not a trivial task. We propose to simplify each monolingual module by allowing them to transcribe all speech segments indiscriminately with a monolingual script (i.e. transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on the Mandarin-English SEAME test sets.
@inproceedings{yan2023towards,title={Towards zero-shot code-switched speech recognition},author={Yan, Brian and Wiesner, Matthew and Klejch, Ond{\v{r}}ej and Jyothi, Preethi and Watanabe, Shinji},booktitle={Proceedings of ICASSP},pages={1--5},year={2023},}
IJCAI
Temporally aligning long audio interviews with questions: a case study in multimodal data integration
Piyush Singh Pasi, Karthikeya Battepati, Preethi Jyothi, and 3 more authors
The problem of audio-to-text alignment has seen a significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from a questionnaire that is used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on the temporal ordering of sentences to learn speech embeddings that capture the semantics of the underlying spoken text. These learnt embeddings are used to retrieve the corresponding audio segment based on text queries at inference time. We empirically demonstrate the significant effectiveness (improvement in R-avg of about 3%) of our model over those obtained using text-based heuristics. We also show how noisy ASR, generated using state-of-the-art ASR models for Indian languages, yields better results when used in place of speech. INDENT, trained only on Hindi data, is able to cater to all languages supported by the (semantically) shared text space. We illustrate this empirically on 11 Indic languages.
@inproceedings{pasi2023temporally,title={Temporally aligning long audio interviews with questions: a case study in multimodal data integration},author={Pasi, Piyush Singh and Battepati, Karthikeya and Jyothi, Preethi and Ramakrishnan, Ganesh and Mahapatra, Tanmay and Singh, Manoj},booktitle={Proceedings of IJCAI},pages={6156--6164},year={2023},}
ACL
Improving pretraining techniques for code-switched NLP
*Richeek Das, *Sahasra Ranjan, Shreya Pathak, and 1 more author
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
This work received an Outstanding Paper Award
Pretrained models are a mainstay in modern NLP applications. Pretraining requires access to large volumes of unlabeled text. While monolingual text is readily available for many of the world’s languages, access to large quantities of code-switched text (i.e., text with tokens of multiple languages interspersed within a sentence) is much more scarce. Given this resource constraint, the question of how pretraining using limited amounts of code-switched text could be altered to improve performance for code-switched NLP becomes important to tackle. In this paper, we explore different masked language modeling (MLM) pretraining techniques for code-switched text that are cognizant of language boundaries prior to masking. The language identity of the tokens can either come from human annotators, trained language classifiers, or simple relative frequency-based estimates. We also present an MLM variant by introducing a residual connection from an earlier layer in the pretrained model that uniformly boosts performance on downstream tasks. Experiments on two downstream tasks, Question Answering (QA) and Sentiment Analysis (SA), involving four code-switched language pairs (Hindi-English, Spanish-English, Tamil-English, Malayalam-English) yield relative improvements of up to 5.8 and 2.7 F1 scores on QA (Hindi-English) and SA (Tamil-English), respectively, compared to standard pretraining techniques. To understand our task improvements better, we use a series of probes to study what additional information is encoded by our pretraining techniques and also introduce an auxiliary loss function that explicitly models language identification to further aid the residual MLM variants.
@inproceedings{das2023improving,title={Improving pretraining techniques for code-switched {NLP}},author={Das, Richeek and Ranjan, Sahasra and Pathak, Shreya and Jyothi, Preethi},booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},pages={1176--1191},year={2023}}
ACL
Zero-shot cross-lingual transfer with learned projections using unlabeled target-language data
Ujan Deb, Ridayesh Parab, and Preethi Jyothi
Adapters have emerged as a parameter-efficient Transformer-based framework for cross-lingual transfer by inserting lightweight language-specific modules (language adapters) and task-specific modules (task adapters) within pretrained multilingual models. Zero-shot transfer is enabled by pairing the language adapter in the target language with an appropriate task adapter in a source language. If our target languages are known a priori, we explore how zero-shot transfer can be further improved within the adapter framework by utilizing unlabeled text during task-specific finetuning. We construct language-specific subspaces using standard linear algebra constructs and selectively project source-language representations into the target language subspace during task-specific finetuning using two schemes. Our experiments on three cross-lingual tasks, Named Entity Recognition (NER), Question Answering (QA) and Natural Language Inference (NLI), yield consistent benefits compared to adapter baselines over a wide variety of target languages, with up to 11% relative improvement in NER, 2% relative improvement in QA and 5% relative improvement in NLI.
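A minimal sketch of one way to build a language-specific subspace from unlabeled target-language representations and project source-language representations into it is given below; the SVD-based construction, rank and dimensions are illustrative assumptions, not necessarily the paper's exact schemes.

```python
import torch

def language_subspace(target_reprs, rank=64):
    """Top singular directions of centered, unlabeled target-language
    representations define an (assumed) target-language subspace."""
    centered = target_reprs - target_reprs.mean(0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:rank]                          # (rank, hidden_dim)

def project(source_reprs, basis):
    """Project source-language representations onto the target subspace."""
    return (source_reprs @ basis.T) @ basis

# toy usage with random vectors standing in for encoder outputs
tgt = torch.randn(1000, 768)                  # unlabeled target-language sentences
src = torch.randn(32, 768)                    # a source-language training batch
basis = language_subspace(tgt)
src_in_tgt = project(src, basis)
print(src_in_tgt.shape)                       # torch.Size([32, 768])
```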
@inproceedings{deb2023zero,title={Zero-shot cross-lingual transfer with learned projections using unlabeled target-language data},author={Deb, Ujan and Parab, Ridayesh and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={449--457},year={2023}}
ACL
DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation
*Suraj Kothawade, *Anmol Mekala, D Chandra Sekhara Hetha Havya, and 4 more authors
State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and Fair Targeted Subset Selection) that uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e. it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that compared to other speech selection methods, DITTO is 3-5 times as label-efficient for its improvements on the Indic-TTS and L2 datasets.
@inproceedings{kothawade2023ditto,title={{DITTO}: Data-efficient and Fair Targeted Subset Selection for {ASR} Accent Adaptation},author={Kothawade, Suraj and Mekala, Anmol and Havya, D Chandra Sekhara Hetha and Kothyari, Mayank and Iyer, Rishabh and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={5810--5822},year={2023}}
Interspeech
Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS
Tankala Pavan Kalyan, Preeti Rao, Preethi Jyothi, and 1 more author
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice across the narrator and story characters. To address these challenges, we present a new TTS corpus of English audio stories for children with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of the salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as being spoken by the narrator or a character, which is subsequently used to condition the TTS output. Experiments show our new TTS system is superior in expressiveness in both A-B preference and MOS testing compared to reading-style TTS and single-speaker TTS.
@inproceedings{pavan2023narrator,title={Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS},author={Pavan Kalyan, Tankala and Rao, Preeti and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of Interspeech},pages={4808--4812},year={2023},}
Interspeech
Improving RNN-Transducers with Acoustic LookAhead
Vinit S Unni, Ashish Mittal, Preethi Jyothi, and 1 more author
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech-to-text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to strong LM biasing, which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LOOKAHEAD, which makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
@inproceedings{unni2023improving,title={Improving RNN-Transducers with Acoustic LookAhead},author={Unni, Vinit S and Mittal, Ashish and Jyothi, Preethi and Sarawagi, Sunita},booktitle={Proceedings of Interspeech},pages={4419--4423},year={2023},}
Interspeech
Unsupervised Code-switched Text Generation from Parallel Text
Jie Chi, Brian Lu, Jason Eisner, and 3 more authors
There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. It is expensive and difficult to collect real transcribed spoken CS data due to the challenges of finding and identifying CS data in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require the existence of CS data during training, or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR.
@inproceedings{chi2023unsupervised,title={Unsupervised Code-switched Text Generation from Parallel Text},author={Chi, Jie and Lu, Brian and Eisner, Jason and Bell, Peter and Jyothi, Preethi and Ali, Ahmed M},booktitle={Proc. Interspeech 2023},pages={1419--1423},year={2023},}
EMNLP
Accented Speech Recognition With Accent-specific Codebooks
Darshan Prabhu, Preethi Jyothi, Sriram Ganapathy, and 1 more author
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems. Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR. In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. These learnable codebooks capture accent-specific information and are integrated within the ASR encoder layers. The model is trained on accented English speech, while the test data also contains accents that were not seen during training. On the Mozilla Common Voice multi-accented dataset, we show that our proposed approach yields significant performance gains not only on the seen English accents (up to 37% relative improvement in word error rate) but also on the unseen accents (up to 5% relative improvement in WER). Further, we illustrate benefits for a zero-shot transfer setup on the L2-Arctic dataset. We also compare the performance with other approaches based on accent adversarial training.
@inproceedings{prabhu2023accented,title={Accented Speech Recognition With Accent-specific Codebooks},author={Prabhu, Darshan and Jyothi, Preethi and Ganapathy, Sriram and Unni, Vinit},booktitle={Proceedings of EMNLP},pages={7175--7188},year={2023}}
EMNLP
Speech-enriched memory for inference-time adaptation of ASR models to word dictionaries
Ashish Mittal, Sunita Sarawagi, Preethi Jyothi, and 2 more authors
Despite the impressive performance of ASR models on mainstream benchmarks, their performance on rare words is unsatisfactory. In enterprise settings, often a focused list of entities (such as locations, names, etc.) is available, which can be used to adapt the model to the terminology of specific domains. In this paper, we present a novel inference algorithm that improves the prediction of state-of-the-art ASR models using nearest-neighbor-based matching on an inference-time word list. We consider both the Transducer architecture that is useful in the streaming setting, and state-of-the-art encoder-decoder models such as Whisper. In our approach, a list of rare entities is indexed in a memory by synthesizing speech for each entry, and then storing the internal acoustic and language model states obtained from the best possible alignment on the ASR model. The memory is organized as a trie which we harness to perform a stateful lookup during inference. A key property of our extension is that we prevent spurious matches by restricting to only word-level matches. In our experiments on publicly available datasets and private benchmarks, we show that our method is effective in significantly improving rare word recognition.
@inproceedings{mittal2023speech,title={Speech-enriched memory for inference-time adaptation of asr models to word dictionaries},author={Mittal, Ashish and Sarawagi, Sunita and Jyothi, Preethi and Saon, George and Kurata, Gakuto},booktitle={Proceedings of EMNLP},pages={14820--14835},year={2023}}
EMNLP
DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages
Vineet Bhat, Preethi Jyothi, and Pushpak Bhattacharyya
Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs, before subsequent processing by downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French. We provide extensive analysis of the results of state-of-the-art DC models across all four languages, obtaining F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to a 5.65-point increase in BLEU scores on average when used in conjunction with a state-of-the-art Machine Translation (MT) system.
@inproceedings{bhat2023disco,title={{DISCO}: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages},author={Bhat, Vineet and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of EMNLP (Findings)},pages={12833--12857},year={2023}}
ICLR (Workshop)
Surprisingly Simple Adapter Ensembling for Zero-shot Cross-lingual Sequence Tagging
Rohan Shah and Preethi Jyothi
Adapters are parameter-efficient modules added to pretrained Transformer models that facilitate cross-lingual transfer. Language adapters and task adapters can be separately trained and zero-shot transfer is enabled by pairing the language adapter in the target language with a task adapter trained on a high-resource language. However, there are many languages and dialects for which training language adapters would be difficult. In this work, we present a simple and efficient ensembling technique to transfer task knowledge to unseen target languages for which no language adapters exist. We compute a uniformly-weighted ensemble model over the top language adapters based on how well they perform on the test set of a high-resource language. We outperform the state-of-the-art model for this specific setting on named entity recognition (NER) and part-of-speech tagging (POS), across nine typologically diverse languages with relative performance improvements of up to 29% and 9% on NER and POS, respectively, on select target languages.
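A minimal sketch of uniformly averaging the parameters of the top-scoring language adapters (with hypothetical adapter names, scores and parameter shapes) is shown below.

```python
import torch

def ensemble_adapters(adapter_states, scores, top_k=3):
    """Uniformly average the parameters of the top-k language adapters,
    ranked by their score on a high-resource test set."""
    ranked = sorted(adapter_states, key=lambda name: scores[name], reverse=True)[:top_k]
    keys = adapter_states[ranked[0]].keys()
    return {k: torch.stack([adapter_states[n][k] for n in ranked]).mean(0) for k in keys}

# toy usage: three "adapters", each with one weight matrix, scored on English dev data
adapters = {name: {"down.weight": torch.randn(48, 768)} for name in ["en", "hi", "ar"]}
scores = {"en": 0.91, "hi": 0.78, "ar": 0.74}
merged = ensemble_adapters(adapters, scores, top_k=2)
print(merged["down.weight"].shape)  # torch.Size([48, 768])
```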
@inproceedings{shahsurprisingly,title={Surprisingly Simple Adapter Ensembling for Zero-shot Cross-lingual Sequence Tagging},author={Shah, Rohan and Jyothi, Preethi},booktitle={Practical ML for Developing Countries Workshop, ICLR 2023},year={2023},}
2022
ACL
Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding
Soumya Chatterjee, Sunita Sarawagi, and Preethi Jyothi
Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation, where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve consistent drops in alignment error rates. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU, specifically around the constrained positions.
@inproceedings{chatterjee2022accurate,title={Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding},author={Chatterjee, Soumya and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={6675--6689},year={2022}}
COLING
Aligning multilingual embeddings for improved code-switched natural language understanding
Barah Fazili and Preethi Jyothi
Multilingual pretrained models, while effective on monolingual data, need additional training to work well with code-switched text. In this work, we present a novel idea of training multilingual models with alignment objectives using parallel text so as to explicitly align word representations with the same underlying semantics across languages. Such an explicit alignment step has a positive downstream effect and improves performance on multiple code-switched NLP tasks. We explore two alignment strategies and report improvements of up to 7.32%, 0.76% and 1.9% on Hindi-English Sentiment Analysis, Named Entity Recognition and Question Answering tasks compared to a competitive baseline model.
@inproceedings{fazili2022aligning,title={Aligning multilingual embeddings for improved code-switched natural language understanding},author={Fazili, Barah and Jyothi, Preethi},booktitle={Proceedings of COLING},pages={4268--4273},year={2022},}
COLING
Zero-shot disfluency detection for Indian languages
Rohit Kundu, Preethi Jyothi, and Pushpak Bhattacharyya
Disfluencies that appear in the transcriptions from automatic speech recognition systems tend to impair the performance of downstream NLP tasks. Disfluency correction models can help alleviate this problem. However, the unavailability of labeled data in low-resource languages impairs progress. We propose using a pretrained multilingual model, finetuned only on English disfluencies, for zero-shot disfluency detection in Indian languages. We present a detailed pipeline to synthetically generate disfluent text and create evaluation datasets for four Indian languages: Bengali, Hindi, Malayalam, and Marathi. Even in the zero-shot setting, we obtain F1 scores of 75 and higher on five disfluency types across all four languages. We also show the utility of synthetically generated disfluencies by evaluating on real disfluent text in Bengali, Hindi, and Marathi. Finetuning the multilingual model on additional synthetic Hindi disfluent text nearly doubles the number of exact matches and yields a 20-point boost in F1 scores when evaluated on real Hindi disfluent text, compared to training with only English disfluent text.
@inproceedings{kundu2022zero,title={Zero-shot disfluency detection for Indian languages},author={Kundu, Rohit and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of COLING},pages={4442--4454},year={2022},}
EMNLP
CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation
Sneha Mondal, Shreya Pathak, Preethi Jyothi, and 2 more authors
Code-switching has seen growing interest in recent years as an important multilingual NLP phenomenon. Generating code-switched text for data augmentation has been sufficiently well-explored. However, there is no prior work on generating code-switched text with fine-grained control on the degree of code-switching and the lexical choices used to convey formality. We present CoCoa, an encoder-decoder translation model that converts monolingual Hindi text to Hindi-English code-switched text with both encoder-side and decoder-side interventions to achieve fine-grained controllable generation. CoCoa can be invoked at test-time to synthesize code-switched text that is simultaneously faithful to syntactic and lexical attributes relevant to code-switching. CoCoa outputs were subjected to rigorous subjective and objective evaluations. Human evaluations establish that our outputs are of superior quality while being faithful to desired attributes. We show significantly improved BLEU scores when compared with human-generated code-switched references. Compared to competitive baselines, we show 10% reduction in perplexity on a language modeling task and also demonstrate clear improvements on a downstream code-switched sentiment analysis task.
@inproceedings{mondal2022cocoa,title={CoCoa: An Encoder-Decoder Model for Controllable Code-switched Generation},author={Mondal, Sneha and Pathak, Shreya and Jyothi, Preethi and Raghuveer, Aravindan and others},booktitle={Proceedings of EMNLP},pages={2466--2479},year={2022},}
EMNLP
Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training
Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, and 2 more authors
Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of training data could mitigate this problem if the subset selected could achieve on-par performance with training with the entire dataset. Although there are many data subset selection (DSS) algorithms, direct application to the RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, as RNN-Ts tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-Ts. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves between 3x and 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
@inproceedings{mittal2022partitioned,title={Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training},author={Mittal, Ashish and Sivasubramanian, Durga and Iyer, Rishabh and Jyothi, Preethi and Ramakrishnan, Ganesh},booktitle={Proceedings of EMNLP (Findings)},pages={5999--6010},year={2022}}
ICASSP
Adaptive discounting of implicit language models in RNN-Transducers
Vinit Unni, Shreya Khare, Ashish Mittal, and 3 more authors
RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique, ADAPTLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. ADAPTLMD uses a two-pronged approach: 1. Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on its outputs. 2. Dynamically choose when to discount the implicit LM (ILM) based on the rarity of recently predicted tokens and the divergence between ILM and implicit acoustic model (IAM) scores. Comparing ADAPTLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
@inproceedings{unni2022adaptive,title={Adaptive discounting of implicit language models in rnn-transducers},author={Unni, Vinit and Khare, Shreya and Mittal, Ashish and Jyothi, Preethi and Sarawagi, Sunita and Bharadwaj, Samarth},booktitle={Proceedings of ICASSP},pages={8122--8126},year={2022}}
Interspeech
SPLICEOUT: A Simple and Efficient Audio Augmentation Method
Arjit Jain, Pranay Reddy Samala, Deepak Mittal, and 2 more authors
In Proceedings of Interspeech, 2022
Pseudocode in the arXiv version
Time masking has become a de facto augmentation technique for speech and audio tasks, including automatic speech recognition (ASR) and audio classification, most notably as a part of SpecAugment. In this work, we propose SpliceOut, a simple modification to time masking which makes it computationally more efficient. SpliceOut performs comparably to (and sometimes outperforms) SpecAugment on a wide variety of speech and audio tasks, including ASR for seven different languages using varying amounts of training data, as well as on speech translation, sound and music classification, thus establishing itself as a broadly applicable audio augmentation method. SpliceOut also provides additional gains when used in conjunction with other augmentation techniques. Apart from the fully-supervised setting, we also demonstrate that SpliceOut can complement unsupervised representation learning with performance gains in the semi-supervised and self-supervised settings.
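A hedged sketch of a SpliceOut-style augmentation, which removes random time spans from a spectrogram rather than zeroing them out, is shown below; the segment count and maximum width are illustrative assumptions, not the paper's exact settings.

```python
import torch

def splice_out(spec, num_segments=2, max_width=20):
    """Remove (rather than mask) a few random time spans from a
    (time, freq) spectrogram and concatenate what remains."""
    time = spec.size(0)
    keep = torch.ones(time, dtype=torch.bool)
    for _ in range(num_segments):
        width = torch.randint(1, max_width + 1, (1,)).item()
        start = torch.randint(0, max(1, time - width), (1,)).item()
        keep[start:start + width] = False
    return spec[keep]

spec = torch.randn(400, 80)        # e.g. 400 frames of 80-dim log-mel features
aug = splice_out(spec)
print(spec.shape, aug.shape)       # the augmented output is shorter along the time axis
```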
@inproceedings{jain2022spliceout,title={{SPLICEOUT}: A Simple and Efficient Audio Augmentation Method},author={Jain, Arjit and Samala, Pranay Reddy and Mittal, Deepak and Jyothi, Preethi and Singh, Maneesh},booktitle={Proceedings of Interspeech},pages={2678--2682},year={2022}}
Interspeech
Linguistically Informed Post-processing for ASR Error correction in Sanskrit.
Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, and 4 more authors
We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies and search space enrichment with linguistic information. More specifically, to address the challenges due to the high degree of out-of-vocabulary entries present in the language, we first use a subword-based language model and acoustic model to generate a search space. The search space, so obtained, is converted into a word-based search space and is further enriched with morphological and lexical information based on a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our proposed approach reports state-of-the-art results in Sanskrit ASR, with a 7.18 absolute point reduction in WER compared to the previous state of the art.
@inproceedings{kumar2022linguistically,title={Linguistically Informed Post-processing for ASR Error correction in Sanskrit.},author={Kumar, Rishabh and Adiga, Devaraja and Ranjan, Rishav and Krishna, Amrith and Ramakrishnan, Ganesh and Goyal, Pawan and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={2293--2297},year={2022}}
2021
EMNLP (Workshop)
The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding
Archiki Prasad, Mohammad Ali Rehan, Shreya Pathak, and 1 more author
In Proceedings of the 1st Workshop on Multilingual Representation Learning (MRL), 2021
This work received an Honorable Mention Award
While recent benchmarks have spurred a lot of new work on improving the generalization of pretrained multilingual language models on multilingual tasks, techniques to improve code-switched natural language understanding tasks have been far less explored. In this work, we propose the use of bilingual intermediate pretraining as a reliable technique to derive large and consistent performance gains on three different NLP tasks using code-switched text. We achieve substantial absolute improvements of 7.87%, 20.15%, and 10.99% on the mean accuracies and F1 scores over previous state-of-the-art systems for Hindi-English Natural Language Inference (NLI), Question Answering (QA) tasks, and Spanish-English Sentiment Analysis (SA), respectively. We show consistent performance gains on four different code-switched language pairs (Hindi-English, Spanish-English, Tamil-English and Malayalam-English) for SA. We also present a code-switched masked language modelling (MLM) pretraining technique that consistently benefits SA compared to standard MLM pretraining using real code-switched text.
@inproceedings{prasad2021effectiveness,title={The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding},author={Prasad, Archiki and Rehan, Mohammad Ali and Pathak, Shreya and Jyothi, Preethi},booktitle={Proceedings of the 1st Workshop on Multilingual Representation Learning ({MRL})},pages={176--190},year={2021}}
Interspeech
Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages
Anuj Diwan and Preethi Jyothi
In Proceedings of Interspeech, 2021
This work was nominated for a Best Student Paper Award
This work presents a seemingly simple but effective technique to improve low-resource ASR systems for phonetic languages. By identifying sets of acoustically similar graphemes in these languages, we first reduce the output alphabet of the ASR system using linguistically meaningful reductions and then reconstruct the original alphabet using a standalone module. We demonstrate that this lessens the burden and improves the performance of low-resource end-to-end ASR systems (because only reduced-alphabet predictions are needed) and that it is possible to design a very simple but effective reconstruction module that recovers sequences in the original alphabet from sequences in the reduced alphabet. We present a finite state transducer-based reconstruction module that operates on the 1-best ASR hypothesis in the reduced alphabet. We demonstrate the efficacy of our proposed technique using ASR systems for two Indian languages, Gujarati and Telugu. With access to only 10 hrs of speech data, we obtain relative WER reductions of up to 7% compared to systems that do not use any reduction.
@inproceedings{diwan2021reduce,title={Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages},author={Diwan, Anuj and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={3445--3449},year={2021}}
Interspeech
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration.
Shreya Khare, Ashish R Mittal, Anuj Diwan, and 3 more authors
Cross-lingual transfer of knowledge from high-resource languages to low-resource languages is an important research problem in automatic speech recognition (ASR). We propose a new strategy of transfer learning by pretraining using large amounts of speech in the high-resource language but with its text transliterated to the target low-resource language. This simple mapping of scripts explicitly encourages increased sharing between the output spaces of both languages and is surprisingly effective even when the high-resource and low-resource languages are from unrelated language families. The utility of our proposed technique is more evident in very low-resource scenarios, where better initializations are more beneficial. We evaluate our technique on a transformer ASR architecture and the state-of-the-art wav2vec 2.0 ASR architecture, with English as the high-resource language and six languages as low-resource targets. With access to 1 hour of target speech, we obtain relative WER reductions of up to 8.2% compared to existing transfer-learning approaches.
@inproceedings{khare2021low,title={Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration.},author={Khare, Shreya and Mittal, Ashish R and Diwan, Anuj and Sarawagi, Sunita and Jyothi, Preethi and Bharadwaj, Samarth},booktitle={Proceedings of Interspeech},pages={1529--1533},year={2021}}
Interspeech
Cross-Modal Learning for Audio-Visual Video Parsing
Jatin
Lamba, Jayaprakash
Akula, Rishabh
Dabral, and
3 more authors
In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries in terms of start and end times of such events. We show how AVVP can benefit from the following techniques geared towards effective cross-modal learning: (i) adversarial training and skip connections, (ii) global context-aware attention, and (iii) self-supervised pretraining using an audio-video grounding objective to obtain cross-modal audio-video representations. We present extensive experimental evaluations on the Look, Listen, and Parse (LLP) dataset and show that we outperform the state-of-the-art Hybrid Attention Network (HAN) on all five metrics proposed for AVVP. We also present several ablations to validate the effect of pretraining, global attention and adversarial training.
@inproceedings{lamba2021cross,title={Cross-Modal Learning for Audio-Visual Video Parsing},author={Lamba, Jatin and Akula, Jayaprakash and Dabral, Rishabh and Jyothi, Preethi and Ramakrishnan, Ganesh and others},booktitle={Proceedings of Interspeech},pages={1937--1941},year={2021}}
Interspeech
MUCS 2021: Multilingual and code-switching ASR challenges for low resource Indian languages
Anuj
Diwan, Rakesh
Vaideeswaran, Sanket
Shah, and
5 more authors
Recently, there has been increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of small amounts of labeled corpora in multiple languages. With multilingualism becoming common in today’s world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics, linguistic characteristics as well as the amount of data available and how these are carefully considered in building the ASR system. In this challenge, we would like to focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English. We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
@inproceedings{diwan2021mucs,title={{MUCS} 2021: Multilingual and code-switching {ASR} challenges for low resource Indian languages},author={Diwan, Anuj and Vaideeswaran, Rakesh and Shah, Sanket and Singh, Ankita and Raghavan, Srinivasa and Khare, Shreya and Unni, Vinit and others},booktitle={Proceedings of Interspeech},pages={2446--2450},year={2021}}
ACL
From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text
Ishan
Tarunesh, Syamantak
Kumar, and Preethi
Jyothi
Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enable the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.
@inproceedings{tarunesh2021machine,title={From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text},author={Tarunesh, Ishan and Kumar, Syamantak and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={3154--3169},year={2021}}
ACL
Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights
Devaraja
Adiga, Rishabh
Kumar, Amrith
Krishna, and
3 more authors
Automatic speech recognition (ASR) in Sanskrit is interesting, owing to the various linguistic peculiarities present in the language. The Sanskrit language is lexically productive, undergoes euphonic assimilation of phones at the word boundaries and exhibits variations in spelling conventions and in pronunciations. In this work, we propose the first large-scale study of automatic speech recognition (ASR) in Sanskrit, with an emphasis on the impact of unit selection in Sanskrit ASR. We release a 78-hour ASR dataset for Sanskrit, which faithfully captures several of the linguistic characteristics expressed by the language. We investigate the role of different acoustic model and language model units in ASR systems for Sanskrit. We also propose a new modelling unit, inspired by syllable-level unit selection, that captures character sequences from one vowel in the word to the next vowel. We also highlight the importance of choosing graphemic representations for Sanskrit and show the impact of this choice on word error rates (WER). Finally, we extend these insights from Sanskrit ASR for building ASR systems in two other Indic languages, Gujarati and Telugu. For both these languages, our experimental results show that the use of phonetic-based graphemic representations in ASR results in performance improvements as compared to ASR systems that use native scripts.
@inproceedings{adiga2021automatic,title={Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights},author={Adiga, Devaraja and Kumar, Rishabh and Krishna, Amrith and Jyothi, Preethi and Ramakrishnan, Ganesh and Goyal, Pawan},booktitle={Proceedings of ACL (Findings)},pages={5039--5050},year={2021}}
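The vowel-to-vowel modelling unit mentioned above can be illustrated on romanized text with the sketch below; this is not the paper's exact unit definition (which operates on Sanskrit orthography), only a rough approximation of units that run from one vowel in a word to the next.

```python
# Illustrative sketch of a vowel-to-vowel unit segmentation on romanized text.
# This only conveys the idea of units spanning from one vowel to the next vowel.
VOWELS = set("aeiou")

def vowel_to_vowel_units(word: str):
    idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not idx:
        return [word]
    units = [word[: idx[0] + 1]]                              # onset consonants + first vowel
    units += [word[i: j + 1] for i, j in zip(idx, idx[1:])]   # one vowel ... next vowel
    if idx[-1] < len(word) - 1:
        units[-1] += word[idx[-1] + 1:]                       # attach any trailing consonants
    return units

print(vowel_to_vowel_units("bharata"))  # ['bha', 'ara', 'ata']
```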
IJCAI
Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning.
Arjit
Jain, Pranay Reddy
Samala, Preethi
Jyothi, and
1 more author
Recent semi-supervised learning (SSL) methods are predominantly focused on multi-class classification tasks. Classification tasks allow for easy mixing of class labels during augmentation which does not trivially extend to structured outputs such as word sequences that appear in tasks like image captioning. Noisy Student Training is a recent SSL paradigm proposed for image classification that is an extension of self-training and teacher-student learning. In this work, we provide an in-depth analysis of the noisy student SSL framework for the task of image captioning and derive state-of-the-art results. The original algorithm relies on computationally expensive data augmentation steps that involve perturbing the raw images and computing features for each perturbed image. We show that, even in the absence of raw image augmentation, the use of simple model and feature perturbations to the input images for the student model is beneficial to SSL training. We also show how a paraphrase generator could be effectively used for label augmentation to improve the quality of pseudo labels and significantly improve performance. Our final results in the limited labeled data setting (1% of the MS-COCO labeled data) outperform previous state-of-the-art approaches by 2.5 BLEU-4 points and 11.5 CIDEr points.
@inproceedings{jain2021perturb,title={Perturb, Predict \& Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning.},author={Jain, Arjit and Samala, Pranay Reddy and Jyothi, Preethi and Mittal, Deepak},booktitle={Proceedings of IJCAI},pages={758--764},year={2021}}
NAACL Workshop
The effect of pretraining on extractive summarization for scientific documents
Yash
Gupta, Pawan Sasanka
Ammanamanchi, Shikha
Bordia, and
7 more authors
In Proceedings of the Second Workshop on Scholarly Document Processing, 2021
Large pretrained models have seen enormous success in extractive summarization tasks. In this work, we investigate the influence of pretraining on a BERT-based extractive summarization system for scientific documents. We derive significant performance improvements using an intermediate pretraining step that leverages existing summarization datasets and report state-of-the-art results on a recently released scientific summarization dataset, SciTLDR. We systematically analyze the intermediate pretraining step by varying the size and domain of the pretraining corpus, changing the length of the input sequence in the target task and varying target tasks. We also investigate how intermediate pretraining interacts with contextualized word embeddings trained on different domains.
@inproceedings{gupta2021effect,title={The effect of pretraining on extractive summarization for scientific documents},author={Gupta, Yash and Ammanamanchi, Pawan Sasanka and Bordia, Shikha and Manoharan, Arjun and Mittal, Deepak and Pasunuru, Ramakanth and Shrivastava, Manish and Singh, Maneesh and Bansal, Mohit and Jyothi, Preethi},booktitle={Proceedings of the Second Workshop on Scholarly Document Processing},pages={73--82},year={2021}}
SIGIR
Select, substitute, search: A new benchmark for knowledge-augmented visual question answering
Aman
Jain, Mayank
Kothyari, Vishwajeet
Kumar, and
3 more authors
Multimodal IR, spanning text corpus, knowledge graph and images, called outside knowledge visual question answering (OKVQA), is of much recent interest. However, the popular data set has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, some require OCR or are otherwise answerable from the image alone. To add to the above limitations, frequency-based guessing is very effective because of (unintended) widespread answer overlaps between the train and test folds. Overall, it is hard to determine when state-of-the-art systems exploit these weaknesses rather than really infer the answers, because they are opaque and their 'reasoning' process is uninterpretable. An equally important limitation is that the dataset is designed for the quantitative assessment only of the end-to-end answer retrieval task, with no provision for assessing the correct (semantic) interpretation of the input query. In response, we identify a key structural idiom in OKVQA, viz., S3 (select, substitute and search), and build a new data set and challenge around it. Specifically, the questioner identifies an entity in the image and asks a question involving that entity which can be answered only by consulting a knowledge graph or corpus passage mentioning the entity. Our challenge consists of (i) OKVQA_S3, a subset of OKVQA annotated based on the structural idiom and (ii) S3VQA, a new dataset built from scratch. We also present a neural but structurally transparent OKVQA system, S3, that explicitly addresses our challenge dataset, and outperforms recent competitive baselines.
@inproceedings{jain2021select,title={Select, substitute, search: A new benchmark for knowledge-augmented visual question answering},author={Jain, Aman and Kothyari, Mayank and Kumar, Vishwajeet and Jyothi, Preethi and Ramakrishnan, Ganesh and Chakrabarti, Soumen},booktitle={Proceedings of SIGIR},pages={2491--2498},year={2021},}
ICASSP
An investigation of end-to-end models for robust speech recognition
Archiki
Prasad, Preethi
Jyothi, and Rajbabu
Velmurugan
End-to-end models for robust automatic speech recognition (ASR) have not been sufficiently well-explored in prior work. With end-to-end models, one could choose to preprocess the input speech using speech enhancement techniques and train the model using enhanced speech. Another alternative is to pass the noisy speech as input and modify the model architecture to adapt to noisy speech. A systematic comparison of these two approaches for end-to-end robust ASR has not been attempted before. We address this gap and present a detailed comparison of speech enhancement-based techniques and three different model-based adaptation techniques covering data augmentation, multi-task learning, and adversarial learning for robust ASR. While adversarial learning is the best-performing technique on certain noise types, it comes at the cost of degrading clean speech WER. On other relatively stationary noise types, a new speech enhancement technique outperformed all the model-based adaptation techniques. This suggests that knowledge of the underlying noise type can meaningfully inform the choice of adaptation technique.
@inproceedings{prasad2021investigation,title={An investigation of end-to-end models for robust speech recognition},author={Prasad, Archiki and Jyothi, Preethi and Velmurugan, Rajbabu},booktitle={Proceedings of ICASSP},pages={6893--6897},year={2021}}
ICASSP
Error-driven fixed-budget asr personalization for accented speakers
Abhijeet
Awasthi, Aman
Kansal, Sunita
Sarawagi, and
1 more author
We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker-specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker’s utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speaker-specific data to learn phoneme-level error models which help us select such sentences. We show that the speaker’s utterances on the sentences selected using our error model indeed have larger error rates when compared to the speaker’s utterances on randomly selected sentences. We find that fine-tuning the ASR model on the sentence utterances selected with the help of error models yields higher WER improvements in comparison to fine-tuning on an equal number of randomly selected sentence utterances. Thus, our method provides an efficient way of collecting speaker utterances under budget constraints for personalizing ASR models.
@inproceedings{awasthi2021error,title={Error-driven fixed-budget asr personalization for accented speakers},author={Awasthi, Abhijeet and Kansal, Aman and Sarawagi, Sunita and Jyothi, Preethi},booktitle={Proceedings of ICASSP},pages={7033--7037},year={2021},}
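The sentence-selection step can be sketched as follows: score each candidate sentence by its expected number of phoneme errors under a per-phoneme error model estimated from the speaker's small recorded sample, and keep the highest-scoring sentences within the budget. The error probabilities and phoneme sequences below are hypothetical.

```python
# Sketch of error-driven sentence selection under a fixed recording budget.
# Hypothetical per-phoneme error probabilities learned from a tiny speaker sample.
phoneme_error_prob = {"TH": 0.42, "V": 0.31, "AA": 0.08, "K": 0.05, "AH": 0.04}

# Hypothetical candidate sentences with their phoneme sequences.
candidates = {
    "the vase is on the table": ["DH", "AH", "V", "AA", "Z", "TH"],
    "call me back tomorrow":    ["K", "AO", "L", "M", "IY", "K"],
}

def expected_errors(phones):
    """Expected number of misrecognized phonemes (0.2 default for unseen phonemes)."""
    return sum(phoneme_error_prob.get(p, 0.2) for p in phones)

budget = 1  # number of sentences the speaker will record
selected = sorted(candidates, key=lambda s: expected_errors(candidates[s]), reverse=True)[:budget]
print(selected)  # sentences predicted to be hardest for the current ASR model
```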
ICASSP
Collaborative learning to generate audio-video jointly
Vinod K
Kurmi, Vipul
Bajaj, Badri N
Patro, and
3 more authors
There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for both audio and video, has not been sufficiently well-explored. Towards this, we propose a method that demonstrates that we are able to generate naturalistic samples of video and audio data by the joint correlated generation of audio and video modalities. The proposed method uses multiple discriminators to ensure that the audio, video, and the joint output are also indistinguishable from real-world samples. We present a dataset for this task and show that we are able to generate realistic samples. This method is validated using various standard metrics such as Inception Score, Fréchet Inception Distance (FID) and through human evaluation.
@inproceedings{kurmi2021collaborative,title={Collaborative learning to generate audio-video jointly},author={Kurmi, Vinod K and Bajaj, Vipul and Patro, Badri N and Venkatesh, KS and Namboodiri, Vinay P and Jyothi, Preethi},booktitle={Proceedings of ICASSP},pages={4180--4184},year={2021}}
EACL
Disfluency correction using unsupervised and semi-supervised learning
Nikhil
Saini, Drumil
Trivedi, Shreya
Khare, and
4 more authors
Spoken language is different from the written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This paper introduces a disfluency correction model that translates disfluent to fluent text by drawing inspiration from recent encoder-decoder unsupervised style-transfer models for text. We also show considerable benefits in performance when utilizing a small sample of 500 parallel disfluent-fluent sentences in a semi-supervised way. Our unsupervised approach achieves a BLEU score of 79.39 on the Switchboard corpus test set, with further improvement to a BLEU score of 85.28 with semi-supervision. Both are comparable to two competitive fully-supervised models.
@inproceedings{saini2021disfluency,title={Disfluency correction using unsupervised and semi-supervised learning},author={Saini, Nikhil and Trivedi, Drumil and Khare, Shreya and Dhamecha, Tejas and Jyothi, Preethi and Bharadwaj, Samarth and Bhattacharyya, Pushpak},booktitle={Proceedings of EACL},pages={3421--3427},year={2021}}
EACL
Meta-Learning for Effective Multi-task and Multilingual Modelling
Ishan
Tarunesh, Sushil
Khyalia, Vishwajeet
Kumar, and
2 more authors
Natural language processing (NLP) tasks (e.g. question-answering in English) benefit from knowledge of other tasks (e.g. named entity recognition in English) and knowledge of other languages (e.g. question-answering in Spanish). Such shared representations are typically learned in isolation, either across tasks or across languages. In this work, we propose a meta-learning approach to learn the interactions between both tasks and languages. We also investigate the role of different sampling strategies used during meta-learning. We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset. Our meta-learned model clearly improves in performance compared to competitive baseline models that also include multi-task baselines. We also present zero-shot evaluations on unseen target languages to demonstrate the utility of our proposed model.
@inproceedings{tarunesh2021meta,title={Meta-Learning for Effective Multi-task and Multilingual Modelling},author={Tarunesh, Ishan and Khyalia, Sushil and Kumar, Vishwajeet and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of EACL},pages={3600--3612},year={2021}}
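As a rough illustration of meta-learning over (task, language) pairs, the sketch below uses a first-order Reptile-style update with a toy linear model and a uniform episode sampler; the paper's actual meta-learning algorithm, model, and sampling strategies are not reproduced here.

```python
# First-order meta-learning sketch over (task, language) episodes.
# This is a Reptile-style stand-in, not the paper's exact algorithm or sampler.
import random
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)                      # tiny shared model
episodes = [("QA", "hi"), ("NER", "es"), ("QA", "es"), ("NLI", "de")]

def sample_batch(task, lang, n=8):
    # Hypothetical data loader: returns random features/labels for illustration.
    return torch.randn(n, 16), torch.randint(0, 4, (n,))

meta_lr, inner_lr, inner_steps = 0.1, 0.01, 3
for meta_step in range(100):
    task, lang = random.choice(episodes)            # uniform sampling strategy
    fast = torch.nn.Linear(16, 4)
    fast.load_state_dict(model.state_dict())        # clone the current initialization
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                    # inner-loop adaptation on the episode
        x, y = sample_batch(task, lang)
        loss = torch.nn.functional.cross_entropy(fast(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                           # Reptile-style outer update
        for p, q in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (q - p)
```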
2020
Interspeech
Black-Box Adaptation of ASR for Accented Speech
Kartik
Khandelwal, Preethi
Jyothi, Abhijeet
Awasthi, and
1 more author
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent. While leading online ASR services obtain impressive performance on mainstream accents, they perform poorly on sub-populations: we observed that the word error rate (WER) achieved by Google’s ASR API on Indian accents is almost twice the WER on US accents. Existing adaptation methods either require access to model parameters or overlay an error-correcting module on output transcripts. We highlight the need for correlating outputs with the original speech to fix accent errors. Accordingly, we propose a novel coupling of an open-source accent-tuned local model with the black-box service where the output from the service guides frame-level inference in the local model. Our fine-grained merging algorithm is better at fixing accent errors than existing word-level combination strategies. Experiments on Indian and Australian accents with three leading ASR models as the service show that we achieve as much as 28% relative reduction in WER over both the local and service models.
@inproceedings{khandelwal2020black,title={Black-Box Adaptation of ASR for Accented Speech},author={Khandelwal, Kartik and Jyothi, Preethi and Awasthi, Abhijeet and Sarawagi, Sunita},booktitle={Proceedings of Interspeech},pages={1281--1285},year={2020}}
Interspeech
Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS
Yash
Sharma, Basil
Abraham, Karan
Taneja, and
1 more author
Building Automatic Speech Recognition (ASR) systems for code-switched speech has recently gained renewed attention due to the widespread use of speech technologies in multilingual communities worldwide. End-to-end ASR systems are a natural modeling choice due to their ease of use and superior performance in monolingual settings. However, it is well known that end-to-end systems require large amounts of labeled speech. In this work, we investigate improving code-switched ASR in low resource settings via data augmentation using code-switched text-to-speech (TTS) synthesis. We propose two targeted techniques to effectively leverage TTS speech samples: 1) Mixup, an existing technique to create new training samples via linear interpolation of existing samples, applied to TTS and real speech samples, and 2) a new loss function, used in conjunction with TTS samples, to encourage code-switched predictions. We report significant improvements in ASR performance achieving absolute word error rate (WER) reductions of up to 5%, and measurable improvement in code switching using our proposed techniques on a Hindi-English code-switched ASR task.
@inproceedings{sharma2020improving,title={Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS},author={Sharma, Yash and Abraham, Basil and Taneja, Karan and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={4771--4775},year={2020}}
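The mixup component can be sketched at the acoustic-feature level as below, with the mixing weight drawn from a Beta distribution as in standard mixup; the feature matrices are random placeholders and asr_loss stands in for the model's actual sequence loss, so this shows only the shape of the idea.

```python
# Sketch of mixup between a real code-switched utterance and a TTS utterance.
import numpy as np

rng = np.random.default_rng(0)

def asr_loss(features, transcript):
    # Placeholder for a real CTC/attention loss over (features, transcript).
    return float(np.mean(features ** 2)) + 0.01 * len(transcript)

def mixup_loss(real_feats, real_text, tts_feats, tts_text, alpha=0.2):
    lam = rng.beta(alpha, alpha)
    T = min(len(real_feats), len(tts_feats))                  # align lengths crudely
    mixed = lam * real_feats[:T] + (1.0 - lam) * tts_feats[:T]
    # Convex combination of the losses w.r.t. both reference transcripts.
    return lam * asr_loss(mixed, real_text) + (1.0 - lam) * asr_loss(mixed, tts_text)

real = rng.standard_normal((120, 80))   # 120 frames of 80-dim filterbank features
tts = rng.standard_normal((110, 80))
print(mixup_loss(real, "mera phone kho gaya", tts, "call me later"))
```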
Interspeech
Caption alignment for low resource audio-visual data
Vighnesh Reddy
Konda, Mayur
Warialani, Rakesh Prasanth
Achari, and
6 more authors
Understanding videos via captioning has gained a lot of traction recently. While captions are provided alongside videos, the information about where a caption aligns within a video is missing, which could be particularly useful for indexing and retrieval. Existing work on learning to infer alignments has mostly exploited visual features and ignored the audio signal. Video understanding applications often underestimate the importance of the audio modality. We focus on how to make effective use of the audio modality for temporal localization of captions within videos. We release a new audio-visual dataset that has captions time-aligned by (i) carefully listening to the audio and watching the video, and (ii) watching only the video. Our dataset is audio-rich and contains captions in two languages, English and Marathi (a low-resource language). We further propose an attention-driven multimodal model for effective utilization of both audio and video for temporal localization. We then investigate (i) the effects of audio in both data preparation and model design, and (ii) effective pretraining strategies (AudioSet, ASR bottleneck features, PASE, etc.) for handling the low-resource setting to help extract rich audio representations.
@inproceedings{konda2020caption,title={Caption alignment for low resource audio-visual data},author={Konda, Vighnesh Reddy and Warialani, Mayur and Achari, Rakesh Prasanth and Bhatnagar, Varad and Akula, Jayaprakash and Jyothi, Preethi and Ramakrishnan, Ganesh and Haffari, Gholamreza and Singh, Pankaj},booktitle={Proceedings of Interspeech},pages={3525--3529},year={2020},}
ACL
How accents confound: Probing for accent information in end-to-end speech recognition systems
In this work, we present a detailed analysis of how accent information is reflected in the internal representation of speech in an end-to-end automatic speech recognition (ASR) system. We use a state-of-the-art end-to-end ASR system, comprising convolutional and recurrent layers, that is trained on a large amount of US-accented English speech and evaluate the model on speech samples from seven different English accents. We examine the effects of accent on the internal representation using three main probing techniques: a) Gradient-based explanation methods, b) Information-theoretic measures, and c) Outputs of accent and phone classifiers. We find different accents exhibiting similar trends irrespective of the probing technique used. We also find that most accent information is encoded within the first recurrent layer, which is suggestive of how one could adapt such an end-to-end model to learn representations that are invariant to accents.
@inproceedings{prasad2020accents,title={How accents confound: Probing for accent information in end-to-end speech recognition systems},author={Prasad, Archiki and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={3739--3753},year={2020},}
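One of the three probing techniques (accent classifiers on internal representations) can be sketched as below. The layer activations here are synthetic placeholders with a weak injected accent signal; in practice they would be pooled hidden states extracted from each layer of the trained ASR model.

```python
# Probing sketch: train a linear accent classifier on frozen layer activations and
# compare accuracy across layers. Activations here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utts, n_accents = 600, 7
accents = rng.integers(0, n_accents, size=n_utts)

for layer in ["conv2", "rnn1", "rnn2", "rnn5"]:
    # Synthetic stand-in for mean-pooled activations of this layer
    # (a weak accent signal is injected so the probe has something to find).
    feats = rng.standard_normal((n_utts, 256)) \
        + 0.3 * np.eye(n_accents)[accents] @ rng.standard_normal((n_accents, 256))
    Xtr, Xte, ytr, yte = train_test_split(feats, accents, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(layer, "probe accuracy:", round(probe.score(Xte, yte), 3))
```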
ICASSP
Coupled training of sequence-to-sequence models for accented speech recognition
Accented speech poses significant challenges for state-of-the-art automatic speech recognition (ASR) systems. Accent is a property of speech that lasts throughout an utterance in varying degrees of strength. This makes it hard to isolate the influence of accent on individual speech sounds. We propose coupled training for encoder-decoder ASR models that acts on pairs of utterances corresponding to the same text spoken by speakers with different accents. This training regime introduces an L2 loss between the attention-weighted representations corresponding to pairs of utterances with the same text, thus acting as a regularizer and encouraging representations from the encoder to be more accent-invariant. We focus on recognizing accented English samples from the Mozilla Common Voice corpus. We obtain significant error rate reductions on accented samples from a large set of diverse accents using coupled training. We also show consistent improvements in performance on heavily accented samples (as determined by a standalone accent classifier).
@inproceedings{unni2020coupled,title={Coupled training of sequence-to-sequence models for accented speech recognition},author={Unni, Vinit and Joshi, Nitish and Jyothi, Preethi},booktitle={Proceedings of ICASSP},pages={8254--8258},year={2020}}
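The coupling regularizer can be sketched as an L2 penalty between attention-weighted encoder summaries of two utterances sharing the same text, added to the usual per-utterance ASR losses. Encoder states and attention weights below are random placeholders.

```python
# Sketch of the coupling regularizer for paired utterances with identical text.
import torch

torch.manual_seed(0)

def attention_summary(encoder_states, attn_weights):
    """Attention-weighted sum of encoder states: (T, D), (T,) -> (D,)."""
    return (attn_weights.unsqueeze(1) * encoder_states).sum(dim=0)

# Placeholders for encoder outputs of two accented readings of the same sentence,
# plus their decoder attention weights.
enc_a, enc_b = torch.randn(90, 256), torch.randn(104, 256)
attn_a = torch.softmax(torch.randn(90), dim=0)
attn_b = torch.softmax(torch.randn(104), dim=0)

coupling_loss = torch.nn.functional.mse_loss(
    attention_summary(enc_a, attn_a), attention_summary(enc_b, attn_b)
)
# total_loss = asr_loss_a + asr_loss_b + lambda_couple * coupling_loss
print(float(coupling_loss))
```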
LREC
Crowdsourcing speech data for low-resource languages from low-income workers
Basil
Abraham, Danish
Goel, Divya
Siddarth, and
7 more authors
Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology whose dialects are often very different from low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to providing diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study where we collected labelled speech data in the Marathi language from three different user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work) and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.
@inproceedings{abraham2020crowdsourcing,title={Crowdsourcing speech data for low-resource languages from low-income workers},author={Abraham, Basil and Goel, Danish and Siddarth, Divya and Bali, Kalika and Chopra, Manu and Choudhury, Monojit and Joshi, Pratik and Jyothi, Preethi and Sitaram, Sunayana and Seshadri, Vivek},booktitle={Proceedings of LREC},pages={2819--2826},year={2020},}
IWSLT
Generating fluent translations from disfluent text without access to fluent references: IIT Bombay@ IWSLT2020
Nikhil
Saini, Jyotsana
Khatri, Preethi
Jyothi, and
1 more author
Machine translation systems perform reasonably well when the input is well-formed speech or text. Conversational speech is spontaneous and inherently consists of many disfluencies. Producing fluent translations of disfluent source text would typically require parallel disfluent to fluent training data. However, fluent translations of spontaneous speech are an additional resource that is tedious to obtain. This work describes the submission of IIT Bombay to the Conversational Speech Translation challenge at IWSLT 2020. We specifically tackle the problem of disfluency removal in disfluent-to-fluent text-to-text translation assuming no access to fluent references during training. Common patterns of disfluency are extracted from disfluent references and a noise induction model is used to simulate them starting from a clean monolingual corpus. This synthetically constructed dataset is then considered as a proxy for labeled data during training. We also make use of additional fluent text in the target language to help generate fluent translations. This work uses no fluent references during training and beats a baseline model by a margin of 4.21 and 3.11 BLEU points where the baseline uses disfluent and fluent references, respectively.
@inproceedings{saini2020generating,title={Generating fluent translations from disfluent text without access to fluent references: IIT Bombay@ IWSLT2020},author={Saini, Nikhil and Khatri, Jyotsana and Jyothi, Preethi and Bhattacharyya, Pushpak},booktitle={Proceedings of IWSLT},pages={178--186},year={2020},}
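The noise-induction step (simulating disfluencies on clean monolingual text) can be sketched as below; the filler inventory and insertion probabilities are hypothetical, not the disfluency patterns actually extracted from the disfluent references.

```python
# Sketch of a noise-induction model: turn fluent text into synthetic disfluent text
# so that (disfluent, fluent) pairs can be created without fluent references.
import random

random.seed(0)
FILLERS = ["uh", "um", "you know"]          # hypothetical filler inventory

def add_disfluencies(sentence, p_filler=0.15, p_repeat=0.10):
    noisy = []
    for word in sentence.split():
        if random.random() < p_filler:
            noisy.append(random.choice(FILLERS))
        noisy.append(word)
        if random.random() < p_repeat:      # simulate a repetition-type disfluency
            noisy.append(word)
    return " ".join(noisy)

fluent = "i want to book a ticket to mumbai tomorrow"
print(add_disfluencies(fluent))             # synthetic disfluent source for training
```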
2019
ACL
Cross-Lingual Training for Automatic Question Generation
Vishwajeet
Kumar, Nitish
Joshi, Arijit
Mukherjee, and
2 more authors
Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) Unsupervised pretraining of language models in both primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. We also create and release a new question answering dataset for Hindi consisting of 6555 sentences.
@inproceedings{kumar2019cross,title={Cross-Lingual Training for Automatic Question Generation},author={Kumar, Vishwajeet and Joshi, Nitish and Mukherjee, Arijit and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of ACL},pages={4863--4872},year={2019},}
Interspeech
Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition.
Karan
Taneja, Satarupa
Guha, Preethi
Jyothi, and
1 more author
One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Often, however, monolingual speech corpora are available in abundance for the languages in the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the span lengths corresponding to a particular language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
@inproceedings{taneja2019exploiting,title={Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition.},author={Taneja, Karan and Guha, Satarupa and Jyothi, Preethi and Abraham, Basil},booktitle={Proceedings of Interspeech},pages={2150--2154},year={2019}}
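At the word level, the construction of synthetic code-mixed utterances can be sketched as below: alternate between languages, sampling each span length from an (assumed) empirical distribution and drawing that many words from the corresponding monolingual utterance. The paper additionally constrains phone transitions at switch boundaries and concatenates the matching speech segments, which this sketch omits.

```python
# Sketch of synthetic code-mixed utterance construction from monolingual data.
# Span-length samples below are hypothetical stand-ins for the distributions
# estimated from real code-mixed text.
import random

random.seed(1)
span_lengths = {"hi": [1, 2, 2, 3], "en": [1, 1, 2]}   # assumed empirical samples

hindi_words = "mujhe kal subah jaldi uthna hai kyunki meeting hai".split()
english_words = "please remind me about the morning meeting today".split()
streams = {"hi": iter(hindi_words), "en": iter(english_words)}

mixed, lang = [], "hi"
while True:
    span = random.choice(span_lengths[lang])
    chunk = [w for _, w in zip(range(span), streams[lang])]
    if not chunk:                                      # one stream exhausted: stop
        break
    mixed.extend(chunk)
    lang = "en" if lang == "hi" else "hi"              # switch language
print(" ".join(mixed))
```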
2018
EMNLP
Revisiting the Importance of Encoding Logic Rules in Sentiment Classification
We analyze the performance of different sentiment classification models on syntactically complex inputs like A-but-B sentences. The first contribution of this analysis addresses reproducible research: to meaningfully compare different models, their accuracies must be averaged over far more random seeds than what has traditionally been reported. With proper averaging in place, we notice that the distillation model, which incorporates explicit logic rules for sentiment classification, is ineffective. In contrast, using contextualized ELMo embeddings instead of logic rules yields significantly better performance. Additionally, we provide analysis and visualizations that demonstrate ELMo’s ability to implicitly learn logic rules. Finally, a crowdsourced analysis reveals how ELMo outperforms baseline models even on sentences with ambiguous sentiment labels.
@inproceedings{krishna2018revisiting,title={Revisiting the Importance of Encoding Logic Rules in Sentiment Classification},author={Krishna, Kalpesh and Jyothi, Preethi and Iyyer, Mohit},booktitle={Proceedings of EMNLP},pages={4743--4751},year={2018},}
EMNLP
Code-switched Language Models Using Dual RNNs and Same-Source Pretraining
This work focuses on building language models (LMs) for code-switched text. We propose two techniques that significantly improve these LMs: 1) A novel recurrent neural network unit with dual components that focus on each language in the code-switched text separately 2) Pretraining the LM using synthetic text from a generative model estimated using the training data. We demonstrate the effectiveness of our proposed techniques by reporting perplexities on a Mandarin-English task and derive significant reductions in perplexity.
@inproceedings{garg2018code,title={Code-switched Language Models Using Dual RNNs and Same-Source Pretraining},author={Garg, Saurabh and Parekh, Tanmay and Jyothi, Preethi},booktitle={Proceedings of EMNLP},pages={3078--3083},year={2018},}
Interspeech
Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning.
One of the major remaining challenges in modern automatic speech recognition (ASR) systems for English is to be able to handle speech from users with a diverse set of accents. ASR systems that are trained on speech from multiple English accents still underperform when confronted with a new speech accent. In this work, we explore how to use accent embeddings and multi-task learning to improve speech recognition for accented speech. We propose a multi-task architecture that jointly learns an accent classifier and a multi-accent acoustic model. We also consider augmenting the speech input with accent information in the form of embeddings extracted by a separate network. These techniques together give significant relative performance improvements of 15% and 10% over a multi-accent baseline system on test sets containing seen and unseen accents, respectively.
@inproceedings{jain2018improved,title={Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning.},author={Jain, Abhinav and Upreti, Minali and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={2454--2458},year={2018}}
Interspeech
Dual Language Models for Code Switched Speech Recognition
In this work, we present a simple and elegant approach to language modeling for bilingual code-switched text. Since code-switching is a blend of two or more different languages, a standard bilingual language model can be improved upon by using structures of the monolingual language models. We propose a novel technique called dual language models, which involves building two complementary monolingual language models and combining them using a probabilistic model for switching between the two. We evaluate the efficacy of our approach using a conversational Mandarin-English speech corpus. We prove the robustness of our model by showing significant improvements in perplexity measures over the standard bilingual language model without the use of any external information. Similar consistent improvements are also reflected in automatic speech recognition error rates.
@inproceedings{garg2018dual,title={Dual Language Models for Code Switched Speech Recognition},author={Garg, Saurabh and Parekh, Tanmay and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={2598--2602},year={2018}}
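The dual language model idea can be illustrated with unigram component LMs and a single switch probability, as below; the actual work combines far stronger monolingual LMs with a richer switching model, so treat this purely as a shape-of-the-idea sketch with made-up probabilities.

```python
# Toy dual language model: two monolingual LMs plus a probabilistic switch.
import math

# Hypothetical unigram probabilities for each monolingual LM.
LM = {
    "en": {"call": 0.02, "me": 0.05, "later": 0.01},
    "hi": {"mujhe": 0.03, "baad": 0.02, "mein": 0.06},
}
P_SWITCH = 0.3          # probability of switching language at a word boundary

def word_lang(w):
    return "en" if w in LM["en"] else "hi"

def log_prob(sentence, oov=1e-5):
    lp, prev = 0.0, None
    for w in sentence.split():
        lang = word_lang(w)
        if prev is not None:
            lp += math.log(P_SWITCH if lang != prev else 1.0 - P_SWITCH)
        lp += math.log(LM[lang].get(w, oov))
        prev = lang
    return lp

print(log_prob("call me baad mein"))   # code-switched sentence scored by the dual LM
```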
Interspeech
Time Aggregation Operators for Multi-label Audio Event Detection.
Pankaj
Joshi, Digvijaysingh
Gautam, Ganesh
Ramakrishnan, and
1 more author
In this paper, we present a state-of-the-art system for audio event detection. The labels on the training (and evaluation) data specify the set of events occurring in each audio clip, but neither the time spans nor the order in which they occur. Specifically, our task of weakly supervised learning is the “Detection and Classification of Acoustic Scenes and Events (DCASE) 2017” challenge [5]. We use the winning entry in this challenge given by Xu et al. [10] as our starting point and identify several important modifications that allow us to improve on their results significantly. Our techniques pertain to aggregation and consolidation over time and frequency signals over a (temporal) sequence before decoding the labels. In general, our work is also relevant to other tasks involving learning from weak labeling of sequential data.
@inproceedings{joshi2018time,title={Time Aggregation Operators for Multi-label Audio Event Detection.},author={Joshi, Pankaj and Gautam, Digvijaysingh and Ramakrishnan, Ganesh and Jyothi, Preethi},booktitle={Proceedings of Interspeech},pages={3309--3313},year={2018},}
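The notion of a time-aggregation operator can be illustrated as below: per-frame event probabilities are pooled into clip-level scores, and the choice of pooling (mean, max, or attention-style weighted pooling) is the kind of operator the paper's modifications concern. The probabilities and attention scores here are random placeholders.

```python
# Sketch of time-aggregation operators for weakly supervised audio event detection:
# pool per-frame event probabilities (T x C) into clip-level scores (C,).
import numpy as np

rng = np.random.default_rng(0)
frame_probs = rng.uniform(size=(240, 17))          # 240 frames, 17 event classes

def mean_pool(p):
    return p.mean(axis=0)

def max_pool(p):
    return p.max(axis=0)

def attention_pool(p, attn_logits):
    w = np.exp(attn_logits) / np.exp(attn_logits).sum(axis=0, keepdims=True)
    return (w * p).sum(axis=0)                     # per-class weighted average over time

attn_logits = rng.standard_normal(frame_probs.shape)   # placeholder attention scores
for name, clip_scores in [("mean", mean_pool(frame_probs)),
                          ("max", max_pool(frame_probs)),
                          ("attn", attention_pool(frame_probs, attn_logits))]:
    print(name, (clip_scores > 0.5).astype(int)[:5])   # thresholded clip-level labels
```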
ICLR
Generalizing Across Domains via Cross-Gradient Training
Shiv
Shankar, Vihari
Piratla, Soumen
Chakrabarti, and
3 more authors
We present CROSSGRAD, a method to use multi-domain training data to learn a classifier that generalizes to new domains. CROSSGRAD does not need an adaptation phase via labeled or unlabeled data, or domain features in the new domain. Most existing domain adaptation methods attempt to erase domain signals using techniques like domain adversarial training. In contrast, CROSSGRAD is free to use domain signals for predicting labels, if it can prevent overfitting on training domains. We conceptualize the task in a Bayesian setting, in which a sampling step is implemented as data augmentation, based on domain-guided perturbations of input instances. CROSSGRAD parallelly trains a label and a domain classifier on examples perturbed by loss gradients of each other’s objectives. This enables us to directly perturb inputs, without separating and re-mixing domain signals while making various distributional assumptions. Empirical evaluation on three different applications where this setting is natural establishes that (1) domain-guided perturbation provides consistently better generalization to unseen domains, compared to generic instance perturbation methods, and that (2) data augmentation is a more stable and accurate method than domain adversarial training.
@inproceedings{shankar2018generalizing,title={Generalizing Across Domains via Cross-Gradient Training},author={Shankar, Shiv and Piratla, Vihari and Chakrabarti, Soumen and Chaudhuri, Siddhartha and Jyothi, Preethi and Sarawagi, Sunita},booktitle={Proceedings of ICLR},year={2018},}
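The core CROSSGRAD step can be sketched as below with toy models and random data: each input is perturbed along the gradient of the other classifier's loss before being used to train its own classifier. The hyperparameters and the absence of gradient clipping are simplifications relative to the paper.

```python
# Sketch of CROSSGRAD's cross-gradient perturbation (toy models and data).
import torch

torch.manual_seed(0)
label_net = torch.nn.Linear(32, 5)     # label classifier
domain_net = torch.nn.Linear(32, 3)    # domain classifier
params = list(label_net.parameters()) + list(domain_net.parameters())
opt = torch.optim.SGD(params, lr=0.01)
ce = torch.nn.functional.cross_entropy
eps, alpha = 0.5, 0.75                 # perturbation scale and loss mixing weight

for step in range(200):
    x = torch.randn(16, 32)            # placeholder features
    y = torch.randint(0, 5, (16,))     # class labels
    d = torch.randint(0, 3, (16,))     # domain labels

    x.requires_grad_(True)
    grad_domain = torch.autograd.grad(ce(domain_net(x), d), x)[0]
    grad_label = torch.autograd.grad(ce(label_net(x), y), x)[0]
    x = x.detach()
    x_for_label = x + eps * grad_domain    # domain-guided perturbation for the label net
    x_for_domain = x + eps * grad_label    # label-guided perturbation for the domain net

    loss = (alpha * ce(label_net(x), y) + (1 - alpha) * ce(label_net(x_for_label), y)
            + alpha * ce(domain_net(x), d) + (1 - alpha) * ce(domain_net(x_for_domain), d))
    opt.zero_grad(); loss.backward(); opt.step()
```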
2017
ASRU
Leveraging native language speech for accent identification using deep siamese networks
Aditya
Siddhant, Preethi
Jyothi, and Sriram
Ganapathy
The problem of automatic accent identification is important for several applications like speaker profiling and recognition as well as for improving speech recognition systems. The accented nature of speech can be primarily attributed to the influence of the speaker’s native language on the given speech recording. In this paper, we propose a novel accent identification system whose training exploits speech in native languages along with the accented speech. Specifically, we develop a deep Siamese network based model which learns the association between accented speech recordings and the native language speech recordings. The Siamese networks are trained with i-vector features extracted from the speech recordings using either an unsupervised Gaussian mixture model (GMM) or a supervised deep neural network (DNN) model. We perform several accent identification experiments using the CSLU Foreign Accented English (FAE) corpus. In these experiments, our proposed approach using deep Siamese networks yields significant relative performance improvements of 15.4% on a 10-class accent identification task, over a baseline DNN-based classification system that uses GMM i-vectors. Furthermore, we present a detailed error analysis of the proposed accent identification system.
@inproceedings{siddhant2017leveraging,title={Leveraging native language speech for accent identification using deep siamese networks},author={Siddhant, Aditya and Jyothi, Preethi and Ganapathy, Sriram},booktitle={Proceedings of ASRU},pages={621--628},year={2017},}
Asilomar
Mismatched crowdsourcing: Mining latent skills to acquire speech transcriptions
Mark
Hasegawa-Johnson, Preethi
Jyothi, Wenda
Chen, and
1 more author
Automatic speech recognition (ASR) converts audio to text. ASR is usually trained using a large quantity of labeled data, i.e., audio with text transcription. In many languages, however, text transcription is hard to find, e.g., in both Hokkien and Dinka, we found native speakers who had received all their primary education in some other language, and who therefore had difficulty writing in their own language. Fortunately, speech in every language is produced by human mouths, and designed to be interpreted by human ears. Speakers of a majority language (English, say, or Mandarin Chinese) are therefore able to make some sense of even the strangest language (Zulu, say, or Cantonese): language-unique distinctions are mostly lost, but universal distinctions such as consonant versus vowel are, for the most part, correctly transmitted. We can decode such mismatched transcripts using an information-theoretic decoder, resulting in a low-entropy probability distribution over the possible native-language transcriptions. Mismatched transcripts can be used to train ASR. Combining ten hours of mismatched transcripts with 12–48 minutes of native transcripts, if available, results in lower phone error rate. On the other hand, if we don’t even know the native phoneme inventory, mismatched transcripts in two or more annotation languages can be used to infer the native phoneme inventory (with entropy depending on the distinctive feature inventory of the annotation languages).
@inproceedings{hasegawa2017mismatched,title={Mismatched crowdsourcing: Mining latent skills to acquire speech transcriptions},author={Hasegawa-Johnson, Mark and Jyothi, Preethi and Chen, Wenda and Do, Van Hai},booktitle={Proceedings of Asilomar},pages={1277--1281},year={2017},}
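The decoding of mismatched transcripts can be illustrated with a tiny noisy-channel example: given assumed per-phone misperception probabilities and several independent nonsense transcripts of the same word, choose the native-language candidate maximizing the summed channel log-likelihood. The channel, lexicon, and equal-length assumption are all simplifications.

```python
# Toy noisy-channel decode of mismatched transcripts.
import math

# Hypothetical channel: P(annotator writes Latin letter | native phone).
CHANNEL = {
    "dh": {"d": 0.6, "t": 0.3, "th": 0.1},
    "a":  {"a": 0.8, "u": 0.2},
    "l":  {"l": 0.9, "r": 0.1},
    "r":  {"r": 0.7, "l": 0.3},
}
# Candidate native words as phone sequences (hypothetical lexicon).
CANDIDATES = {"dhal": ["dh", "a", "l"], "dhar": ["dh", "a", "r"]}

def log_likelihood(phones, transcript):
    """Channel log-likelihood of one annotator's letter sequence."""
    if len(transcript) != len(phones):
        return float("-inf")                       # toy model: require equal length
    return sum(math.log(CHANNEL[p].get(c, 1e-4)) for p, c in zip(phones, transcript))

def decode(transcripts):
    """Combine several independent mismatched transcripts of the same word."""
    score = lambda w: sum(log_likelihood(CANDIDATES[w], t) for t in transcripts)
    return max(CANDIDATES, key=score)

print(decode([["d", "a", "l"], ["t", "u", "l"], ["d", "a", "r"]]))  # decoded native word
```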
ICASSP
Low-resource grapheme-to-phoneme conversion using recurrent neural networks
Grapheme-to-phoneme (G2P) conversion is an important problem for many speech and language processing applications. G2P models are particularly useful for low-resource languages that do not have well-developed pronunciation lexicons. Prominent G2P paradigms are based on initial alignments between grapheme and phoneme sequences. In this work, we devise new alignment strategies that work effectively with recurrent neural network based models when only a small number of pronunciations are available to train the models. In a small data setting, we build G2P models for Pashto, Tagalog and Lithuanian that significantly outperform a joint sequence model and a baseline recurrent neural network based model, giving up to 14% and 9% relative reductions in phone and word error rates when trained on a dataset of 250 words.
@inproceedings{jyothi2017low,title={Low-resource grapheme-to-phoneme conversion using recurrent neural networks},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of ICASSP},pages={5030--5034},year={2017},}
2016
TASL
ASR for under-resourced languages from probabilistic transcription
Mark A
Hasegawa-Johnson, Preethi
Jyothi, Daniel
McCloy, and
8 more authors
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016
In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: A probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semi-supervised learning technique, in which a cross-lingual ASR first labels unlabeled speech, and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which nonspeakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which nonspeakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR was trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. Both EEG distribution coding and text-derived phone language models were shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing.
@article{hasegawa2016asr,title={ASR for under-resourced languages from probabilistic transcription},author={Hasegawa-Johnson, Mark A and Jyothi, Preethi and McCloy, Daniel and Mirbagheri, Majid and Di Liberto, Giovanni M and Das, Amit and Ekin, Bradley and Liu, Chunxi and Manohar, Vimal and Tang, Hao and others},journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},volume={25},number={1},pages={50--63},year={2016},}
COLING Workshop
Clustering-based phonetic projection in mismatched crowdsourcing channels for low-resourced ASR
Wenda
Chen, Mark
Hasegawa-Johnson, Nancy
Chen, and
2 more authors
In Proceedings of the Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), COLING, 2016
Acquiring labeled speech for low-resource languages is a difficult task in the absence of native speakers of the language. One solution to this problem involves collecting speech transcriptions from crowd workers who are foreign or non-native speakers of a given target language. From these mismatched transcriptions, one can derive probabilistic phone transcriptions that are defined over the set of all target language phones using a noisy channel model. This paper extends prior work on deriving probabilistic transcriptions (PTs) from mismatched transcriptions by 1) modelling multilingual channels and 2) introducing a clustering-based phonetic mapping technique to improve the quality of PTs. Mismatched crowdsourcing for multilingual channels has certain properties of projection mapping, e.g., it can be interpreted as a clustering based on singular value decomposition of the segment alignments. To this end, we explore the use of distinctive feature weights, lexical tone confusions, and a two-step clustering algorithm to learn projections of phoneme segments from mismatched multilingual transcriber languages to the target language. We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers. We observe a 5–9% relative reduction in phone error rate for the predicted Cantonese phone transcriptions using our proposed techniques compared with the previous PT method.
@inproceedings{chen2016clustering,title={Clustering-based phonetic projection in mismatched crowdsourcing channels for low-resourced ASR},author={Chen, Wenda and Hasegawa-Johnson, Mark and Chen, Nancy and Jyothi, Preethi and Varshney, Lav},booktitle={Proceedings of the Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), COLING},pages={133--141},year={2016},}
Interspeech
Automatic Speech Recognition Using Probabilistic Transcriptions in Swahili, Amharic, and Dinka.
Amit
Das, Preethi
Jyothi, and Mark
Hasegawa-Johnson
In this study, we develop automatic speech recognition systems for three sub-Saharan African languages using probabilistic transcriptions collected from crowd workers who neither speak nor have any familiarity with the African languages. The three African languages in consideration are Swahili, Amharic, and Dinka. There is a language mismatch in this scenario. More specifically, utterances spoken in African languages were transcribed by crowd workers who were mostly native speakers of English. Due to this, such transcriptions are highly prone to label inaccuracies. First, we use a recently introduced technique called mismatched crowdsourcing which processes the raw crowd transcriptions to confusion networks. Next, we adapt both multilingual hidden Markov models (HMM) and deep neural network (DNN) models using the probabilistic transcriptions of the African languages. Finally, we report the results using both deterministic and probabilistic phone error rates (PER). Automatic speech recognition systems developed using this recipe are particularly useful for low resource languages where there is limited access to linguistic resources and/or transcribers in the native language.
@inproceedings{das2016automatic,title={Automatic Speech Recognition Using Probabilistic Transcriptions in Swahili, Amharic, and Dinka.},author={Das, Amit and Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of Interspeech},pages={3524--3528},year={2016},}
SLTU
Performance improvement of probabilistic transcriptions with language-specific constraints
Xiang
Kong, Preethi
Jyothi, and Mark
Hasegawa-Johnson
This article describes a method for reducing the error rate of probabilistic phone-based transcriptions resulting from mismatched crowdsourcing by using language-specific constraints to post-process the phone sequence. In the scenario under consideration, there are no native-language transcriptions or pronunciation dictionary available in the test language; instead, available resources include non-native transcriptions, a rudimentary rule-based G2P, and a list of orthographic word forms mined from the internet. The proposed solution post-processes non-native transcriptions by converting them to test-language orthography, composing with test-language word forms, then converting back to a phone string. Experiments demonstrate that the phone error rate of the transcription is reduced, using this method, by 22% on an independent evaluation-test dataset.
@article{kong2016performance,title={Performance improvement of probabilistic transcriptions with language-specific constraints},author={Kong, Xiang and Jyothi, Preethi and Hasegawa-Johnson, Mark},journal={Proceedings of SLTU Workshop},volume={81},pages={30--36},year={2016},}
CSL
Articulatory feature-based pronunciation modeling
Karen
Livescu, Preethi
Jyothi, and Eric
Fosler-Lussier
Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phone-based approaches to speech recognition. An alternative approach, which has been studied by ourselves and others, is one based on sub-phonetic features rather than phones. In such an approach, a word’s pronunciation is represented as multiple streams of phonological features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or may be more abstract categories such as manner and place.
This article reviews our work on a particular type of articulatory feature-based pronunciation model. The model allows for asynchrony between features, as well as per-feature substitutions, making it more natural to account for many pronunciation changes that are difficult to handle with phone-based models. Such models can be efficiently represented as dynamic Bayesian networks. The feature-based models improve significantly over phone-based counterparts in terms of frame perplexity and lexical access accuracy. The remainder of the article discusses related work and future directions.
@article{livescu2016articulatory,title={Articulatory feature-based pronunciation modeling},author={Livescu, Karen and Jyothi, Preethi and Fosler-Lussier, Eric},journal={Computer Speech \& Language},volume={36},pages={212--232},year={2016},}
ICASSP
Adapting ASR for under-resourced languages using mismatched transcriptions
*Chunxi
Liu, *Preethi
Jyothi, Hao
Tang, and
5 more authors
In Proceedings of ICASSP, 2016. This work received a Speech and Language Processing Student Paper Award.
Mismatched transcriptions of speech in a target language refers to transcriptions provided by people unfamiliar with the language, using English letter sequences. In this work, we demonstrate the value of such transcriptions in building an ASR system for the target language. For different languages, we use less than an hour of mismatched transcriptions to successfully adapt baseline multilingual models built with no access to native transcriptions in the target language. The adapted models provide up to 25% relative improvement in phone error rates on an unseen evaluation set.
@inproceedings{liu2016adapting,title={Adapting ASR for under-resourced languages using mismatched transcriptions},author={Liu, Chunxi and Jyothi, Preethi and Tang, Hao and Manohar, Vimal and Sloan, Rose and Kekona, Tyler and Hasegawa-Johnson, Mark and Khudanpur, Sanjeev},booktitle={Proceedings of ICASSP},pages={5840--5844},year={2016},}
ITA
Language coverage for mismatched crowdsourcing
Lav R
Varshney, Preethi
Jyothi, and Mark
Hasegawa-Johnson
In 2016 Information Theory and Applications Workshop (ITA), 2016
Developing automatic speech recognition technologies requires transcribed speech so as to learn the mapping from sound to text. It is traditionally assumed that transcribers need to be native speakers of the language being transcribed. Mismatched crowdsourcing is the transcription of speech by crowd workers who do not speak the language. Given there are phonological similarities among different human languages, mismatched crowdsourcing does provide noisy data that can be aggregated to yield reliable labels. Here we discuss phonological properties of different languages in a coding-theoretic framework, and how nonnative phoneme misperception can be modeled as a noisy communication channel. We show the results of experiments demonstrating the efficacy of this information theory inspired modeling approach, having native English speakers and native Mandarin speakers transcribe Cantonese speech. Finally we discuss how crowd workers whose native language background gives them the highest probability of faithful transcription can be found by solving a weighted set cover problem.
@inproceedings{varshney2016language,title={Language coverage for mismatched crowdsourcing},author={Varshney, Lav R and Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={2016 Information Theory and Applications Workshop (ITA)},pages={1--9},year={2016},}
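The closing observation about selecting crowd workers by native language can be illustrated with the standard greedy approximation to weighted set cover shown below; the phone inventories and annotation costs are hypothetical.

```python
# Greedy weighted set cover: choose annotator native languages whose phone
# inventories jointly cover the target language's phones at low total cost.
# Inventories and costs are hypothetical, for illustration only.
target_phones = {"p", "t", "k", "ts", "ng", "aa", "ii", "tone_high", "tone_low"}
inventories = {
    "english":  ({"p", "t", "k", "aa", "ii", "ng"}, 1.0),
    "mandarin": ({"p", "t", "k", "ts", "tone_high", "tone_low"}, 1.2),
    "hindi":    ({"p", "t", "k", "aa", "ii"}, 0.8),
}

def greedy_cover(universe, sets_with_costs):
    uncovered, chosen = set(universe), []
    while uncovered:
        # Pick the language with the best cost per newly covered phone.
        lang, (phones, cost) = min(
            ((l, sc) for l, sc in sets_with_costs.items() if sc[0] & uncovered),
            key=lambda item: item[1][1] / len(item[1][0] & uncovered),
        )
        chosen.append(lang)
        uncovered -= phones
    return chosen

print(greedy_cover(target_phones, inventories))   # chosen annotator native languages
```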
2015
Interspeech
Transcribing continuous speech using mismatched crowdsourcing.
Mismatched crowdsourcing derives speech transcriptions using crowd workers unfamiliar with the language being spoken. This approach has been demonstrated for isolated word transcription tasks, but never yet for continuous speech. In this work, we demonstrate mismatched crowdsourcing of continuous speech with a word error rate of under 45% in a large-vocabulary transcription task of short speech segments. In order to scale mismatched crowdsourcing to continuous speech, we propose a number of new WFST pruning techniques based on explicitly low-entropy models of the acoustic similarities among orthographic symbols as understood within a transcriber community. We also provide an information-theoretic analysis and estimate the amount of information lost in transcription by the mismatched crowd workers to be under 5 bits.
@inproceedings{jyothi2015transcribing,title={Transcribing continuous speech using mismatched crowdsourcing.},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of Interspeech},pages={2774--2778},year={2015},}
Interspeech
Improved Hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge.
In this work, we present a new large-vocabulary, broadcast news ASR system for Hindi. Since Hindi has a largely phonemic orthography, the pronunciation model was automatically generated from text. We experiment with several variants of this model and study the effect of incorporating word boundary information with these models. We also experiment with knowledge-based adaptations to the language model in Hindi, derived in an unsupervised manner, that lead to small improvements in word error rate (WER). Our experiments were conducted on a new corpus assembled from publicly-available Hindi news broadcasts. We evaluate our techniques on an open-vocabulary task and obtain competitive WERs on an unseen test set.
@inproceedings{jyothi2015improved,title={Improved hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge.},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of Interspeech},pages={3164--3168},year={2015},}
AAAI
Acquiring speech transcriptions using mismatched crowdsourcing
Transcribed speech is a critical resource for building statistical speech recognition systems. Recent work has looked towards soliciting transcriptions for large speech corpora from native speakers of the language using crowdsourcing techniques. However, native speakers of the target language may not be readily available for crowdsourcing. We examine the following question: can humans unfamiliar with the target language help transcribe? We follow an information-theoretic approach to this problem: (1) We learn the characteristics of a noisy channel that models the transcribers’ systematic perception biases. (2) We use an error-correcting code, specifically a repetition code, to encode the inputs to this channel, in conjunction with a maximum-likelihood decoding rule. To demonstrate the feasibility of this approach, we transcribe isolated Hindi words with the help of Mechanical Turk workers unfamiliar with Hindi. We successfully recover Hindi words with an accuracy of over 85% (and 94% in a 4-best list) using a 15-fold repetition code. We also estimate the conditional entropy of the input to this channel (Hindi words) given the channel output (transcripts from crowdsourced workers) to be less than 2 bits; this serves as a theoretical estimate of the average number of bits of auxiliary information required for errorless recovery.
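A minimal sketch of the decoding rule described above: treat each worker's transcript as an independent observation of the same word through a noisy channel and pick the maximum-likelihood word. The channel probabilities and the two-word candidate lexicon are invented for illustration; the paper's channel model is learned from data.

import math

# Hypothetical channel model: P(crowd transcript | Hindi word).
CHANNEL = {
    "kitaab": {"kitab": 0.5, "keytab": 0.3, "kitap": 0.2},
    "kadam":  {"kadam": 0.6, "cuddum": 0.3, "kitap": 0.1},
}

def ml_decode(transcripts, channel, floor=1e-6):
    """Pick the word maximizing the product of per-transcript likelihoods."""
    def loglik(word):
        return sum(math.log(channel[word].get(t, floor)) for t in transcripts)
    return max(channel, key=loglik)

# Five independent transcriptions of one spoken word (a 5-fold repetition code).
observed = ["kitab", "keytab", "kitab", "kitap", "kitab"]
print(ml_decode(observed, CHANNEL))   # -> 'kitaab'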
@inproceedings{jyothi2015acquiring,title={Acquiring speech transcriptions using mismatched crowdsourcing},author={Jyothi, Preethi and Hasegawa-Johnson, Mark},booktitle={Proceedings of AAAI},volume={29},number={1},year={2015},}
LabPhon
Models of dataset size, question design, and cross-language speech perception for speech crowdsourcing applications
Mark
Hasegawa-Johnson, Jennifer
Cole, Preethi
Jyothi, and
1 more author
Transcribers make mistakes. Workers recruited in a crowdsourcing marketplace, because of their varying levels of commitment and education, make more mistakes than workers in a controlled laboratory setting. Methods for compensating transcriber mistakes are desirable because, with such methods available, crowdsourcing has the potential to significantly increase the scale of experiments in laboratory phonology. This paper provides a brief tutorial on statistical learning theory, introducing the relationship between dataset size and estimation error, then presents a theoretical description and preliminary results for two new methods that control labeler error in laboratory phonology experiments. First, we discuss the method of crowdsourcing over error-correcting codes. In the error-correcting-code method, each difficult labeling task is first factored, by the experimenter, into the product of several easy labeling tasks (typically binary). Factoring increases the total number of tasks; nevertheless, it results in faster completion and higher accuracy, because workers unable to perform the difficult task may be able to meaningfully contribute to the solution of each easy task. Second, we discuss the use of explicit mathematical models of the errors made by a worker in the crowd. In particular, we introduce the method of mismatched crowdsourcing, in which workers transcribe a language they do not understand, and an explicit mathematical model of second-language phoneme perception is used to learn and then compensate their transcription errors. Though introduced as technologies that increase the scale of phonology experiments, both methods have implications beyond increased scale. The method of easy questions permits us to probe the perception, by untrained listeners, of complicated phonological models; examples are provided from the prosody of English and Hindi. The method of mismatched crowdsourcing permits us to probe, in more detail than ever before, the perception of phonetic categories by listeners with a different phonological system.
@article{hasegawa2015models,title={Models of dataset size, question design, and cross-language speech perception for speech crowdsourcing applications},author={Hasegawa-Johnson, Mark and Cole, Jennifer and Jyothi, Preethi and Varshney, Lav R},journal={Laboratory Phonology},volume={6},number={3-4},pages={381--431},year={2015},}
ICPhS
Prosodic and structural correlates of perceived prominence in Russian and Hindi.
Tatiana
Luchkina, Jennifer S
Cole, Preethi
Jyothi, and
1 more author
Perceived prominence in Russian and Hindi, free word order languages, can be communicated prosodically and structurally, via word order. Paired production and perception experiments with native speakers show that discourse-prominent constituents are marked acoustically, via a perceptible increase in vowel intensity and f0, and structurally, via a change in word order that places a word in a designated position in a sentence or clause.
@inproceedings{luchkina2015prosodic,title={Prosodic and structural correlates of perceived prominence in Russian and Hindi.},author={Luchkina, Tatiana and Cole, Jennifer S and Jyothi, Preethi and Puri, Vandana},booktitle={Proceedings of ICPhS},year={2015}}
2014
SIGMORPHON
Revisiting word neighborhoods for speech recognition
Preethi
Jyothi and Karen
Livescu
In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, 2014
Word neighborhoods have been suggested but not thoroughly explored as an explanatory variable for errors in automatic speech recognition (ASR). We revisit the definition of word neighborhoods, propose new measures using a fine-grained articulatory representation of word pronunciations, and consider new neighbor weighting functions. We analyze the significance of our measures as predictors of errors in an isolated-word ASR system and a continuous-word ASR system. We find that our measures are significantly better predictors of ASR errors than previously used neighborhood density measures.
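As a concrete (and simplified) instance of a weighted neighborhood measure, the sketch below scores each neighbor by Levenshtein distance over phone strings, exponentially downweights distant neighbors, and scales by frequency. The mini-lexicon and the weighting function are assumptions; the paper's measures use a finer-grained articulatory representation of pronunciations.

import math

# Hypothetical mini-lexicon: word -> (phone string, corpus frequency).
LEXICON = {
    "cat":  (["k", "ae", "t"], 120),
    "bat":  (["b", "ae", "t"], 80),
    "cast": (["k", "ae", "s", "t"], 30),
    "dog":  (["d", "ao", "g"], 150),
}

def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def weighted_density(word, lexicon, scale=1.0):
    """Sum of exp(-scale * distance) * frequency over all other words."""
    phones, _ = lexicon[word]
    return sum(
        math.exp(-scale * edit_distance(phones, p)) * freq
        for w, (p, freq) in lexicon.items() if w != word
    )

print(round(weighted_density("cat", LEXICON), 2))   # closer, frequent neighbors dominate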
@inproceedings{jyothi2014revisiting,title={Revisiting word neighborhoods for speech recognition},author={Jyothi, Preethi and Livescu, Karen},booktitle={Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM},pages={1--9},year={2014},}
SpeechProsody
An investigation of prosody in Hindi narrative speech
Preethi
Jyothi, Jennifer
Cole, Mark
Hasegawa-Johnson, and
1 more author
This paper investigates how prosodic elements such as prominences and prosodic boundaries in Hindi are perceived. We approach this using data from three sources: (i) native speakers of Hindi without any linguistic expertise, (ii) a linguistically trained expert in Hindi prosody, and finally (iii) classifiers trained on English for automatic prominence and boundary detection. We use speech from a corpus of Hindi narrative speech for our experiments. Our results indicate that non-expert transcribers do not have a consistent notion of prosodic prominences. However, they show considerable agreement regarding the placement of prosodic boundaries. Also, relative to the non-expert transcribers, there is higher agreement between the expert transcriber and the automatically derived labels for prominence (and prosodic boundaries); this suggests the possibility of using classifiers for automatic prediction of these prosodic events in Hindi.
@inproceedings{jyothi2014investigation,title={An investigation of prosody in Hindi narrative speech},author={Jyothi, Preethi and Cole, Jennifer and Hasegawa-Johnson, Mark and Puri, Vandana},booktitle={Proceedings of Speech Prosody},volume={7},pages={623--627},year={2014}}
2013
Interspeech
Discriminative training of WFST factors with application to pronunciation modeling.
Preethi
Jyothi, Eric
Fosler-Lussier, and Karen
Livescu
One of the most popular speech recognition architectures consists of multiple components (like the acoustic, pronunciation and language models) that are modeled as weighted finite state transducer (WFST) factors in a cascade. These factor WFSTs are typically trained in isolation and combined efficiently for decoding. Recent work has explored jointly estimating parameters for these models using considerable amounts of training data. We propose an alternative approach to selectively train factor WFSTs in such an architecture, while still leveraging information from the entire cascade. This technique allows us to effectively estimate parameters of a factor WFST using relatively small amounts of data, if the factor is small. Our approach involves an online training paradigm for linear models adapted for discriminatively training one or more WFSTs in a cascade. We apply this method to train a pronunciation model for recognition on conversational speech, resulting in significant improvements in recognition performance over the baseline model.
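The training loop can be caricatured with a plain averaged perceptron over arc-level features: decode with the current weights, and when the best path disagrees with the reference, add the reference features and subtract the hypothesis features. The decode and featurize stand-ins below replace the actual WFST composition and shortest-path machinery, so treat this as a sketch of the learning rule only.

from collections import defaultdict

def averaged_perceptron(data, decode, featurize, epochs=3):
    w, w_sum, t = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for x, y_ref in data:
            y_hat = decode(x, w)                   # best output under current weights
            if y_hat != y_ref:                     # standard perceptron update
                for f, v in featurize(x, y_ref).items():
                    w[f] += v
                for f, v in featurize(x, y_hat).items():
                    w[f] -= v
            for f, v in w.items():                 # accumulate for averaging
                w_sum[f] += v
            t += 1
    return {f: v / t for f, v in w_sum.items()}    # averaged weights

# Toy usage: two candidate outputs per input, one indicator feature per pair.
CANDS = {"x1": ["bad", "good"]}
def featurize(x, y):
    return {(x, y): 1.0}
def decode(x, w):
    return max(CANDS[x], key=lambda y: sum(w[f] * v for f, v in featurize(x, y).items()))

print(averaged_perceptron([("x1", "good")], decode, featurize))
# ('x1', 'good') ends up with positive weight, ('x1', 'bad') negative.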
@inproceedings{jyothi2013discriminative,title={Discriminative training of WFST factors with application to pronunciation modeling.},author={Jyothi, Preethi and Fosler-Lussier, Eric and Livescu, Karen},booktitle={Proceedings of Interspeech},pages={1961--1965},year={2013},}
IEEE
Conditional random fields in speech, audio, and language processing
Eric
Fosler-Lussier, Yanzhang
He, Preethi
Jyothi, and
1 more author
Conditional random fields (CRFs) are probabilistic sequence models that have been applied in the last decade to a number of applications in audio, speech, and language processing. In this paper, we provide a tutorial overview of CRF technologies, pointing to other resources for more in-depth discussion; in particular, we describe the common linear-chain model as well as a number of common extensions within the CRF family of models. An overview of the mathematical techniques used in training and evaluating these models is also provided, as well as a discussion of the relationships with other probabilistic models. Finally, we survey recent work in speech, audio, and language processing to show how the same CRF technology can be deployed in different scenarios.
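For readers wanting the core object behind the tutorial: a linear-chain CRF assigns a label sequence a probability proportional to the exponentiated sum of weighted emission and transition features. The tiny scorer below computes such an unnormalized path score; the feature templates and weights are placeholders, not anything from the paper.

# Unnormalized linear-chain CRF path score (toy features and weights).
WEIGHTS = {("obs=loud", "lab=stressed"): 1.2, ("prev=stressed", "lab=unstressed"): 0.7}

def path_score(observations, labels, weights):
    score, prev = 0.0, None
    for obs, lab in zip(observations, labels):
        score += weights.get((f"obs={obs}", f"lab={lab}"), 0.0)        # emission feature
        if prev is not None:
            score += weights.get((f"prev={prev}", f"lab={lab}"), 0.0)  # transition feature
        prev = lab
    return score    # P(labels | obs) = exp(score) / Z(obs), Z omitted here

print(round(path_score(["loud", "soft"], ["stressed", "unstressed"], WEIGHTS), 2))  # 1.9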
@article{fosler2013conditional,title={Conditional random fields in speech, audio, and language processing},author={Fosler-Lussier, Eric and He, Yanzhang and Jyothi, Preethi and Prabhavalkar, Rohit},journal={Proceedings of the IEEE},volume={101},number={5},pages={1054--1075},year={2013},}
2012
Interspeech
Discriminatively learning factorized finite state pronunciation models from dynamic Bayesian networks.
Preethi
Jyothi, Eric
Fosler-Lussier, and Karen
Livescu
In Proceedings of Interspeech, 2012. This work received a Best Student Paper Award.
This paper describes an approach to efficiently construct, and discriminatively train, a weighted finite state transducer (WFST) representation for an articulatory feature-based model of pronunciation. This model is originally implemented as a dynamic Bayesian network (DBN). The work is motivated by a desire to (1) incorporate such a pronunciation model in WFST-based recognizers, and to (2) learn discriminative models that are more general than the DBNs. The approach is quite general, though here we show how it applies to a specific model. We use the conditional independence assumptions imposed by the DBN to efficiently convert it into a sequence of WFSTs (factor FSTs) which, when composed, yield the same model as the DBN. We then introduce a linear model of the arc weights of the factor FSTs and discriminatively learn its weights using the averaged perceptron algorithm. We demonstrate the approach using a lexical access task in which we recognize a word given its surface realization. Our experimental results using a phonetically transcribed subset of the Switchboard corpus show that the discriminatively learned model performs significantly better than the original DBN.
@inproceedings{jyothi2012discriminatively,title={Discriminatively learning factorized finite state pronunciation models from dynamic Bayesian networks.},author={Jyothi, Preethi and Fosler-Lussier, Eric and Livescu, Karen},booktitle={Proceedings of Interspeech},pages={1063--1066},year={2012}}
NAACL Workshop
Large-scale discriminative language model reranking for voice-search
Preethi
Jyothi, Leif
Johnson, Ciprian
Chelba, and
1 more author
In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model?, 2012
We present a distributed framework for large-scale discriminative language models that can be integrated within a large vocabulary continuous speech recognition (LVCSR) system using lattice rescoring. We intentionally use a weakened acoustic model in a baseline LVCSR system to generate candidate hypotheses for voice-search data; this allows us to utilize large amounts of unsupervised data to train our models. We propose an efficient and scalable MapReduce framework that uses a perceptron-style distributed training strategy to handle these large amounts of data. We report small but significant improvements in recognition accuracies on a standard voice-search data set using our discriminative reranking model. We also provide an analysis of the various parameters of our models, including model size, types of features, and size of partitions in the MapReduce framework, with the help of supporting experiments.
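To illustrate the perceptron-style distributed training strategy, here is an in-process caricature: each "mapper" runs perceptron updates for reranking on its shard of N-best lists, and the "reducer" averages the per-shard weight vectors. This mirrors one common variant of the distributed perceptron, not the actual MapReduce implementation; the shards and feature dictionaries are toy data.

from collections import defaultdict

def shard_perceptron(shard):
    """shard: list of (nbest, oracle_index); each hypothesis is a feature dict."""
    w = defaultdict(float)
    for nbest, oracle in shard:
        score = lambda feats: sum(w[f] * v for f, v in feats.items())
        best = max(range(len(nbest)), key=lambda i: score(nbest[i]))
        if best != oracle:                      # perceptron update toward the oracle
            for f, v in nbest[oracle].items():
                w[f] += v
            for f, v in nbest[best].items():
                w[f] -= v
    return w

def reduce_average(shard_weights):
    total = defaultdict(float)
    for w in shard_weights:
        for f, v in w.items():
            total[f] += v
    return {f: v / len(shard_weights) for f, v in total.items()}

# Two tiny shards of 2-best lists with unigram-count features.
shards = [
    [([{"play": 1}, {"pray": 1}], 0)],
    [([{"pray": 1}, {"play": 1}], 1)],
]
print(reduce_average([shard_perceptron(s) for s in shards]))
# 'play' receives positive weight, 'pray' negative.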
@inproceedings{jyothi2012large,title={Large-scale discriminative language model reranking for voice-search},author={Jyothi, Preethi and Johnson, Leif and Chelba, Ciprian and Strope, Brian},booktitle={Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model?},pages={41--49},year={2012},}
ICASSP
Distributed discriminative language models for Google voice-search
Preethi
Jyothi, Leif
Johnson, Ciprian
Chelba, and
1 more author
This paper considers large-scale linear discriminative language models trained using a distributed perceptron algorithm. The algorithm is implemented efficiently using a MapReduce/SSTable framework. This work also introduces the use of large amounts of unsupervised data (confidence filtered Google voice-search logs) in conjunction with a novel training procedure that regenerates word lattices for the given data with a weaker acoustic model than the one used to generate the unsupervised transcriptions for the logged data. We observe small but statistically significant improvements in recognition performance after reranking N-best lists of a standard Google voice-search data set.
@inproceedings{jyothi2012distributed,title={Distributed discriminative language models for Google voice-search},author={Jyothi, Preethi and Johnson, Leif and Chelba, Ciprian and Strope, Brian},booktitle={Proceedings of ICASSP},pages={5017--5020},year={2012}}
2011
ICASSP
Lexical access experiments with context-dependent articulatory feature-based models
Preethi
Jyothi, Karen
Livescu, and Eric
Fosler-Lussier
We address the problem of pronunciation variation in conversational speech with a context-dependent articulatory feature-based model. The model is an extension of previous work using dynamic Bayesian networks, which allow for easy factorization of a state into multiple variables representing the articulatory features. We build context-dependent decision trees for the articulatory feature distributions, which are incorporated into the dynamic Bayesian networks, and experiment with different sets of context variables. We evaluate our models on a lexical access task using a phonetically transcribed subset of the Switchboard corpus. We find that our models outperform a context-dependent phonetic baseline.
@inproceedings{jyothi2011lexical,title={Lexical access experiments with context-dependent articulatory feature-based models},author={Jyothi, Preethi and Livescu, Karen and Fosler-Lussier, Eric},booktitle={Proceedings of ICASSP},pages={4900--4903},year={2011},}
2010
Interspeech
Discriminative language modeling using simulated ASR errors.
In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.
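A toy version of the confusion-generation step: use a phone confusion table to expand a word's pronunciation into acoustically confusable lexicon entries, the kind of confusable material the discriminative language model is trained against. The confusion table and lexicon are invented, and the real system builds such confusable word graphs with WFST composition rather than explicit enumeration.

import itertools

# Hypothetical phone confusions and a tiny pronunciation lexicon.
CONFUSIONS = {"p": {"p", "b"}, "ih": {"ih", "iy"}, "t": {"t", "d"}}
LEXICON = {("p", "ih", "t"): "pit", ("b", "ih", "t"): "bit",
           ("b", "iy", "d"): "bead", ("p", "ih", "d"): "pid"}

def confusable_words(phones, confusions, lexicon):
    """All lexicon words reachable by per-phone confusions of `phones`."""
    options = [sorted(confusions.get(p, {p})) for p in phones]
    return {lexicon[c] for c in itertools.product(*options) if c in lexicon}

print(confusable_words(("p", "ih", "t"), CONFUSIONS, LEXICON))
# {'pit', 'bit', 'bead', 'pid'} (set order may vary)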
@inproceedings{jyothi2010discriminative,title={Discriminative language modeling using simulated ASR errors.},author={Jyothi, Preethi and Fosler-Lussier, Eric},booktitle={Proceedings of Interspeech},pages={1049--1052},year={2010},}
NAACL
Investigations into the Crandem approach to word recognition
Rohit
Prabhavalkar, Preethi
Jyothi, William
Hartmann, and
2 more authors
We suggest improvements to a previously proposed framework for integrating Conditional Random Fields and Hidden Markov Models, dubbed a Crandem system (2009). The previous authors’ work suggested that local label posteriors derived from the CRF were too low-entropy for use in word-level automatic speech recognition. As an alternative to the log posterior representation used in their system, we explore frame-level representations derived from the CRF feature functions. We also describe a weight normalization transformation that leads to increased entropy of the CRF posteriors. We report significant gains over the previous Crandem system on the Wall Street Journal word recognition task.
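One simple way to picture a "weight normalization transformation that leads to increased entropy of the CRF posteriors" is temperature-style scaling of the log-posteriors before renormalizing; the sketch below shows that flattening effect. This is an illustrative stand-in, and the exact transformation used in the paper may differ.

import math

def renormalize(logps, scale):
    scaled = [lp * scale for lp in logps]
    z = math.log(sum(math.exp(s) for s in scaled))
    return [math.exp(s - z) for s in scaled]

def entropy(ps):
    return -sum(p * math.log(p, 2) for p in ps if p > 0)

posteriors = [0.97, 0.02, 0.01]                 # an over-confident posterior
logps = [math.log(p) for p in posteriors]
for scale in (1.0, 0.5, 0.25):
    print(scale, round(entropy(renormalize(logps, scale)), 3))
# Smaller scales yield flatter posteriors with higher entropy.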
@inproceedings{prabhavalkar2010investigations,title={Investigations into the Crandem approach to word recognition},author={Prabhavalkar, Rohit and Jyothi, Preethi and Hartmann, William and Morris, Jeremy and Fosler-Lussier, Eric},booktitle={Proceedings of NAACL},pages={725--728},year={2010},}
2009
Interspeech
A comparison of audio-free speech recognition error prediction methods.
Predicting possible speech recognition errors can be invaluable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to facilitate a comparison between two approaches to predicting confusable words: examining recognition errors on the training set to learn phone confusions and utilizing distances between the phonetic acoustic models for the prediction task. We also expand the framework to deal with continuous word recognition; we can accurately predict 60% of the misrecognized sentences (with an average words-per-sentence count of 15) and a little over 70% of the total number of errors on unseen test data, where no acoustic information related to the test data is utilized.
@inproceedings{jyothi2009comparison,title={A comparison of audio-free speech recognition error prediction methods.},author={Jyothi, Preethi and Fosler-Lussier, Eric},booktitle={Proceedings of Interspeech},pages={1211--1214},year={2009}}