Vāksañcayaḥ - Sanskrit speech corpus has more than 78 hours of data and contains recordings of 45,953 sentences at a sampling rate of 22 kHz. The content consists mainly of readings of various texts spanning many Śāstras of Sanskrit literature, and also includes contemporary stories, radio programs, extempore discourses, etc.
The summary datasheet associated with this corpus can be accessed here - Link.
Fill in the form to download the corpus - Form Link

If you use this corpus or its derivative resources for your research, kindly cite it as follows:
Devaraj Adiga, Rishabh Kumar, Amrith Krishna, Preethi Jyothi, Ganesh Ramakrishnan, and Pawan Goyal. "Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights". Proceedings of The 59th Annual Meeting of the Association for Computational Linguistics (ACL Findings), 2021.

The audio files are in MP3 format. For training ASR systems, they can be converted to WAV format using the script provided along with the corpus. Source code of the trained ASR model can be downloaded here - GitHub.
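
The MP3-to-WAV conversion step can be sketched roughly as follows. This is not the script shipped with the corpus; it is a minimal illustration that assumes the ffmpeg command-line tool is installed and on the PATH. The function names `build_ffmpeg_command` and `convert_directory`, the directory layout, and the default sampling rate of 22050 Hz (one reading of "22 kHz") are assumptions made for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(mp3_path: Path, wav_path: Path, sample_rate: int = 22050) -> List[str]:
    """Build the ffmpeg argument list for converting one MP3 file to mono 16-bit PCM WAV.

    The default sample_rate is an assumption; adjust it to what your ASR toolkit expects.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", str(mp3_path),      # input MP3 file
        "-ar", str(sample_rate),  # output sampling rate in Hz
        "-ac", "1",               # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",      # encode as 16-bit little-endian PCM
        str(wav_path),
    ]


def convert_directory(src_dir: Path, dst_dir: Path, sample_rate: int = 22050) -> List[Path]:
    """Convert every *.mp3 file directly under src_dir into a .wav file in dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for mp3_path in sorted(src_dir.glob("*.mp3")):
        wav_path = dst_dir / (mp3_path.stem + ".wav")
        # check=True raises CalledProcessError if ffmpeg exits with an error
        subprocess.run(build_ffmpeg_command(mp3_path, wav_path, sample_rate), check=True)
        written.append(wav_path)
    return written
```

Calling `convert_directory(Path("mp3"), Path("wav"))` would write one WAV file per MP3 file, where `mp3` and `wav` are placeholder directory names. If the script bundled with the corpus is available, it should be preferred, since it reflects the settings the authors used.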
Details about the corpus and the models used in the ASR system can be viewed in the presentation: Vāksañcayaḥ - A New Speech Corpus And Automatic Speech Recognition in Sanskrit.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

This corpus was developed with the support and guidance of Prof. K. Ramasubramanian of the Cell for Indian Science and Technology in Sanskrit (CISTS), Department of HSS, IIT Bombay. This work was conceptualised and implemented by Devaraja Adiga at CISTS. The first-ever large-scale ASR study for Sanskrit using this corpus was carried out by Devaraja Adiga, Rishabh Kumar and Dr. Amrith Krishna under the supervision of Prof. Ganesh Ramakrishnan and Prof. Preethi Jyothi from the Department of CSE, IIT Bombay, with insights from Prof. Pawan Goyal, IIT Kharagpur.
We would like to express our gratitude to:
i) The project staff of CISTS: Ashwini P. N., Dr. Dinesh Mohan and Akash Mathpal, for their help in annotating the speech corpus.
ii) The volunteers who contributed to the corpus by recording various texts.
iii) The people who have recorded Sanskrit texts and made them available to the public on the web.
Those who can speak Saṃskṛtam fluently can read the given texts and contribute to enlarging this corpus. Those who are interested in annotating the recorded corpus are also welcome to participate. Please write to Devaraja Adiga for details.
We will soon release human-in-the-loop post-editing tools for enlarging speech corpora, along the lines of similar tools that we have built for optical character recognition.

Devaraja Adiga


Rishabh Kumar


Prof. Ganesh Ramakrishnan