The Vāksañcayaḥ Sanskrit speech corpus has more than 78 hours of data and contains recordings of 45,953 sentences sampled at 22 kHz. The content consists mainly of readings of texts spanning many Śāstras of Sanskrit literature, and also includes contemporary stories, radio programs, extempore discourses, etc.
The summary datasheet associated with this corpus can be accessed here - Link.
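As a rough sense of scale, the figures quoted above imply an average recording length per sentence. The short Python sketch below only performs that arithmetic; the function and variable names are illustrative and not part of the corpus tooling, and because the corpus is described as "more than 78 hours", the result is a lower bound rather than an exact statistic.

```python
# Back-of-the-envelope estimate from the figures quoted in the description.
# Names here are illustrative only; they are not part of any corpus tooling.
TOTAL_HOURS = 78          # stated as "more than 78 hours", so a lower bound
NUM_SENTENCES = 45_953    # number of recorded sentences


def average_sentence_seconds(total_hours: float, num_sentences: int) -> float:
    """Return the mean recording length per sentence, in seconds."""
    return total_hours * 3600 / num_sentences


avg_seconds = average_sentence_seconds(TOTAL_HOURS, NUM_SENTENCES)
print(f"At least {avg_seconds:.1f} s of audio per sentence on average")
```

The estimate comes out at roughly six seconds per sentence, which is consistent with a corpus made up of sentence-level read speech.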
Downloading the corpus
Citing the corpus
Devaraj Adiga, Rishabh Kumar, Amrith Krishna, Preethi Jyothi, Ganesh Ramakrishnan, and Pawan Goyal. "Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights". Proceedings of The 59th Annual Meeting of the Association for Computational Linguistics (ACL Findings), 2021.
Pre-processing and source code
Details about the corpus and the models used in the ASR system can be found in the presentation: Vāksañcayaḥ - A New Speech Corpus And Automatic Speech Recognition in Sanskrit.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
People & Organizations involved
We would like to express our gratitude to
i) The project staff of CISTS: Ashwini P. N., Dr. Dinesh Mohan, and Akash Mathpal, for their help in annotating the speech corpus.
ii) Volunteers who contributed to the corpus by recording various texts.
iii) People who have recorded Sanskrit texts and made them available to the public on the web.
We will soon release human-in-the-loop post-editing tools for enlarging speech corpora, along the lines of similar tools that we have built for optical character recognition.