CS753 is a graduate-level CSE elective that offers an in-depth introduction to automatic speech recognition (ASR), the problem of automatically converting speech into text. The class covers theoretical and practical aspects of the machine learning (ML) techniques employed in large-scale ASR systems. Apart from the classical algorithms that form the basis of statistical speech recognition, it also covers recent deep learning techniques that have achieved state-of-the-art results for speech recognition and related problems in spoken language processing.
Who can take this course: This course is open to 3rd- and 4th-year B.Tech., M.Tech., and Ph.D. students who have passed a formal course in ML (offered by the CSE, EE, or IEOR department). If you do not satisfy the above criteria and are keen on taking this course, please email me at pjyothi [at] cse [dot] iitb [dot] ac [dot] in.
Audit requirements: To audit this course, students will have to complete all the assignments/quizzes and score at least 40% on each of them.
Time: Tuesdays and Fridays, 2:00 pm to 3:25 pm
Venue: LH 102
Instructor: Preethi Jyothi
All assignments should be completed individually. No form of collaboration is allowed unless explicitly permitted by the instructor. Anyone deviating from these standards of academic integrity will be reported to the department's disciplinary committee.
| Date | Topic | Slides | Readings |
|---|---|---|---|
| July 30 | Introduction to Statistical Speech Recognition | Lecture 1 | S. Young, Large vocabulary continuous speech recognition: A review, IEEE Signal Processing Magazine, 1996.<br>(For a refresher in ML basics, go through Part I here.) |
| Aug 6 | HMMs for Acoustic Modeling (Part 1) | Lecture 2 | [JM-2019], Hidden Markov models.<br>L. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, 1989. |
| Aug 9 | HMMs for Acoustic Modeling (Part 2) | Lecture 3 | [JM-2019], Hidden Markov models.<br>J. Bilmes, A gentle tutorial of the EM algorithm and its application to Gaussian-mixture HMMs, 1998. |
| Aug 13 | HMMs and WFSTs | Lecture 4 (pdf) and (html) | (Read Sections 2.1–2.3 and 3) M. Mohri, F. Pereira, M. Riley, Speech recognition with weighted finite-state transducers, Springer Handbook of Speech Processing, 559–584, 2008.<br>[Additional reading] M. Mohri, F. Pereira, M. Riley, The Design Principles of a Weighted Finite-State Transducer Library, Theoretical Computer Science, 231(1): 17–32, 2000. |
| Aug 16 | WFSTs continued | Lecture 5 (pdf) and (html) | [Additional reading] M. Mohri, Semiring frameworks and algorithms for shortest-distance problems, Journal of Automata, Languages and Combinatorics, 7(3): 321–350, 2002. |
| Aug 20 | WFSTs for ASR + Basics of speech production | Lecture 6 | M. Mohri, F. Pereira, M. Riley, Weighted finite-state transducers in speech recognition, Computer Speech and Language, 2001. |
| Aug 23 | Tied-state HMMs + Introduction to NN-based AMs | Lecture 7 | S. J. Young, J. J. Odell, P. C. Woodland, Tree-based state tying for high accuracy acoustic modelling, Proc. of the HLT workshop, ACL, 1994.<br>[Useful reading] J. Zhao, X. Zhang, A. Ganapathiraju, N. Deshmukh, and J. Picone, Tutorial for Decision Tree-Based State Tying for Acoustic Modeling, 1999. |
| Aug 27 | Neural network-based acoustic modeling | Lecture 8 | N. Morgan and H. A. Bourlard, An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition, 1995.<br>H. Hermansky, D. Ellis, and S. Sharma, Tandem Connectionist Feature Extraction for Conventional HMM Systems, ICASSP, 2000.<br>V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, Interspeech, 2015. |
| Aug 30 | Intro to RNN-based models + Language modeling (Part I) | Lecture 9 | [JM-2019], "N-gram Language Models".<br>A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, ICASSP, 2013. |
| Sept 3 | Language modeling (Part II) | Lecture 10 (pdf) and (html) | [JM-2019], "N-gram Language Models".<br>S. F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, Computer Speech and Language, 13, pp. 359–394, 1999. |
| Sept 13 | Pre-midsem Revision | Lecture 11 | — |
| Sept 24 | (R)NN-based language models | Lecture 12 | Papers mentioned in the slides |
| Sept 27 | Acoustic feature analysis for ASR | Lecture 13 | Shared via Moodle |
| Oct 4 | End-to-end neural architectures for ASR | Lecture 14 | A. Graves, N. Jaitly, Towards End-to-End Speech Recognition with Recurrent Neural Networks, ICML, 2014.<br>A. Hannun, Sequence Modeling with CTC, 2017. |
| Oct 9 | End-to-end neural architectures for ASR | Lecture 15 | W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, Listen, Attend and Spell, 2015. |
| Oct 11 | Search and Decoding (Part I) | Lecture 16 | D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10. (Shared via Moodle.) |
| Oct 15 | Search and Decoding (Part II) | Lecture 17 | D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10. (Shared via Moodle.) |
| Oct 18 | Multilingual and low-resource ASR | Lecture 18 | Papers mentioned in the slides |
| Oct 22 | Speech Synthesis | Lecture 19 | Papers mentioned in the slides |
| Oct 25 | Convolutional Neural Networks in Speech | Lecture 20 | Papers mentioned in the slides |
| Nov 1 | Speaker Adaptation | Lecture 21 | Papers mentioned in the slides |
| Nov 5 | Discriminative Training | Lecture 22 | Papers mentioned in the slides |
| Nov 8 | GANs + Practice questions for the final | Lecture 23 | Papers mentioned in the slides |
Website credit: This website is based on a Jekyll template.
Title image: Google iPhone Voice Search: Comic/Animation (CC BY 2.0) by Kevin Cheng.