Course Description

CS753 is a graduate-level CSE elective that offers an in-depth introduction to automatic speech recognition (ASR), the problem of automatically converting speech into text. This class will cover many theoretical and practical aspects of the machine learning (ML) techniques employed in large-scale ASR systems. Apart from teaching the classical algorithms that form the basis of statistical speech recognition, this class will also cover the latest deep learning techniques that have driven state-of-the-art results in speech recognition and related problems in spoken language processing.

Who can take this course: This course is open to 3rd and 4th year B.Tech., M.Tech., and Ph.D. students who have passed a formal course in ML (offered by the CSE, EE, or IEOR department). If you do not satisfy the above criteria and are keen on taking this course, please email me at pjyothi [at] cse [dot] iitb [dot] ac [dot] in.

Audit requirements: To audit this course, students will have to complete all the assignments/quizzes and score at least 40% on each of them.

Course Info

Time: Tuesdays and Fridays, 2:00 pm to 3:25 pm
Venue: LH 102
Instructor: Preethi Jyothi

TAs:
Vinit Unni (email: vinit [at] cse.iitb.ac.in)
Saiteja Nalla (email: saitejan [at] cse.iitb.ac.in)
Debayan Bandyopadhyay (email: debayan [at] cse.iitb.ac.in)
Naman Jain (email: namanjain [at] cse.iitb.ac.in)
Instructor office hours: Scheduled on demand

Course grading

All assignments should be completed individually. No form of collaboration is allowed unless explicitly permitted by the instructor. Anyone deviating from these standards of academic integrity will be reported to the department's disciplinary committee.

    Subject to revision (depending on class strength)

  1. Three assignments OR two assignments + one quiz (35%)
  2. Midsem exam (15%)
  3. Project (20%)
  4. Final exam (25%)
  5. Participation (5%)
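
For concreteness, the sketch below shows how these weights combine into a course total. It assumes each component is scored out of 100 and takes a plain weighted sum; it is illustrative only, not the official grading procedure.

    # Weights from the list above; all component scores here are hypothetical.
    WEIGHTS = {
        "assignments_quizzes": 0.35,
        "midsem": 0.15,
        "project": 0.20,
        "final": 0.25,
        "participation": 0.05,
    }

    def course_total(scores):
        """Weighted course total on a 0-100 scale."""
        return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

    # Example: uniform scores of 80 on every component give a total of 80.0.
    print(course_total({name: 80.0 for name in WEIGHTS}))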

Schedule

Date | Title | Slides | Reading
July 30 | Introduction to Statistical Speech Recognition | Lecture 1 | S. Young, Large vocabulary continuous speech recognition: A review, IEEE Signal Processing Magazine, 1996. (For a refresher in ML basics, go through Part I here.)
Aug 6 | HMMs for Acoustic Modeling (Part 1) | Lecture 2 | [JM-2019], Hidden Markov models; L. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, 1989. (A minimal forward-algorithm sketch appears below this schedule.)
Aug 9 | HMMs for Acoustic Modeling (Part 2) | Lecture 3 | [JM-2019], Hidden Markov models; J. Bilmes, A gentle tutorial of the EM algorithm and its application to Gaussian-mixture HMMs, 1998.
Aug 13 | HMMs and WFSTs | Lecture 4 (pdf, html) | M. Mohri, F. Pereira, M. Riley, Speech recognition with weighted finite-state transducers, Springer Handbook of Speech Processing, pp. 559-584, 2008 (read Sections 2.1-2.3 and 3); [Additional reading] M. Mohri, F. Pereira, M. Riley, The Design Principles of a Weighted Finite-State Transducer Library, Theoretical Computer Science, 231(1):17-32, 2000.
Aug 16 | WFSTs continued | Lecture 5 (pdf, html) | [Additional reading] M. Mohri, Semiring frameworks and algorithms for shortest-distance problems, Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
Aug 20 | WFSTs for ASR + Basics of speech production | Lecture 6 | M. Mohri, F. Pereira, M. Riley, Weighted finite-state transducers in speech recognition, Computer Speech and Language, 2001.
Aug 23 | Tied-state HMMs + Introduction to NN-based AMs | Lecture 7 | S. J. Young, J. J. Odell, P. C. Woodland, Tree-based state tying for high accuracy acoustic modelling, Proc. of the HLT workshop, ACL, 1994; [Useful reading] J. Zhao, X. Zhang, A. Ganapathiraju, N. Deshmukh, J. Picone, Tutorial for Decision Tree-Based State Tying for Acoustic Modeling, 1999.
Aug 27 | Neural network-based acoustic modeling (Hybrid/Tandem/TDNN models) | Lecture 8 | N. Morgan, H. A. Bourlard, An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition, 1995; H. Hermansky, D. Ellis, S. Sharma, Tandem Connectionist Feature Extraction for Conventional HMM Systems, ICASSP, 2000; V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, Interspeech, 2015.
Aug 30 | Intro to RNN-based models + Language modeling (Part I) | Lecture 9 | [JM-2019], "N-gram Language Models"; A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, ICASSP, 2013.
Sept 3 | Language modeling (Part II) | Lecture 10 (pdf, html) | [JM-2019], "N-gram Language Models"; S. F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, Computer Speech and Language, 13, pp. 359-394, 1999.
Sept 13 | Pre-midsem Revision | Lecture 11 | -
Sept 24 | (R)NN-based language models | Lecture 12 | Papers mentioned in the slides
Sept 27 | Acoustic feature analysis for ASR | Lecture 13 | Shared via Moodle
Oct 4 | End-to-end neural architectures for ASR | Lecture 14 | A. Graves, N. Jaitly, Towards End-to-End Speech Recognition with Recurrent Neural Networks, ICML, 2014; A. Hannun, Sequence Modeling with CTC, Distill, 2017.
Oct 9 | End-to-end neural architectures for ASR | Lecture 15 | W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, Listen, Attend and Spell, 2015.
Oct 11 | Search and Decoding | Lecture 16 | D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10 (shared via Moodle).
Oct 15 | Search and Decoding (Part II) | Lecture 17 | D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10 (shared via Moodle).
Oct 18 | Multilingual and low-resource ASR | Lecture 18 | Papers mentioned in the slides
Oct 22 | Speech Synthesis | Lecture 19 | Papers mentioned in the slides
Oct 25 | Convolutional Neural Networks in Speech | Lecture 20 | Papers mentioned in the slides
Nov 1 | Speaker Adaptation | Lecture 21 | Papers mentioned in the slides
Nov 5 | Discriminative Training | Lecture 22 | Papers mentioned in the slides
Nov 8 | GANs + Practice questions for the final | Lecture 23 | Papers mentioned in the slides
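
As referenced in the Aug 6 entry above, the HMM lectures center on the forward-backward and Viterbi algorithms covered in Rabiner's tutorial. Below is a minimal, illustrative NumPy sketch of the forward algorithm; the two-state model and all of its parameters are made up for this example and are not taken from the course material.

    import numpy as np

    def forward(pi, A, B, obs):
        """Total likelihood of an observation sequence under an HMM.

        pi:  (S,)   initial state distribution
        A:   (S, S) transition matrix, A[i, j] = P(state j | state i)
        B:   (S, O) emission matrix, B[i, k] = P(symbol k | state i)
        obs: list of observation indices
        """
        alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]  # alpha_t = (alpha_{t-1} A) * b(o_t)
        return alpha.sum()                 # marginalize over the final state

    # Toy two-state, two-symbol HMM (hypothetical parameters):
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    print(forward(pi, A, B, [0, 1, 0]))    # P(observation sequence | model)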

Course Project

For the final project, students are expected to work in groups of two or three. There will be a preliminary project evaluation phase followed by the final evaluation; the latter will involve a project presentation and a detailed final report. Here are all the project abstracts from a previous offering of the course.

Projects can be on any topic related to spoken language processing. (Projects on audio signal processing will also be permitted.) Every project should have a significant machine learning component. Students can also choose to reimplement techniques from prior work, after consulting with the instructor.

Click here for a list of freely available small sound examples. Here is another list of open speech and language resources.

Resources

All the suggested readings will be freely available online. No single textbook will serve as a reference for this course. ([JM-2019] is a good starting point for anyone interested in this material.)
  1. Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd edition draft, 2019 [JM-2019]
  2. Mark Gales and Steve Young, The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing, 1(3):195-304, 2008.
  3. Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury, Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Website credit: This is based on a Jekyll template.

Title image: Google iPhone Voice Search: Comic/Animation (CC BY 2.0) by Kevin Cheng.