Course Description

CS753 is a graduate-level CSE elective that offers an in-depth introduction to automatic speech recognition (ASR), the problem of automatically converting speech into text. This class will cover many theoretical and practical aspects of the machine learning (ML) techniques employed in large-scale ASR systems. Apart from teaching the classical algorithms that form the basis of statistical speech recognition, this class will also cover the latest deep learning techniques that have driven state-of-the-art results in speech recognition and related problems in spoken language processing.

Who can take this course: This course is open to 3rd and 4th year B.Tech., M.Tech., and Ph.D. students who have passed a formal course in ML (offered by the CSE, EE, or IEOR department). If you do not satisfy the above criteria and are keen on taking this course, please email me at pjyothi [at] cse [dot] iitb [dot] ac [dot] in.

Audit requirements: To audit this course, students will have to complete all the assignments/quizzes and score at least 40% on each of them.

Course Info

Time: Tuesdays and Fridays, 2:00 pm to 3:25 pm
Venue: LH 102
Instructor: Preethi Jyothi

TAs:
Vinit Unni (email: vinit [at] cse.iitb.ac.in)
Saiteja Nalla (email: saitejan [at] cse.iitb.ac.in)
Debayan Bandyopadhyay (email: debayan [at] cse.iitb.ac.in)
Naman Jain (email: namanjain [at] cse.iitb.ac.in)
Instructor office hours: Scheduled on demand

Course grading

All assignments should be completed individually. No form of collaboration is allowed unless explicitly permitted by the instructor. Anyone deviating from these standards of academic integrity will be reported to the department's disciplinary committee.

    Subject to revision (depending on class strength)

  1. Three assignments OR two assignments + one quiz (35%)
  2. Midsem exam (15%)
  3. Project (20%)
  4. Final exam (25%)
  5. Participation (5%)
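
For concreteness, the sketch below shows how these weights combine into a course total. It assumes each component is scored out of 100 and takes a plain weighted sum; it is illustrative only, not the official grading procedure.

    # Weights from the list above; all component scores here are hypothetical.
    WEIGHTS = {
        "assignments_quizzes": 0.35,
        "midsem": 0.15,
        "project": 0.20,
        "final": 0.25,
        "participation": 0.05,
    }

    def course_total(scores):
        """Weighted course total on a 0-100 scale."""
        return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

    # Example: uniform scores of 80 on every component give a total of 80.0.
    print(course_total({name: 80.0 for name in WEIGHTS}))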

Schedule

Date | Title | Slides | Reading
July 30 | Introduction to Statistical Speech Recognition | Lecture 1 | S. Young, Large vocabulary continuous speech recognition: A review, IEEE Signal Processing Magazine, 1996. (For a refresher in ML basics, go through Part I here.)
Aug 6 | HMMs for Acoustic Modeling (Part 1) | Lecture 2 | [JM-2019], Hidden Markov models; L. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, 1989. (A minimal forward-algorithm sketch appears below this schedule.)
Aug 9 | HMMs for Acoustic Modeling (Part 2) | Lecture 3 | [JM-2019], Hidden Markov models; J. Bilmes, A gentle tutorial of the EM algorithm and its application to Gaussian-mixture HMMs, 1998.
Aug 13 | HMMs and WFSTs | Lecture 4 (pdf, html) | M. Mohri, F. Pereira, M. Riley, Speech recognition with weighted finite-state transducers, Springer Handbook of Speech Processing, pp. 559-584, 2008 (read Sections 2.1-2.3 and 3); [Additional reading] M. Mohri, F. Pereira, M. Riley, The Design Principles of a Weighted Finite-State Transducer Library, Theoretical Computer Science, 231(1):17-32, 2000.
Aug 16 | WFSTs continued | Lecture 5 (pdf, html) | [Additional reading] M. Mohri, Semiring frameworks and algorithms for shortest-distance problems, Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
Aug 20 | WFSTs for ASR + Basics of speech production | Lecture 6 | M. Mohri, F. Pereira, M. Riley, Weighted finite-state transducers in speech recognition, Computer Speech and Language, 2001.
Aug 23 | Tied-state HMMs + Introduction to NN-based AMs | Lecture 7 | S. J. Young, J. J. Odell, P. C. Woodland, Tree-based state tying for high accuracy acoustic modelling, Proc. of the HLT workshop, ACL, 1994; [Useful reading] J. Zhao, X. Zhang, A. Ganapathiraju, N. Deshmukh, J. Picone, Tutorial for Decision Tree-Based State Tying for Acoustic Modeling, 1999.
Aug 27 | Neural network-based acoustic modeling (Hybrid/Tandem/TDNN models) | Lecture 8 | N. Morgan, H. A. Bourlard, An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition, 1995; H. Hermansky, D. Ellis, S. Sharma, Tandem Connectionist Feature Extraction for Conventional HMM Systems, ICASSP, 2000; V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, Interspeech, 2015.
Aug 30 | Intro to RNN-based models + Language modeling (Part I) | Lecture 9 | [JM-2019], "N-gram Language Models"; A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, ICASSP, 2013.
Sept 3 | Language modeling (Part II) | Lecture 10 (pdf, html) | [JM-2019], "N-gram Language Models"; S. F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, Computer Speech and Language, 13, pp. 359-394, 1999.
Sept 13 | Pre-midsem Revision | Lecture 11 | -
Sept 24 | (R)NN-based language models | Lecture 12 | Papers mentioned in the slides
Sept 27 | Acoustic feature analysis for ASR | Lecture 13 | Shared via Moodle
Oct 4 | End-to-end neural architectures for ASR | Lecture 14 | A. Graves, N. Jaitly, Towards End-to-End Speech Recognition with Recurrent Neural Networks, ICML, 2014; A. Hannun, Sequence Modeling with CTC, Distill, 2017.
Oct 9 | End-to-end neural architectures for ASR | Lecture 15 | W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, Listen, Attend and Spell, 2015.
Oct 11 | Search and Decoding | Lecture 16 | D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10 (shared via Moodle).
Oct 15 | Search and Decoding (Part II) | Lecture 17 | D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10 (shared via Moodle).
Oct 18 | Multilingual and low-resource ASR | Lecture 18 | Papers mentioned in the slides
Oct 22 | Speech Synthesis | Lecture 19 | Papers mentioned in the slides
Oct 25 | Convolutional Neural Networks in Speech | Lecture 20 | Papers mentioned in the slides
Nov 1 | Speaker Adaptation | Lecture 21 | Papers mentioned in the slides
Nov 5 | Discriminative Training | Lecture 22 | Papers mentioned in the slides
Nov 8 | GANs + Practice questions for the final | Lecture 23 | Papers mentioned in the slides
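
As referenced in the Aug 6 entry above, the HMM lectures center on the forward-backward and Viterbi algorithms covered in Rabiner's tutorial. Below is a minimal, illustrative NumPy sketch of the forward algorithm; the two-state model and all of its parameters are made up for this example and are not taken from the course material.

    import numpy as np

    def forward(pi, A, B, obs):
        """Total likelihood of an observation sequence under an HMM.

        pi:  (S,)   initial state distribution
        A:   (S, S) transition matrix, A[i, j] = P(state j | state i)
        B:   (S, O) emission matrix, B[i, k] = P(symbol k | state i)
        obs: list of observation indices
        """
        alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]  # alpha_t = (alpha_{t-1} A) * b(o_t)
        return alpha.sum()                 # marginalize over the final state

    # Toy two-state, two-symbol HMM (hypothetical parameters):
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    print(forward(pi, A, B, [0, 1, 0]))    # P(observation sequence | model)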

Course Project

For the final project, students are expected to work in groups of two or three. There will be a preliminary project evaluation phase followed by the final evaluation; the latter will involve a project presentation and a detailed final report. Here are all the project abstracts from a previous offering of the course.

Projects can be on any topic related to spoken language processing. (Projects on audio signal processing will also be permitted.) Every project should have a significant machine learning component. Students can also choose to reimplement techniques from prior work, after consulting with the instructor.

Click here for a list of freely available small sound examples. Here is another list of open speech and language resources.

Resources

All the suggested readings will be freely available online. No single textbook will serve as a reference for this course. ([JM-2019] is a good starting point for anyone interested in this material.)
  1. Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd edition draft, 2019 [JM-2019]
  2. Mark Gales and Steve Young, The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing, 1(3):195-304, 2008.
  3. Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury, Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Website credit: This is based on a Jekyll template.

Title image: Google iPhone Voice Search: Comic/Animation (CC BY 2.0) by Kevin Cheng.