News

  • 5/10/17: Pop quiz 3 was handed out in class today.
  • 9/9/17: Midsem exam will be held in LH102 on Sept 14, 2017, from 3 pm to 5 pm.
  • 14/8/17: All the project abstracts are posted here.
  • 12/8/17: Assignment 1 has been posted. It is due by 11.59 pm on August 24, 2017.
  • 8/8/17: Here's the complete list of project abstracts from the last offering of this course.
  • 31/7/17: Pop quiz 1 was handed out in class today.
  • 12/7/17: Five copies of the book "Speech and Language Processing" are available with Harish at the CSE office. These are reserved books that you can borrow for a day against your ID card.
  • 12/7/17: Welcome to CS 753. Important: There will be no class on July 17th and July 20th. Our first class will be on July 24th.

Course overview

This course offers an in-depth introduction to automatic speech recognition (ASR), the problem of automatically extracting text from human speech. This class will cover many theoretical and practical aspects of machine learning techniques that are employed in large-scale ASR systems. Apart from teaching classical algorithms that form the basis of statistical speech recognition, this class will also cover the latest deep learning techniques that have made important advances in achieving state-of-the-art results for speech recognition.

This course is offered as an elective and is open to B.Tech/M.Tech and Ph.D. students in the CSE and EE departments. It is recommended that students have passed "Foundations of machine learning (CS 725)". You can also sign up for the course if you have completed either "Foundations of Intelligent and Learning Agents (CS 747)" or "Advanced Machine Learning (CS 726)". If you haven't taken any of the above-mentioned courses but have prior experience with ML concepts via research projects, please check with your research guide for permission to enrol in this course and confirm your registration with me.

Here is the previous offering of this course.

Lectures

This section contains links to lecture slides and selected readings relevant to each topic. Slides will be posted here in pdf format, usually right before a given lecture. The html version of the slides, along with any edits/corrections to the pdf version, will be uploaded after the lecture.

Date Slides Topic Readings
July 24 pdf/html Introduction to Statistical Speech Recognition S. Young, Large vocabulary continuous speech recognition: A review, IEEE Signal Processing Magazine, 1996.
If you want a refresher on machine learning basics, go through Part I of the following book: Deep Learning (Goodfellow, Bengio, and Courville)
July 27 pdf/html Introduction to WFSTs and WFST algorithms (Read Sections 2.1-2.3 and 3) M. Mohri, F. Pereira, M. Riley, Speech recognition with weighted finite-state transducers, Springer Handbook of Speech Processing, 559-584, 2008.
[Additional reading] M. Mohri, F. Pereira, M. Riley, The Design Principles of a Weighted Finite-State Transducer Library, Theoretical Computer Science, 231(1): 17-32, 2000.
[Additional reading] M. Mohri, Semiring frameworks and algorithms for shortest-distance problems, Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
July 31 pdf/html WFST algorithms continued (Read Sections 2.4 and 2.5) M. Mohri, F. Pereira, M. Riley, Speech recognition with weighted finite-state transducers, Springer Handbook of Speech Processing, 559-584, 2008.
[Additional reading] For pseudocode and more details on determinization/minimization of WFSAs: M. Mohri, Weighted Automata Algorithms, Handbook of Weighted Automata, Springer Berlin Heidelberg, 213-254, 2009.
Aug 3 pdf/html WFSTs continued + WFSTs in ASR (Required reading) M. Mohri, F. Pereira, M. Riley, Weighted Finite-state Transducers in Speech Recognition, Computer Speech and Language, 16(1):69-88, 2002.
Aug 7 pdf/html Hidden Markov Models (Part I) (Read Sections I to V) Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 77(2), 257-286, 1989.
(Required reading) D. Jurafsky, J. H. Martin, "Chapter 9: Hidden Markov Models", Speech and Language Processing, Draft of November 7, 2016.
Aug 10 pdf/html Hidden Markov Models (Part II) (Required reading) Both articles listed against Aug 7.
[Additional reading] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39(1), 1977.
[Additional reading] J. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, International Computer Science Institute, 1998.
Aug 14 pdf/html Hidden Markov Models (Part III) (Required reading) Both articles listed against Aug 7.
(Required reading) S. J. Young, J. J. Odell, P. C. Woodland, Tree-based state tying for high accuracy acoustic modelling, Proc. of the Workshop on Human Language Technology (HLT), ACL, 1994.
(Useful reading) J. Zhao, X. Zhang, A. Ganapathiraju, N. Deshmukh, and J. Picone, Tutorial for Decision Tree-Based State Tying For Acoustic Modeling, 1999.
Aug 17 pdf/html DNNs in ASR (Required reading) G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 29(6):82-97, 2012.
(Useful reading, chapters 1 and 2) Michael Nielsen, Neural Networks and Deep Learning, Jan 2017.
(Useful reading) N. Morgan and H. A. Bourlard, An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition, 1995.
(Useful reading) H. Hermansky, D. Ellis, and S. Sharma, Tandem Connectionist Feature Extraction for Conventional HMM Systems, Proceedings of ICASSP, 2000.
Aug 21 pdf/html RNN-based models in ASR (Required reading) A. Graves, N. Jaitly, Towards End-to-end Speech Recognition with Recurrent Neural Networks, Proceedings of ICML, 2014.
(Useful reading) Z. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, arXiv preprint arXiv:1506.00019, 2015.
Aug 24 pdf/html Acoustic Feature Extraction for ASR + Basics of speech production (Required reading) D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Section 9.3 Feature extraction: MFCC vectors. (Shared via Moodle.)
[Additional reading] D. Jurafsky, J. H. Martin, "Chapter 7: Phonetics", Speech and Language Processing (2nd edition), 2008.
Aug 28 pdf/html Language Modeling (Part I) (Required reading) D. Jurafsky, J. H. Martin, "Chapter 4: Language Modeling with N-grams", Speech and Language Processing, Draft of November 7, 2016.
[Additional reading] C. Shannon, Prediction and entropy of printed English, 1950.
Aug 31 pdf/html Language Modeling (Part II) (Required reading) D. Jurafsky, J. H. Martin, "Chapter 4: Language Modeling with N-grams", Speech and Language Processing, Draft of November 7, 2016.
(Required reading) S. F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, Computer Speech and Language, 13, pp. 359-394, 1999.
Sep 7 pdf/html Mid-semester revision lecture -
Sep 11, 14 - Midsem week -
Sep 18 pdf/html Language modeling (Part III) (Required reading) H. Schwenk, Continuous space language models, Computer Speech and Language, 21(3), 492-518, 2007.
(Required reading) T. Mikolov et al., Recurrent neural network based language model, Proc. of Interspeech, 2010.
Sep 21 pdf/html Search and decoding (Part I) (Required reading) D. Jurafsky, J. H. Martin, Speech and Language Processing, 1st edition, Chapter 10. (Shared via Moodle.)
Sep 25 pdf/html Search and decoding (Part II) [Additional reading] L. Mangu, E. Brill and A. Stolcke, Finding consensus in speech recognition: word error minimization and other applications of confusion networks, Computer Speech and Language, 14:4, 373-400, 2000.
Sep 28 pdf/html Discriminative Training (Required reading, Sections 1,2,3.2,5) K. Vertanen, An overview of Discriminative Training for Speech Recognition
Oct 5 quiz3 Quiz 3 Quiz 3 was handed out in class and solutions were discussed.
Oct 9 pdf/html Advanced Neural Models for ASR (Useful reading) A. Graves and N. Jaitly, Towards End-to-end Speech Recognition with Recurrent Neural Networks, Proceedings of ICML, 2014.
(Useful reading) A. Maas, Z. Xie, D. Jurafsky, A. Ng, Lexicon-Free Conversational Speech Recognition with Neural Networks, NAACL, 2015.
Oct 12 pdf/html Advanced Neural Models for ASR (Useful reading) Papers mentioned within the slides.
Oct 16 pdf/html Pronunciation Modeling (Useful reading) Papers mentioned within the slides.
Oct 23 pdf/html Speaker Adaptation (Useful reading) Papers mentioned within the slides.
Oct 26 pdf/html Conversational Agents (Required reading) D. Jurafsky, J. H. Martin, "Chapter 29: Dialog Systems and Chatbots", Speech and Language Processing, Draft of August 28, 2017.
(Required reading) D. Jurafsky, J. H. Martin, "Chapter 30: Advanced Dialog Systems", Speech and Language Processing, Draft of August 28, 2017.
Oct 30 pdf/html Quiz 4 + Intro to speech synthesis Quiz 4 was handed out in class and solutions were discussed. Speech synthesis was briefly introduced.
Nov 2 pdf/html Statistical parametric speech synthesis (Part I) (Required reading) H. Zen, K. Tokuda, A. W. Black, Statistical Parametric Speech Synthesis, Speech Communication, 2009.
Nov 6 pdf/html Statistical parametric speech synthesis (Part II) (Useful reading) Z-H Ling, et al., Deep Learning for Acoustic Modeling in Parametric Speech Generation, IEEE Signal Processing Magazine, 2015.

Coursework

Three assignments: 10% + 10% + 10%
Mid-sem exam: 20%
Final project: 25%
Final exam: 25%

Audit requirements: In order to audit this course, students will have to complete all three assignments and score at least 40% on each of them.

Final Project: For the final project, students are expected to work in groups of (preferably) three. Students will be asked to write up a project report that summarizes the methodology adopted along with details of experiments. There will also be a project presentation at the end of the semester. Here are all the project abstracts from the previous offering of the course.

Every project should have a significant machine learning component. Projects can be on any topic related to spoken language processing. (Projects on audio signal processing will also be permitted. See below for a sample list of topics.) Students can also choose to reimplement techniques from prior work, after consulting with the instructor.

Here is a list of speech datasets that students can make use of for their final projects. Please contact the instructor by email to get a copy of any dataset listed here. Click here for a list of freely available small sound examples. Here is another list of open speech and language resources.

Here are some sample project topics to draw inspiration from:

Academic Integrity Policy

Students are expected to abide by the highest standards of academic integrity. All assignments should be completed individually, and no form of collaboration is allowed on the assignments unless explicitly permitted by the instructor. Absolutely no form of collaboration is allowed on the midsem/final exam. Receiving information from other students or from external material during the midsem/final exam is completely unacceptable and will be dealt with seriously, in line with the department's disciplinary policies.

Resources

No single textbook will serve as a reference for this course. Here are some recommended books and articles:

Many more relevant references will be provided along with the lecture slides linked above.

We will make use of the open-source speech recognition toolkit, Kaldi, in one of the assignments. OpenFst is an open-source library that supports weighted finite-state transducers -- students could use this library to get more comfortable with WFSTs and WFST-based operations.
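
To get a concrete feel for the kind of operations such toolkits provide, here is a minimal toy sketch in plain Python (illustrative code written for this page, not the OpenFst API): it builds a tiny weighted acceptor over the tropical semiring, where weights act like negative log probabilities, and finds its lowest-cost accepting path -- the shortest-path operation discussed in the WFST lectures.

    # Toy sketch for this page (not the OpenFst API): a small weighted acceptor
    # over the tropical semiring, where weights behave like negative log
    # probabilities, "plus" is min and "times" is +. Finding the lowest-cost
    # accepting path is the shortest-path operation WFST-based decoders rely on.

    import heapq
    import itertools

    START = 0
    SUPER_FINAL = "final"

    # arcs[state] = list of (label, weight, next_state); final weights are
    # modelled here as extra epsilon-like arcs into a single super-final state.
    arcs = {
        0: [("h", 0.5, 1), ("h", 2.0, 2)],    # two competing paths for the same string
        1: [("i", 0.25, 3)],
        2: [("i", 0.125, 3)],
        3: [(None, 0.0, SUPER_FINAL)],        # state 3 is final with weight 0.0
        SUPER_FINAL: [],
    }

    def shortest_path(arcs, start):
        """Dijkstra-style search for the minimum-cost accepting path (tropical semiring)."""
        counter = itertools.count()           # tie-breaker so the heap never compares states
        queue = [(0.0, next(counter), start, [])]
        visited = set()
        while queue:
            cost, _, state, labels = heapq.heappop(queue)
            if state in visited:
                continue
            visited.add(state)
            if state == SUPER_FINAL:
                return cost, labels           # first final state popped is optimal
            for label, weight, nxt in arcs[state]:
                new_labels = labels if label is None else labels + [label]
                heapq.heappush(queue, (cost + weight, next(counter), nxt, new_labels))
        return None                           # no accepting path

    print(shortest_path(arcs, START))         # -> (0.75, ['h', 'i'])

OpenFst itself provides these operations (composition, determinization, minimization, shortest path) directly over its FST classes and command-line tools; the sketch above only mirrors their semantics on a hand-built example.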