CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2020)

(Credits: Samiran Roy. Graphic source: https://github.com/samiranrl/Carrom_rl)

Instructor

  Shivaram Kalyanakrishnan
  Office: Room 220, New CSE Building
  Phone: 7704
  E-mail: shivaram@cse.iitb.ac.in

Teaching Assistants

  Sabyasachi Ghosh
  E-mail: sghosh@cse.iitb.ac.in

  Santhosh Kumar G.
  E-mail: santhoshkg@iitb.ac.in

  Vinod Kushwaha
  E-mail: 193050059@iitb.ac.in

  Saurabh Warade
  E-mail: srwarade@cse.iitb.ac.in

  Pankaj Kumar
  E-mail: pankajkumar@cse.iitb.ac.in

  Smit Gangurde
  E-mail: smitgangurde@cse.iitb.ac.in

  Varshith Polu
  E-mail: poluvarshith@cse.iitb.ac.in

Course Description

Today's computing systems are increasingly adaptive and autonomous: they are akin to intelligent, decision-making "agents". With its roots in artificial intelligence and machine learning, this course covers the foundational principles of designing such agents. Topics covered include: (1) agency, intelligence, and learning; (2) exploration and multi-armed bandits; (3) Markov Decision Problems and planning; (4) reinforcement learning; (5) multi-agent systems and multi-agent learning; and (6) case studies.

The course will adopt a "hands-on" approach, with programming assignments designed to highlight the relationship between theory and practice. Case studies will offer an end-to-end view of deployed agents. It is hoped that students can apply what they learn in this course to the benefit of their respective pursuits in various areas of computer science and related fields.

Prerequisites

The course is open to all Ph.D. students, all master's students, and undergraduate/dual-degree students in their third (or higher) year of study.

The course does not formally have other courses as prerequisites. However, lectures and assignments will assume that the student is comfortable with probability and algorithms. Introduction to Probability by Grinstead and Snell is an excellent resource on basic probability. Any student who is not comfortable with the contents of chapters 1 through 7 (and is unable to solve the exercises) is advised against taking CS 747.

The course has an intensive programming component: based on ideas discussed in class, the student must be able to independently design, implement, and evaluate programs in Python. The student must be prepared to spend a significant amount of time on the programming component of the course.

On-line Mode

The course will be conducted entirely in on-line mode.

All lecture slides and instructional videos will be made available on this page.

There shall be no synchronous meetings that students are mandated to attend.

The instructor will hold office hours in the allotted meeting slot (Slot 13: 7.00 p.m. – 8.25 p.m. Mondays and Thursdays). No new material will be presented during these slots.

Any questions and discussions that arise based on the lectures will be addressed by the instructor in a separate video, which will also be posted on this page.

Students are strongly encouraged to keep up with the weekly plan posted below and, should they have any questions for the instructor, to bring them up through one of the channels listed. Nonetheless, students who are unable to interact with the instructor on a regular basis will be at no particular disadvantage. Students who are unable to access course material should promptly inform the instructor.

Weekly Plan

Wednesday 12.00 p.m.: Lectures and slides for the week are put up on this page.

Wednesday–Sunday: Students watch the videos and make a note of questions and comments.

Wednesday–Sunday: Students post their questions and comments on the week's discussion forum (on Moodle). It is okay to ask questions based on previous lectures, and bring up topics of general interest.

Office hours (7.00 p.m. – 8.25 p.m. Mondays and Thursdays):
- Students with questions call the instructor's office phone (+91 22 2576 7704) 7.00 p.m. – 8.00 p.m. on Thursdays.
- The instructor is available on a web-based interaction platform 7.00 p.m. – 8.00 p.m. Mondays.
- Students may also request that the instructor call them; the instructor makes these calls 8.00 p.m. – 8.25 p.m. on both Mondays and Thursdays.

Friday 11.55 p.m.: A quiz is published based on the week's material.

Tuesday 12.00 p.m.: The instructor uploads slides and (optionally) a video to address the salient questions and comments that came up during the week's interaction.

Students submit a response to the week's quiz (handwritten, scanned into a PDF) by 11.55 p.m. Tuesday.

Details of the web-based interaction, as well as a form for requesting the instructor to call, will be provided on Moodle. In addition, students will be given a feedback form through which they can communicate issues related to the course at any point of time.

Evaluation

Grades will be based on weekly quizzes (each worth 2 marks, with the aggregate capped at 20 marks); four programming assignments, each worth 10 marks; and an end-semester examination worth 40 marks. All assessments will be based on individual work.

Answers to the quizzes and the programming assignments must be turned in through Moodle. Late submissions will not be evaluated; they will receive no marks.

Students auditing the course must score 50 or more marks in the course to be awarded an "AU" grade.
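
As an illustration of the arithmetic in the marking scheme above, here is a minimal sketch in Python. The function names and the sample marks are hypothetical and chosen only for this example; the sketch assumes that the quiz cap applies to the sum of all quiz marks and that the audit threshold is checked against the overall total.

```python
def total_marks(quiz_marks, assignment_marks, exam_marks):
    """Combine component marks under the stated scheme (hypothetical helper)."""
    quiz_total = min(sum(quiz_marks), 20)     # quiz aggregate is capped at 20 marks
    assignment_total = sum(assignment_marks)  # four assignments, each out of 10
    return quiz_total + assignment_total + exam_marks  # exam is out of 40


def audit_passes(total):
    """An auditing student needs 50 or more marks for an "AU" grade."""
    return total >= 50


# Hypothetical marks for one student (not real data).
quizzes = [2, 2, 1.5, 2, 2, 2, 1, 2, 2, 2, 2]  # raw sum is 20.5, so the cap applies
assignments = [8, 9, 7, 10]
exam = 30

total = total_marks(quizzes, assignments, exam)
print(total, audit_passes(total))
```

With the cap, the maximum attainable total is 20 + 40 + 40 = 100 marks.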

Moodle

Moodle will be the primary course management system. Marks for the assessments will be maintained on the class Moodle page; discussion fora will also be hosted on Moodle. Students who do not have an account on Moodle for the course must send TAs Saurabh Warade and Pankaj Kumar a request by e-mail, specifying the roll number/employee number for account creation.

Academic Honesty

Students are expected to adhere to the highest standards of integrity and academic honesty. Academic violations, as detailed below, will be dealt with strictly, in accordance with the institute's procedures and disciplinary actions for academic malpractice.

Students are expected to work alone on all the quizzes and the programming assignments. They may not share code or consult with classmates (or anybody other than the instructor and TAs) regarding their solutions. They also may not look at solutions to the given quiz/assignment or related ones on the Internet. Violations will be considered acts of dishonesty.

Students are allowed to use resources on the Internet for programming (say to understand a particular command or a data structure), and also to understand concepts (so a Wikipedia page or someone's lecture notes or a textbook can certainly be consulted). It is also okay to use libraries or code snippets for portions unrelated to the core logic of the assignment, typically for operations such as moving data, network communication, etc. However, students must cite every resource consulted or used, whatever the reason, in a file named references.txt, which must be included in the submission. Failure to list any resource used will be considered an academic violation.

Copying or consulting any external sources during the examination will be treated as cheating.

Texts and References

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd edition, MIT Press, 2018. On-line version.

Algorithms for Reinforcement Learning, Csaba Szepesvári, Morgan & Claypool, 2009. On-line version.

Selected research papers.

On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples
William R. Thompson, 1933
A Stochastic Approximation Method
Herbert Robbins and Sutton Monro, 1951
Probability inequalities for sums of bounded random variables
Wassily Hoeffding, 1963
Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching
Long-ji Lin, 1992
Practical Issues in Temporal Difference Learning
Gerald Tesauro, 1992
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Ronald J. Williams, 1992
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995
Learning to Trade via Direct Reinforcement
John Moody and Matthew Saffell, 2001
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
Autonomous helicopter flight via reinforcement learning
Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry, 2003
Tree-based Batch Mode Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel, 2005
Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007
Batch Reinforcement Learning in a Complex Domain
Shivaram Kalyanakrishnan and Peter Stone, 2007
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
Engin İpek, Onur Mutlu, José F. Martínez, and Rich Caruana, 2008
Adaptive Treatment of Epilepsy via Batch-mode Reinforcement Learning
Arthur Guez, Robert D. Vincent, Massimo Avoli, and Joelle Pineau, 2008
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
On Optimizing Interdependent Skills: A Case Study in Simulated 3D Humanoid Robot Soccer
Daniel Urieli, Patrick MacAlpine, Shivaram Kalyanakrishnan, Yinon Bentor, and Peter Stone, 2011
Analysis of Thompson Sampling for the multi-armed bandit problem
Shipra Agrawal and Navin Goyal, 2012
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, 2015
Spatial interactions and optimal forest management on a fire-threatened landscape
Christopher J. Lauer, Claire A. Montgomery, and Thomas G. Dietterich, 2017
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis, 2018
Five Proofs of Chernoff's Bound with Applications
Wolfgang Mulzer, 2019
Optimising a Real-time Scheduler for Railway Lines using Policy Search
Rohit Prasad, Harshad Khadilkar, and Shivaram Kalyanakrishnan, 2020

Communication

This page will serve as the primary source of information regarding the course, the schedule, and related announcements. The Moodle page for the course will be used for recording grades and for students to post questions/comments.

E-mail is the best means of communicating with the instructor outside of office hours; students must send e-mail with "[CS747]" in the header.

Schedule

Week 0 (August 10‐16) : Welcome; Introduction to the course.
- Administrative: Video.
- Lecture 1: Video, Slides.
Week 1 (August 17‐23) : Multi-armed Bandits.
- Administrative: Video.
- Lecture 1: Video part 1, Video part 2, Video part 3, Video part 4, Video full, Slides.
- Q&A: Video, Slides.
- Summary: Coin-tossing game; Definition of stochastic multi-armed bandit; Definition of algorithm, ε-first and ε-greedy algorithms; Graph of E[r^t] versus t; Definition of regret.
- Reading: Sections 2, 2.1, 2.2, 2.3, Sutton and Barto (2018).
- Resource: coins.py.
Week 2 (August 24‐30) : Multi-armed Bandits.
- Administrative: Video.
- Lecture 1: Video part 1, Video part 2, Video part 3, Video part 4, Video part 5, Video full, Slides.
- Q&A: Video, Slides.
- Summary: Achieving sublinear regret with GLIE sampling; Lai and Robbins's lower bound on regret; UCB, KL-UCB, Thompson Sampling algorithms.
- Reading: Section 1, Figure 1, Theorem 1, Auer et al. (2002); Sections 1–3, Garivier and Cappé (2011); Chapelle and Li (2011).
- References: Class Note 1; Theorem 1, Lai and Robbins (1985).
Week 3 (August 31‐September 8) : Multi-armed Bandits.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Q&A: Slides.
- Summary: Hoeffding's Inequality, "KL" Inequality; Proof of upper bound on the regret of UCB; Interpretation of Thompson Sampling; Survey of bandit formulations.
- Reading: Wikipedia page on Hoeffding's Inequality; Proof of Theorem 1, Auer et al. (2002).
- References: Thompson (1933); Hoeffding (1963); Mulzer (2019).
Week 4 (September 9‐September 15) : Markov Decision Problems.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Q&A: Video, Slides.
- Summary: Definition of Markov Decision Problem, policy, and value function; Existence of optimal policy; MDP planning problem; Continuing and episodic tasks; Infinite-discounted, total, finite-horizon, and average reward structures; Applications of MDPs; Bellman's equations; Action value function.
- Reading: Chapter 3, Sutton and Barto (2018).
- References: Section 2.2, Mahadevan (1996); Ng et al. (2003); Lauer et al. (2017).
Week 5 (September 16‐September 22) : Markov Decision Problems.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Q&A: Video, Slides.
- Summary: Banach's Fixed-point Theorem; Bellman optimality operator; Proof of contraction under max norm; Value iteration; Linear programming.
- Reading: Appendix A, Szepesvári (2009).
- Reference: Littman et al. (1995).
Week 6 (September 23‐September 29) : Markov Decision Problems.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Policy improvement; Bellman operator; Proof of Policy improvement theorem; Policy Iteration family of algorithms; Complexity bounds for MDP planning.
- Reading: Sections 1 and 2, Kalyanakrishnan et al. (2016).
- References: Mansour and Singh (1999); Class Note 2 (forthcoming). References from Section 3 of the slides are either listed in the reading (Kalyanakrishnan et al. (2016)) or linked from the instructor's home page.
Week 7 (September 30‐October 3 and October 12‐October 13) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: The Reinforcement Learning problem; Upcoming topics; Applications.
- References: Tesauro (1992); Silver et al. (2018); Ng et al. (2003); Mnih et al. (2015); İpek et al. (2008); Guez et al. (2008); Moody and Saffell (2001).
Week 8 (October 14‐October 20) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Prediction and control problems; Ergodic MDPs; Model-based algorithm for acting optimally in the limit; Monte Carlo methods for prediction.
- Reading: Class Note 3; Sections 5, 5.1, Sutton and Barto (2018).
- Reference: Wikipedia page on Ergodic Markov chains.
Week 9 (October 21‐October 27) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Maximum likelihood estimates and least squares estimates; On-line implementation of Monte Carlo policy evaluation; Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected Sarsa.
- Reading: Sections 6, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, Sutton and Barto (2018).
- Reference: Robbins and Monro (1951).
Week 10 (October 28‐November 3) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: n-step returns; TD(λ) algorithm; Need for generalisation in practice; Soccer as illustrative example; Linear function approximation and geometric view; Linear TD(λ).
- Reading: Sections 7, 7.1, 9, 9.1, 9.2, 9.3, 9.4, 12, 12.1, 12.2, Sutton and Barto (2018).
- Reference: Kalyanakrishnan et al. (2007).
Week 11 (November 4‐November 10) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Tile coding; Control with function approximation; Tsitsiklis and Van Roy's counterexample; Policy search; Case studies: humanoid robot soccer, railway scheduling.
- Reading: Sections 9.5, 9.6, 9.7, 11, 11.1, 11.2, 11.3, Sutton and Barto (2018).
- References: Urieli et al. (2011), Prasad et al. (2020).
Week 12 (November 11‐November 17) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Policy gradient methods; Policy gradient theorem; REINFORCE; REINFORCE with a baseline; Actor-critic methods; Batch RL; Experience replay; Fitted Q Iteration.
- Reading: Sections 13, 13.1, 13.2, 13.3, 13.4, 13.5, Sutton and Barto (2018), Kalyanakrishnan and Stone (2007).
- References: Williams (1992), Lin (1992), Ernst et al. (2005).

Assignments

Week 1 Quiz, due 11.55 p.m. Sunday, August 23.
Week 2 Quiz, due 11.55 p.m. Sunday, August 30.
Week 3 Quiz, due 11.55 p.m. Tuesday, September 8.
Programming Assignment 1, due 11.55 p.m. Friday, September 25.
Week 4 Quiz, due 11.55 p.m. Tuesday, September 15.
Week 5 Quiz, due 11.55 p.m. Tuesday, September 22.
Week 6 Quiz, due 11.55 p.m. Wednesday, September 30.
Programming Assignment 2, due 11.55 p.m. Friday, October 23.
Week 8 Quiz, due 11.55 p.m. Tuesday, October 20.
Week 9 Quiz, due 11.55 p.m. Tuesday, October 27.
Programming Assignment 3, due 11.55 p.m. Friday, November 13.
Week 10 Quiz, due 11.55 p.m. Tuesday, November 3.
Week 11 Quiz, due 11.55 p.m. Wednesday, November 11.
Week 12 Quiz, due 11.55 p.m. Wednesday, December 2.
End-semester Examination, due 11.55 p.m. Wednesday, December 2.

Copyright

Slides and videos on this page are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Permission for their use beyond the scope of the license may be sought by writing to shivaram@cse.iitb.ac.in.
