CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2021)
Instructor
Shivaram Kalyanakrishnan
Office: Room 220, New CSE Building
Phone: 7704
E-mail: shivaram@cse.iitb.ac.in
Teaching Assistants
Santhosh Kumar G.
E-mail: santhoshkg@iitb.ac.in
Dibyangshu Mukherjee
E-mail: dbnshu@cse.iitb.ac.in
Keshav Agarwal
E-mail: keshavagarwal@cse.iitb.ac.in
Ashish Aggarwal
E-mail: ashishaggarwal@cse.iitb.ac.in
Debabrata Biswal
E-mail: dbiswal@cse.iitb.ac.in
Mohith Jagalmohanan
E-mail: 203050073@iitb.ac.in
Shashank Shet
E-mail: shashankshet@cse.iitb.ac.in
Venkata Sai Baba Reddy Velugoti
E-mail: velugoti@cse.iitb.ac.in
Arance Kurmi
E-mail: arance@cse.iitb.ac.in
Yash Gadhia
E-mail: 180100130@iitb.ac.in
Rohan Gupta
E-mail: 180010048@iitb.ac.in
Gagan Jain
E-mail: 180100043@iitb.ac.in
Gosula Vinayaka
E-mail: 180050033@iitb.ac.in
Course Description
Today's computing systems are increasingly adaptive and autonomous:
they are akin to intelligent, decision-making "agents". With its roots
in artificial intelligence and machine learning, this course covers
the foundational principles of designing such agents. Topics covered
include: (1) agency, intelligence, and learning; (2) exploration and
multi-armed bandits; (3) Markov Decision Problems and planning; (4)
reinforcement learning; (5) multi-agent systems and multi-agent
learning; and (6) case studies.
The course will adopt a "hands-on" approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies will offer an end-to-end view of deployed
agents. It is hoped that students can apply the learnings from this
course to the benefit of their respective pursuits in various areas of
computer science and related fields.
This offering of the course (in the on-line mode) will be very
similar to
the Autumn
2020 offering. The same video lectures (possibly with
minor changes) will be used. The evaluation pattern will also be
similar.
Prerequisites
The course is open to all Ph.D. students, all masters students, and
undergraduate/dual-degree students in their third (or higher) year of
study.
The course does not formally have other courses as
prerequisites. However, lectures and assignments will assume that the
student is comfortable with probability and
algorithms. Introduction
to Probability by Grinstead and Snell is an excellent resource on
basic probability. Any student who is not comfortable with the
contents of chapters 1 through 7 (and is unable to solve the
exercises) is advised against taking CS 747.
The course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in Python. The student
must be prepared to spend a significant amount of time on the
programming component of the course.
Students who are unsure about their preparedness for taking the
course are strongly advised to watch the lectures from weeks 1, 2, and
3 from
the Autumn
2020 offering, to attempt the quizzes from those weeks, and also
to go through Programming Assignment 1. If they are unable to get a
reasonable grasp of the material or to negotiate the quizzes and
programming assignment, they are advised against taking CS 747.
On-line Mode
- The course will be conducted entirely in on-line mode.
- All lecture slides and instructional videos will be made available on
this page.
- There shall be no synchronous meetings that students are
mandated to attend.
- The instructor will hold office hours in the allotted meeting slot
(Slot 13: 7.00 p.m. – 8.25 p.m. Mondays and Thursdays). No new
material will be presented during these slots.
- Any significant questions and discussions that arise based on
interactions during the office hours will be addressed by the
instructor on Moodle.
- Students are strongly encouraged to keep up with the weekly plan
posted below, and should they have any questions for the instructor,
bring them up through one of the channels listed. Nonetheless, students
who are unable to interact with the instructor on a regular basis
will be at no particular disadvantage. Students who are unable to
access course material may please promptly inform the instructor.
Weekly Plan
Each "week" of the course will run Tuesday through the subsequent Monday.
- Tuesday 12.00 p.m.: Lectures and slides for the week are put up
on this page.
- Tuesday–Saturday: Students watch the videos and make a
note of questions and comments.
- Tuesday–Saturday: Students post their questions and
comments on the week's discussion forum (on Moodle). It is okay to
ask questions based on previous lectures, and to bring up topics of
general interest.
- Office hours (7.00 p.m. – 8.25 p.m. Mondays and
Thursdays):
- The instructor is available on a web-based interaction
platform 7.00 p.m. – 8.00 p.m. on both Mondays and
Thursdays. The web-based interaction will be recorded as a video
and shared with the class through Moodle.
- Students may also request
for the instructor to call them; the instructor makes these
calls 8.00 p.m. – 8.25 p.m. on both Mondays and Thursdays.
- Thursday 11.55 p.m.: A quiz is published based on the
week's material.
- Students submit a response to the week's quiz (handwritten,
scanned into pdf) by 11.55 p.m. Monday.
Details of the web-based interaction, as well as a form for
requesting the instructor to call back, will be provided on Moodle. In
addition, students will be given a feedback form through which they
can communicate issues related to the course at any point of time.
Evaluation
The evaluation will be based on the following three components.
- Quizzes each week for 3–4 marks, contributing a total of 40–45 marks.
- Three programming assignments, each for 11–14 marks, contributing a total of 35–40 marks.
- An end-semester examination for 35–40 marks.
The total marks from the three components will be at least
115. However, letter grades will be assigned by taking 100 as the
denominator. If a student scores X marks in total, the effective
marks for deciding the letter grade will be the minimum of X and
100. The purpose of this scheme is to provide students some amount
of slack, given that many students might be facing difficulties
related to the pandemic and the on-line mode. Students can still
aspire to the maximum grade even if they fail to turn in, or
perform poorly in, a couple of quizzes or a programming
assignment.
All assessments will be based on individual work.
Submissions to the quizzes and the programming assignments must be
turned in through Moodle. Late submissions will not be evaluated; they
will receive no marks.
Students auditing the course must have an effective score of 50 or
more marks in the course to be awarded an "AU" grade.
Moodle
Moodle will be the primary course management system. Marks for the
assessments will be maintained on the class Moodle page; discussion
fora will also be hosted on Moodle. Students who do not have an
account on Moodle for the course must send TA Santhosh Kumar
G. a request by e-mail, specifying the roll number/employee
number for account creation.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Academic violations, as detailed
below, will be dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Students are expected to work alone on all the quizzes, the
programming assignments, and the end-semester examination. While
they are free to discuss the material presented in class with their
peers, they must not discuss the contents of the assessments
(neither the questions, nor the solutions) with classmates (or
anybody other than the instructor and TAs). They must not share
code, even if it only pertains to functionalities that are perceived
not to be relevant to the core logic of the assessment (for example,
file-handling and plotting). They also may not look at solutions to
the given quiz/assignment or related ones on the
Internet. Violations will be considered acts of dishonesty.
Students are allowed to use resources on the Internet for
programming (say to understand a particular command or a data
structure), and also to understand concepts (so a Wikipedia page or
someone's lecture notes or a textbook can certainly be
consulted). It is also okay to use libraries or code snippets for
portions unrelated to the core logic of the
assignment—typically for operations such as moving data,
network communication, etc. However, students must cite every
resource consulted or used, whatever be the reason, in a file
named references.txt, which must be included in the
submission. Failure to list any resource used will be considered an
academic violation.
Copying or consulting any external sources during the examination
will be treated as cheating.
If in any doubt as to what is legitimate collaboration and what is
not, students must ask the instructor.
Texts and References
Reinforcement Learning: An Introduction, Richard S. Sutton and
Andrew G. Barto, 2nd edition, MIT Press,
2018. On-line version.
Algorithms for Reinforcement Learning, Csaba Szepesvári,
Morgan & Claypool,
2009. On-line version.
Selected research papers.
-
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples
William R. Thompson, 1933
-
A Stochastic Approximation Method
Herbert Robbins and Sutton Monro, 1951
-
Probability inequalities for sums of bounded random variables
Wassily Hoeffding, 1963
-
Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
-
Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching
Long-Ji Lin, 1992
-
Practical Issues in Temporal Difference Learning
Gerald Tesauro, 1992
-
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Ronald J. Williams, 1992
-
Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman, 1994
-
On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995
-
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
-
On the Complexity of Policy Iteration
Yishay Mansour and Satinder Singh, 1999
-
Rational and Convergent Learning in Stochastic Games
Michael Bowling and Manuela Veloso, 2001
-
Learning to Trade via Direct Reinforcement
John Moody and Matthew Saffell, 2001
-
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
-
Autonomous helicopter flight via reinforcement learning
Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry, 2003
-
Tree-based Batch Mode Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel, 2005
-
If multi-agent learning is the answer, what is the question?
Yoav Shoham, Rob Powers, and Trond Grenager, 2006
-
Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007
-
Batch Reinforcement Learning in a Complex Domain
Shivaram Kalyanakrishnan and Peter Stone, 2007
-
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
Engin İpek, Onur Mutlu, José F. Martínez, and Rich Caruana, 2008
-
Adaptive Treatment of Epilepsy via Batch-mode Reinforcement Learning
Arthur Guez, Robert D. Vincent, Massimo Avoli, and Joelle Pineau, 2008
-
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
-
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
-
On Optimizing Interdependent Skills: A Case Study in Simulated 3D Humanoid Robot Soccer
Daniel Urieli, Patrick MacAlpine, Shivaram Kalyanakrishnan, Yinon Bentor, and Peter Stone, 2011
-
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel
Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas
K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir
Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan
Wierstra, Shane Legg, and Demis Hassabis, 2015
-
Randomised Procedures for Initialising and Switching Actions in Policy Iteration
Shivaram Kalyanakrishnan, Neeldhara Misra, and Aditya Gopalan, 2016
-
Spatial interactions and optimal forest management on a fire-threatened landscape
Christopher J. Lauer, Claire A. Montgomery, and Thomas G. Dietterich, 2017
-
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis, 2018
-
Five Proofs of Chernoff's Bound with Applications
Wolfgang Mulzer, 2019
-
Optimising a Real-time Scheduler for Railway Lines using Policy Search
Rohit Prasad, Harshad Khadilkar, and Shivaram Kalyanakrishnan, 2020
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will be used for recording grades and for
students to post questions/comments.
E-mail is the best means of communicating with the instructor
outside of office hours; students must send e-mail with "[CS747]" in
the header.
Schedule
-
Week 0 (July 27‐August 2) : Welcome; Introduction to the course.
-
Week 1 (August 3‐9) : Multi-armed Bandits.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Coin-tossing game; Definition of stochastic multi-armed bandit; Definition of algorithm, ε-first and ε-greedy algorithms; Graph of E[r_t] versus t; Definition of regret.
- Reading: Sections 2, 2.1, 2.2, 2.3, Sutton and Barto (2018).
- Resource: coins.py.
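As a rough companion to the Week 1 summary above, here is a minimal sketch of the ε-greedy algorithm on a Bernoulli bandit. It is not the coins.py resource linked above; the function name, arm means, horizon, and the Bernoulli-reward assumption are all illustrative choices.

```python
import numpy as np

def epsilon_greedy(true_means, epsilon=0.1, horizon=10000, rng=None):
    """Run epsilon-greedy on a Bernoulli bandit and return cumulative regret.

    true_means and the Bernoulli-reward assumption are illustrative choices,
    not part of the course material.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_arms = len(true_means)
    counts = np.zeros(n_arms)        # number of pulls of each arm
    estimates = np.zeros(n_arms)     # empirical mean reward of each arm
    total_reward = 0.0
    for t in range(horizon):
        if rng.random() < epsilon:
            arm = rng.integers(n_arms)           # explore uniformly at random
        else:
            arm = int(np.argmax(estimates))      # exploit the current best estimate
        reward = float(rng.random() < true_means[arm])   # Bernoulli reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    # Regret: shortfall relative to always pulling the best arm.
    return horizon * max(true_means) - total_reward

print(epsilon_greedy([0.3, 0.5, 0.7]))
```

Varying ε and averaging over many independent runs gives curves of the kind discussed in the lecture.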
-
Week 2 (August 10‐16) : Multi-armed Bandits.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Achieving sublinear regret with GLIE sampling; Lai and Robbins's lower bound on regret; UCB, KL-UCB, Thompson Sampling algorithms.
- Reading: Section 1, Figure 1, Theorem 1, Auer et al. (2002); Sections 1–3, Garivier and Cappé (2011); Chapelle and Li (2011).
- References: Class Note 1; Theorem 1, Lai and Robbins (1985).
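The Week 2 summary above lists the UCB algorithm of Auer et al. (2002). The helper below is a minimal sketch of its arm-selection rule only; the function name and list-based interface are assumptions made for this example, and a full bandit loop would update the counts and empirical means exactly as in the ε-greedy sketch under Week 1.

```python
import math

def ucb_pull(counts, estimates, t):
    """Return the index of the arm to pull at time t (total pulls so far, t >= 1)
    under the UCB rule of Auer et al. (2002).

    counts[i] is the number of pulls of arm i so far, estimates[i] its empirical
    mean.  The exploration constant 2 follows UCB1; this helper is an illustrative
    sketch, not code distributed with the course.
    """
    for i, c in enumerate(counts):
        if c == 0:
            return i                      # pull each arm once before using the index
    ucb = [estimates[i] + math.sqrt(2.0 * math.log(t) / counts[i])
           for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: ucb[i])
```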
-
Week 3 (August 17‐23) : Multi-armed Bandits.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Hoeffding's Inequality, "KL" Inequality; Proof of upper bound on the regret of UCB; Interpretation of Thompson Sampling; Survey of bandit formulations.
- Reading: Wikipedia page on Hoeffding's Inequality; Proof of Theorem 1, Auer et al. (2002).
- References: Thompson (1933); Hoeffding (1963); Mulzer (2019).
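For reference alongside the Week 3 summary, the two concentration bounds used in the regret analysis can be stated as below; the notation is one common choice and may differ from the lecture slides.

```latex
% X_1, ..., X_n i.i.d. with support in [0, 1], mean \mu, empirical mean \bar{X}_n.
\[
  \Pr\left(\bar{X}_n \ge \mu + \epsilon\right) \;\le\; e^{-2n\epsilon^2}
  \qquad \text{(Hoeffding, 1963)}
\]
% For Bernoulli(\mu) arms, a sharper "KL" (Chernoff) form, for x > \mu, is
\[
  \Pr\left(\bar{X}_n \ge x\right) \;\le\; e^{-n\,\mathrm{KL}(x,\,\mu)},
  \qquad \mathrm{KL}(x,\mu) = x\ln\tfrac{x}{\mu} + (1-x)\ln\tfrac{1-x}{1-\mu}.
\]
```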
-
Week 4 (August 24‐30) : Markov Decision Problems.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Definition of Markov Decision Problem, policy, and value
function; Existence of optimal policy; MDP planning problem;
Continuing and episodic tasks; Infinite-discounted, total,
finite-horizon, and average reward structures; Applications of MDPs;
Bellman's equations; Action value function.
- Reading: Chapter 3, Sutton and Barto (2018).
- References: Section 2.2, Mahadevan (1996); Ng et al. (2003); Lauer et al. (2017).
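As a companion to the Week 4 summary, Bellman's equations for a fixed policy can be written as below; S, A, T, R, and γ denote states, actions, transition probabilities, rewards, and the discount factor in one common notation, which may differ from the symbols used in the lectures.

```latex
\[
  V^{\pi}(s) \;=\; \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} T(s, a, s')
  \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right], \qquad \forall s \in S.
\]
% The action value function fixes the first action instead of sampling it from \pi:
\[
  Q^{\pi}(s, a) \;=\; \sum_{s' \in S} T(s, a, s')
  \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right].
\]
```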
-
Week 5 (August 31‐September 6) : Markov Decision Problems.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Banach's Fixed-point Theorem; Bellman optimality operator; Proof of contraction under max norm; Value iteration; Linear programming.
- Reading: Appendix A, Szepesvári (2009).
- Reference: Littman et al. (1995).
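The Week 5 summary above covers value iteration. Below is a minimal NumPy sketch, assuming the MDP is given as dense arrays T[s, a, s'] and R[s, a, s']; that interface, the function name, and the tolerance are assumptions made for this example.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Value iteration on a finite MDP (illustrative sketch).

    T[s, a, s'] is the transition probability, R[s, a, s'] the reward,
    gamma < 1 the discount factor.  The array-based interface is an
    assumption made for this example, not a format used by the course.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Apply the Bellman optimality operator once.
        Q = np.einsum('sat,sat->sa', T, R + gamma * V[np.newaxis, np.newaxis, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # near-optimal values and a greedy policy
        V = V_new
```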
-
Week 6 (September 7‐September 20) : Markov Decision Problems.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Policy improvement; Bellman operator; Proof of Policy improvement theorem; Policy Iteration family of algorithms; Complexity bounds for MDP planning.
- Reading: Sections 1 and 2, Kalyanakrishnan et al. (2016).
- References: Mansour and Singh (1999); Class Note 2. References from Section 3 of the slides are either listed in the reading (Kalyanakrishnan et al. (2016)) or linked from the instructor's home page.
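Continuing from the value-iteration sketch under Week 5, here is a minimal sketch of Howard's policy iteration under the same assumed array interface; exact policy evaluation is done by solving a linear system.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Howard's policy iteration on a finite MDP (illustrative sketch).

    T[s, a, s'] and R[s, a, s'] are as in the value-iteration sketch above;
    the interface is an assumption made for this example.
    """
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly.
        T_pi = T[np.arange(n_states), policy]                              # (S, S')
        R_pi = np.einsum('st,st->s', T_pi, R[np.arange(n_states), policy])
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Policy improvement: switch every state to a greedy action.
        Q = np.einsum('sat,sat->sa', T, R + gamma * V[np.newaxis, np.newaxis, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```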
-
Week 7 (September 21‐September 27) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: The Reinforcement Learning problem; Upcoming topics; Applications.
- References: Tesauro (1992); Silver et al. (2018); Ng et al. (2003); Mnih et al. (2015); İpek et al. (2008); Guez et al. (2008); Moody and Saffell (2001).
- Lecture 2: Video, Slides.
- Summary: Prediction and control problems; Ergodic MDPs; Model-based algorithm for acting optimally in the limit; Monte Carlo methods for prediction.
- Reading: Class Note 3; Sections 5, 5.1, Sutton and Barto (2018).
- Reference: Wikipedia page on Ergodic Markov chains.
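As a rough illustration of the Monte Carlo prediction method mentioned in the Week 7 summary, the sketch below computes first-visit Monte Carlo value estimates from pre-collected episodes; the episode format, in which each state is paired with the reward received on the transition out of it, is an assumption made for this example.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma):
    """First-visit Monte Carlo estimate of V (illustrative sketch).

    Each episode is assumed to be a list of (state, reward) pairs, where the
    reward is the one received on the transition out of that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Discounted return from each time step, computed backwards.
        returns = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            returns[t] = episode[t][1] + gamma * returns[t + 1]
        # Credit each state only at its first visit in the episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns_sum[state] += returns[t]
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```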
-
Week 8 (September 28‐October 4) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Maximum likelihood estimates and least squares estimates;
On-line implementation of Monte Carlo policy evaluation;
Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch
TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected
Sarsa.
- Reading: Sections 6, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, Sutton and
Barto (2018).
- Reference: Robbins and Monro (1951).
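To accompany the Week 8 summary, the two model-free control updates below are minimal sketches of Q-learning and Sarsa for a tabular action-value array; the function signatures are assumptions made for this example.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update (illustrative sketch; Q is a states x actions array)."""
    target = r + gamma * np.max(Q[s_next])      # bootstrap off the greedy successor value
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa update: bootstrap off the action actually taken next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```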
-
Week 9 (October 5‐October 11) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: n-step returns; TD(λ) algorithm; Need for
generalisation in practice; Soccer as illustrative example; Linear
function approximation and geometric view; Linear TD(λ).
- Reading: Sections 7, 7.1, 9, 9.1, 9.2, 9.3, 9.4, 12, 12.1, 12.2,
Sutton and Barto (2018).
- Reference: Kalyanakrishnan et al. (2007).
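As a companion to the Week 9 summary, here is a minimal sketch of one TD(λ) update with linear function approximation and accumulating eligibility traces; the per-transition interface and argument names are assumptions made for this example.

```python
import numpy as np

def linear_td_lambda_update(w, z, x, x_next, r, alpha, gamma, lam, terminal=False):
    """One TD(lambda) update with linear function approximation (illustrative sketch).

    w: weight vector; z: eligibility-trace vector (zeros at the start of an
    episode); x, x_next: feature vectors of the current and next states.
    """
    v = float(np.dot(w, x))
    v_next = 0.0 if terminal else float(np.dot(w, x_next))
    delta = r + gamma * v_next - v          # TD error
    z = gamma * lam * z + x                 # accumulating eligibility trace
    w = w + alpha * delta * z               # semi-gradient update
    return w, z
```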
-
Week 10 (October 12‐October 18) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Tile coding; Control with function approximation;
Tsitsiklis and Van Roy's counterexample; Policy search; Case
studies: humanoid robot soccer, railway scheduling.
- Reading: Sections 9.5, 9.6, 9.7, 11, 11.1, 11.2, 11.3, Sutton and
Barto (2018).
- References: Urieli et al. (2011), Prasad et al. (2020).
-
Week 11 (October 19‐October 25) : Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Policy gradient methods; Policy gradient theorem; REINFORCE; REINFORCE with a baseline; Actor-critic methods; Batch RL; Experience replay; Fitted Q Iteration.
- Reading: Sections 13, 13.1, 13.2, 13.3, 13.4, 13.5, Sutton and Barto (2018), Kalyanakrishnan and Stone (2007).
- References: Williams (1992), Lin (1992), Ernst et al. (2005).
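To accompany the Week 11 summary, the sketch below performs a REINFORCE update from a single episode for a softmax policy with linear preferences; the episode format, the (actions × features) parameterisation, and the constant baseline are all assumptions made for this example (a learned baseline or critic, as in the lectures, would replace the constant).

```python
import numpy as np

def softmax_policy(theta, x):
    """Action probabilities of a softmax policy with linear preferences theta[a] . x."""
    prefs = theta @ x
    prefs -= prefs.max()                    # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_episode_update(theta, episode, alpha, gamma, baseline=0.0):
    """REINFORCE update from one episode (illustrative sketch).

    episode is assumed to be a list of (x, a, r) triples: feature vector of the
    state, action taken, and reward received on that step.  theta is an
    (actions x features) matrix.
    """
    G = 0.0
    # Work backwards so that G holds the discounted return from each step.
    for t in range(len(episode) - 1, -1, -1):
        x, a, r = episode[t]
        G = r + gamma * G
        probs = softmax_policy(theta, x)
        grad_log = -np.outer(probs, x)      # gradient of log pi(a|x) w.r.t. theta ...
        grad_log[a] += x                    # ... including the indicator term for action a
        theta = theta + alpha * (gamma ** t) * (G - baseline) * grad_log
    return theta
```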
-
Week 12 (October 26‐November 1) : Multiagent Reinforcement Learning.
- Administrative: Video.
- Lecture 1: Video, Slides.
- Summary: Matrix games, Minimax-optimality, Markov games, Minimax-Q algorithm.
- Reading: Littman (1994).
- References: Bowling and Veloso (2001); Shoham, Powers, and Grenager (2006).
Assignments
- Week 1 Quiz, due 11.55 p.m. Monday, August 9.
- Week 2 Quiz, due 11.55 p.m. Monday, August 16.
- Week 3 Quiz, due 11.55 p.m. Monday, August 23.
- Week 4 Quiz, due 11.55 p.m. Monday, August 30.
- Programming Assignment 1, due 11.55 p.m. Thursday, September 9.
- Week 5 Quiz, due 11.55 p.m. Monday, September 6.
- Week 7 Quiz, due 11.55 p.m. Monday, September 27.
- Programming Assignment 2, due 11.55 p.m. Monday, October 11.
- Week 8 Quiz, due 11.55 p.m. Wednesday, October 6.
- Week 9 Quiz, due 11.55 p.m. Wednesday, October 13.
- Week 10 Quiz, due 11.55 p.m. Monday, October 18.
- Programming Assignment 3, due 11.55 p.m. Tuesday, November 2.
- Week 11 Quiz, due 11.55 p.m. Monday, October 25.
- End-semester Examination, due 11.55 p.m. Sunday, November 21.
Copyright
Slides and videos on this page are licensed under
a Creative Commons
Attribution-ShareAlike 4.0 International License. Permission for
their use beyond the scope of the license may be sought by writing to
shivaram@cse.iitb.ac.in.