CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2019)
(Picture source: https://elephantcountry.org/sites/default/files/styles/main_image_800width/public/images/201607/elephantfunny.jpg)
Instructor
Shivaram Kalyanakrishnan
Office: Room 220, New CSE Building
Phone: 7704
Email: shivaram@cse.iitb.ac.in
Teaching Assistants
Sabyasachi Ghosh
Email: 174050001@iitb.ac.in
Sounak Bhattacharya
Email: sounak@cse.iitb.ac.in
Aman Jangde
Email: amanjangde@cse.iitb.ac.in
Vinod Kushwaha
Email: vinod@cse.iitb.ac.in
Class
Lectures will be held during Slot 11: 3.30 p.m. – 4.55 p.m.,
Tuesdays and Fridays, in LA 202. Office hours will immediately
follow class, lasting until 5.30 p.m. on Tuesdays and
Fridays. Meetings can also be arranged by appointment.
Course Description
Today's computing systems are becoming increasingly adaptive and
autonomous: they are akin to intelligent, decision-making
"agents". With its roots in artificial intelligence and machine
learning, this course covers the foundational principles of designing
such agents. Topics covered include: (1) agency, intelligence, and
learning; (2) exploration and multi-armed bandits; (3) Markov Decision
Problems and planning; (4) reinforcement learning; (5) multi-agent
systems and multi-agent learning; and (6) case studies.
The course will adopt a "hands-on" approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies, as well as invited talks from experts, will
offer an "end-to-end" view of deployed agents. It is hoped that
students can apply the lessons from this course to the benefit of
their respective pursuits in various areas of computer science and
related fields.
Prerequisites
The course is open to all Ph.D. students, all masters students, and
undergraduate/dual-degree students in their fourth (or higher) year of
study.
The course does not formally have other courses as
prerequisites. However, class lectures and assignments will assume
that the student is comfortable with probability and
algorithms. Introduction
to Probability by Grinstead and Snell is an excellent resource on
basic probability. Any student who is not comfortable with the
contents of chapters 1 through 7 (and is unable to solve the
exercises) is advised against taking CS 747.
The course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in a language of his/her choice. The
student must be prepared to spend a significant amount of time on the
programming component of the course.
Evaluation
Grades will be based on four programming assignments, each worth 10
marks; a course project worth 20 marks; a mid-semester examination
worth 15 marks; and an end-semester examination worth 25 marks. The
course project will be undertaken in groups of size at most 4; all
other assessments will be based on individual work.
The programming assignments and project must be turned in through
Moodle. Late submissions will not be evaluated and will receive no
marks.
Students auditing the course must score 50 or more marks in the
course to be awarded an "AU" grade.
Getting accounts
Marks for the assessments will be maintained on the class Moodle
page. Students who do not have an account on Moodle must send the
instructor a request by email with their roll number/employee
number.
All programming assignments will be evaluated on the Software
Lab 2 machines. Students who do not have a CSE computing account
should request access on this page.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Academic violations, as detailed
below, will be dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Copying or consulting any external sources during the examinations will
be treated as cheating.
Students are expected to work alone on all the programming
assignments. They may not share code or consult with classmates (or
anybody other than the instructor and TAs) about their solutions. They
also may not look at solutions to the given assignment or related ones
on the Internet. Violations will be considered acts of dishonesty.
Students are allowed to use resources on the Internet for
programming (say to understand a particular command or a data
structure), and also to understand concepts (so a Wikipedia page or
someone's lecture notes or a textbook can certainly be consulted). It
is also okay to use libraries or code snippets for portions unrelated
to the core logic of the assignment—typically for operations
such as moving data, network communication, etc. However, students
must cite every resource consulted or used, whatever the
reason, in a file named references.txt, which must be
included in the submission. Failure to list any source will be
considered an academic violation.
Texts and References
Reinforcement Learning: An Introduction. Richard S. Sutton and
Andrew G. Barto, 2nd edition, 2018. Online version.
Algorithms for Reinforcement Learning. Csaba Szepesvári,
Morgan & Claypool, 2009. Online version.
Selected research papers.

A Stochastic Approximation Method
Herbert Robbins and Sutton Monro, 1951

Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985

Q-learning
Christopher J. C. H. Watkins and Peter Dayan, 1992

Reinforcement Learning with Replacing Eligibility Traces
Satinder P. Singh and Richard S. Sutton, 1996

Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996

Learning to Drive a Bicycle using Reinforcement Learning and Shaping
Jette Randløv and Preben Alstrøm, 1998

Policy invariance under reward transformations: Theory and application to reward shaping
Andrew Y. Ng, Daishi Harada, and Stuart Russell, 1999

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári, 2000

Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002

Bandit based Monte-Carlo Planning
Levente Kocsis and Csaba Szepesvári, 2006

Learning Tetris Using the Noisy CrossEntropy Method
István Szita and András Lőrincz, 2006

Tree-based Batch Mode Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel, 2005

Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007

Batch Reinforcement Learning in a Complex Domain
Shivaram Kalyanakrishnan and Peter Stone, 2007

An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011

Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011

Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel
Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas
K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir
Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan
Wierstra, Shane Legg, and Demis Hassabis, 2015

Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis, 2016

Mastering the game of Go without human knowledge
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis, 2017.


Efficiency Fairness Tradeoff in Battery Sharing
Karan N. Chadha, Ankur A. Kulkarni, and Jayakrishnan Nair, 2019.
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will primarily be used for recording grades.
Email is the best means of communicating with the instructor;
students must send email with "[CS747]" in the header, with a copy
marked to the TAs.
Schedule

August 2: Welcome; Introduction to the course; Coin-tossing game.

August 6: Multi-armed Bandits.
Reading: Sections 2, 2.1, 2.2, 2.3, Sutton and Barto (2018).
Summary: Definition of stochastic multi-armed bandit; Definition of
algorithm, ε-first and ε-greedy algorithms; Graph of
E[r^t] versus t.
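As an illustrative sketch of the ε-greedy rule (not part of the assigned reading; the Bernoulli arm means and the incremental-mean update are assumptions of this toy simulation):

```python
import random

def epsilon_greedy(means, epsilon, horizon, seed=0):
    """Simulate epsilon-greedy on a Bernoulli bandit; return empirical means and pull counts."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n          # pulls per arm
    estimates = [0.0] * n     # empirical mean reward per arm
    for t in range(horizon):
        if rng.random() < epsilon or t < n:   # explore (each arm is pulled once to start)
            arm = rng.randrange(n) if t >= n else t
        else:                                  # exploit the empirically best arm
            arm = max(range(n), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return estimates, counts
```

With a fixed exploration rate ε, a constant fraction of pulls goes to suboptimal arms, which is why the expected per-step reward plateaus below the optimal mean.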

August 9: Multi-armed Bandits.
Reading: Section 1, Figure 1, Theorem 1, Auer et al. (2002).
Summary: Definition of regret; Achieving sublinear regret with GLIE
(Greedy in the Limit with Infinite Exploration) sampling and in
particular, ε_t-greedy sampling; Lai and Robbins's
lower bound on regret; UCB algorithm.
Reference: Class Note 1; Theorem 1, Lai and Robbins (1985).
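The UCB index from Auer et al. (2002) can be sketched as follows (a toy rendering; the function names are illustrative):

```python
import math

def ucb_index(estimate, pulls, t):
    """UCB1 index: empirical mean plus an exploration bonus that shrinks with pulls."""
    return estimate + math.sqrt(2.0 * math.log(t) / pulls)

def choose_arm(estimates, counts, t):
    """Pull each arm once, then pick the arm with the highest UCB index."""
    for a, u in enumerate(counts):
        if u == 0:
            return a
    return max(range(len(estimates)), key=lambda a: ucb_index(estimates[a], counts[a], t))
```

Because the bonus grows with t but shrinks with the number of pulls, an arm that is rarely pulled is eventually tried again even if its empirical mean is low.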

August 13: Multi-armed Bandits.
Reading: Wikipedia page
on Hoeffding's
Inequality; Sections 1–3, Garivier and Cappé (2011);
Chapelle and Li (2011).
Summary: Hoeffding's Inequality; KL-UCB; Thompson Sampling.
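A minimal sketch of Thompson Sampling for Bernoulli arms, assuming Beta(1, 1) priors (the simulation scaffolding here is illustrative, not from the readings):

```python
import random

def thompson_step(successes, failures, rng):
    """One Thompson Sampling decision: sample a mean from each arm's Beta posterior
    and pull the arm whose sample is largest."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

def thompson_run(means, horizon, seed=0):
    """Simulate Thompson Sampling on a Bernoulli bandit; return per-arm pull counts."""
    rng = random.Random(seed)
    n = len(means)
    successes, failures = [0] * n, [0] * n
    for _ in range(horizon):
        arm = thompson_step(successes, failures, rng)
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]
```

As an arm's posterior concentrates around a low mean, its samples rarely win the argmax, so pulls shift toward the better arms automatically.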

August 16: Multi-armed Bandits.
Reading: Slides.
Summary: Proof of upper bound on the regret of UCB.

August 20: Markov Decision Problems.
Reading: Chapter 3, Sutton and Barto (2018).
Summary: Definition of Markov Decision Problem, policy, and value
function; representing MDPs through state transition diagrams;
existence of optimal policy.
Reference: MDP example used in class.

August 23: Markov Decision Problems.
Summary: Reward models (infinite discounted, total, finite horizon,
average); Episodic and continuing tasks; Discounting; Existence of
optimal policy; Derivation of Bellman's Equations.
Reference: Section 2.2, Mahadevan (1996).

August 27: Markov Decision Problems.
Reading: Appendix A, Szepesvári (2009).
Summary: Bellman's Optimality Equations; Bellman Operator and
Bellman Optimality Operator; Banach's Fixed Point Theorem; Proof of
contraction under max norm.

August 30: Markov Decision Problems.
Reading: Slides.
Summary: Value Iteration; Linear Programming for MDP Planning; Policy Improvement Theorem and Policy Iteration.
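Value Iteration amounts to repeated application of the Bellman Optimality Operator until the max-norm change is small; the data layout below (nested lists T[s][a][s'] and R[s][a][s']) is an assumption of this sketch:

```python
def value_iteration(n_states, n_actions, T, R, gamma, tol=1e-8):
    """Iterate V <- B*(V) until the max-norm change falls below tol.
    T[s][a][s2] is a transition probability; R[s][a][s2] a reward.
    Returns the value function and a greedy policy."""
    V = [0.0] * n_states
    while True:
        Q = [[sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [max(Q[s]) for s in range(n_states)]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < tol:
            policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
            return V_new, policy
        V = V_new
```

On a hypothetical two-state MDP where action 1 in state 0 earns reward 1 and leads to an absorbing zero-reward state, this recovers V*(0) = 1 for γ = 0.5; convergence is guaranteed by the γ-contraction of the operator under the max norm.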

September 3: Reinforcement Learning.
Reading: Class Note 2.
Summary: The Reinforcement Learning problem; Assumption on reachability of states; Model-based learning algorithm for acting optimally in the limit.

September 13: Invited Talk, Ankur A. Kulkarni, Efficiency Fairness Tradeoff in Battery Sharing.
Reference: Chadha et al., 2019.

September 19: Mid-semester examination.

September 24: Reinforcement Learning.
Reading: Sections 5, 5.1, 5.2, 5.3, Sutton and Barto (2018); Class Note 3.
Summary: Monte Carlo prediction; First Visit and Every Visit Monte Carlo methods; Monte Carlo ES algorithm for control.

September 27: Invited talk by Harshad Khadilkar: RL
applications in the real world: Some use cases in industrial
operations.
Reference: Slides.

October 1: Reinforcement learning.
Summary: Maximum likelihood estimates and least squares estimates;
online implementation of Monte Carlo policy evaluation;
bootstrapping; TD(0) algorithm; Convergence of batch TD(1) and batch TD(0) algorithms.
Reading: Sections 6, 6.1, 6.2, 6.3, Sutton and Barto (2018).
Reference: Robbins and Monro (1951).
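A minimal sketch of the TD(0) update on logged transitions (the episode encoding, with None marking termination, is an assumption of this sketch):

```python
def td0_evaluate(episodes, alpha, gamma, n_states):
    """TD(0) prediction: after each transition (s, r, s_next), move V(s) toward
    the bootstrapped target r + gamma * V(s_next); s_next is None at episode end."""
    V = [0.0] * n_states
    for episode in episodes:
        for s, r, s_next in episode:
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])  # step along the TD error
    return V
```

Unlike Monte Carlo prediction, each update uses the current estimate V(s_next) rather than waiting for the full return, which is the bootstrapping idea from the summary.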

October 4: Reinforcement learning.
Reading: Sections 7, 7.1, 7.2, 7.3, 12, 12.1, 12.2, 6.4, 6.5, 6.6,
Sutton and Barto (2018).
Summary: n-step returns; TD(λ) algorithm; Model-free control:
Sarsa, Q-learning, Expected Sarsa.
References: Watkins and Dayan (1992), Singh et al. (2000).
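The Sarsa and Q-learning updates from the summary can be sketched side by side (terminal transitions are flagged with s_next = None; this encoding is an assumption):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update: off-policy bootstrap from the greedy value of s_next."""
    target = r + (gamma * max(Q[s_next]) if s_next is not None else 0.0)
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa update: on-policy bootstrap from the action actually taken next."""
    target = r + (gamma * Q[s_next][a_next] if s_next is not None else 0.0)
    Q[s][a] += alpha * (target - Q[s][a])
```

The only difference is the bootstrap term: Q-learning maximises over actions at s_next, while Sarsa uses the behaviour policy's next action, which is what makes one off-policy and the other on-policy.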

October 11: Reinforcement learning.
Reading: Sections 9, 9.1, 9.2, 9.3, 9.4, 11.4, Sutton and Barto
(2018).
Summary: Need for generalisation in practice; soccer as illustrative
example; Linear function approximation and geometric view.
References: Section 2.1, Kalyanakrishnan et al. (2007).

October 15: Reinforcement learning.
Reading: Section 9.5, 12.2, 12.7, Sutton and Barto (2018).
Summary: Linear TD(λ); Tile coding.

October 18: Reinforcement learning.
Reading: Sections 10, 10.1, 10.2, 11, 11.1, 11.2, 11.3, Sutton and
Barto (2018).
Summary: Control with function approximation; Tsitsiklis and Van Roy's
counterexample: divergence of off-policy bootstrapping with linear
function approximation; Policy Search; Comparison of value-function-based
methods and policy search methods in the game of Tetris.
References: Chapter 4, Kalyanakrishnan (2011), Szita and Lőrincz (2006)

October 22: Reinforcement learning.
Reading: Sections 13, 13.1, 13.2, 13.3, Sutton and Barto (2018).
Summary: Brief review of tile coding; Policy gradient methods; Policy gradient theorem; REINFORCE.
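A minimal sketch of REINFORCE with a softmax policy on a two-armed Bernoulli bandit (single-step episodes, so the return equals the reward; all names and constants here are illustrative):

```python
import math
import random

def reinforce_bandit(means, episodes, alpha, seed=0):
    """REINFORCE without a baseline: ascend alpha * G * grad log pi(arm)
    on softmax preferences theta; return the final action probabilities."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        z = [math.exp(t) for t in theta]
        probs = [x / sum(z) for x in z]
        arm = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # grad of log pi(arm) wrt theta[a] is 1{a == arm} - probs[a]
        for a in range(2):
            theta[a] += alpha * reward * ((1.0 if a == arm else 0.0) - probs[a])
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]
```

Because only rewarded pulls move the preferences, the probability mass drifts toward the arm with the higher mean; subtracting a baseline (next lecture) reduces the variance of these updates without changing their expectation.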

October 25: Reinforcement learning.
Reading: Sections 13.4, 13.5, Sutton and Barto (2018), Kalyanakrishnan and Stone (2007).
Summary: REINFORCE with a baseline; Actor-critic methods; Batch RL; Experience replay; Fitted Q Iteration.
Reference: Ernst, Geurts, and Wehenkel (2005).

October 28: No class meeting.
Viewing: The Chip vs The Chess Master; AlphaGo.

November 1: Reinforcement learning.
Reading: Sections 8, 8.1, 8.2, 8.3, 8.4, Sutton and Barto (2018).
Summary: Model-based Reinforcement Learning; Dyna-Q; Tree search.

November 5: Reinforcement Learning.
Reading: Sections 8.10, 8.11, Sutton and Barto (2018); Randløv and Alstrøm (1998); Ng et al. (1999).
Summary: Rollout Algorithms; Monte Carlo Tree Search (UCT); Reward shaping.
Reference: Kocsis and Szepesvári, 2006

November 8: Reinforcement Learning.
Reading: Silver et al. (2016).
Summary: DQN algorithm on Atari games; AlphaGo; AlphaGo Zero; AlphaZero.
Reference: Mnih et al. (2015); Silver et al. (2017); Wikipedia entry on AlphaZero.

November 13: End-semester Examination.
Assignments

Programming Assignment 1 (10 marks).
Announced August 24, 2019; due 11.55 p.m., September 3, 2019.

Programming Assignment 2 (10 marks).
Announced September 5, 2019; due 11.55 p.m., September 15, 2019.

Project Proposal (5 marks).
Announced September 24, 2019; due 11.55 p.m., October 9, 2019.

Programming Assignment 3 (10 marks).
Announced October 18, 2019; due 11.55 p.m., October 28, 2019.

Project (15 marks).
Announced October 18, 2019; due 11.55 p.m., November 23, 2019.

Programming Assignment 4 (10 marks).
Announced October 29, 2019; due 11.55 p.m., November 8, 2019.