CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2017)
(Picture source: Learning to Drive a Bicycle using Reinforcement Learning and Shaping, Randløv and Alstrøm, 1998.)
Instructor
Shivaram Kalyanakrishnan
Office: Room 220, New CSE Building
Phone: 7704
E-mail: shivaram@cse.iitb.ac.in
Teaching Assistants
Abhilash Panicker
E-mail: abhilashp@cse.iitb.ac.in
Pragy Agarwal
E-mail: agar.pragy@gmail.com
Prashanth M
E-mail: 163050043@iitb.ac.in
Class
Lectures will be held in Room 103, New CSE Building during Slot 6:
11.05 a.m. – 12.30 p.m. Wednesdays and Fridays.
Office hours will immediately follow class, running until 1.15
p.m. on Wednesdays and Fridays. Meetings can also be arranged by
appointment.
Course Description
Today's computing systems are becoming increasingly adaptive and
autonomous: they are akin to intelligent, decision-making
``agents''. With its roots in artificial intelligence and machine
learning, this course covers the foundational principles of designing
such agents. Topics covered include: (1) agency, intelligence, and
learning; (2) exploration and multi-armed bandits; (3) Markov Decision
Problems and planning; (4) reinforcement learning; (5) multi-agent
systems and multi-agent learning; and (6) case studies.
The course will adopt a ``hands-on'' approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies, as well as invited talks from experts, will
offer an ``end-to-end'' view of deployed agents. It is hoped that
students will be able to apply what they learn in this course to
their respective pursuits in various areas of computer science and
related fields.
Prerequisites
The course is open to all Ph.D. students, all masters students, and
undergraduate/dual-degree students in their fourth (or higher) year of
study.
The course does not formally have other courses as
prerequisites. However, class lectures and assignments will assume
that the student is comfortable with probability and
algorithms. Introduction
to Probability by Grinstead and Snell is an excellent resource on
basic probability. Any student who is not comfortable with the
contents of chapters 1 through 7 (and is unable to solve the
exercises) is advised against taking CS 747.
The course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in a language of his/her choice. The
student must be prepared to spend a significant amount of time on the
programming component of the course.
Evaluation
Grades will be based on four programming assignments, each
worth 10 marks; a course project worth 20 marks; a mid-semester
examination worth 15 marks; and an end-semester examination worth 25
marks.
The programming assignments and project must be turned in through
Moodle. Late submissions will not be evaluated; they will receive no
marks.
Students auditing the course must score 50 or more marks in the
course to be awarded an ``AU'' grade.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Acts such as copying in the
examinations and sharing code for the programming assignments will be
dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Texts and References
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd edition (in progress), 2017. On-line version.
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, MIT Press, 1998. On-line version.
Algorithms for Reinforcement Learning, Csaba Szepesvári, Morgan & Claypool, 2009. On-line version.
Selected research papers.
-
A Stochastic Approximation Method
Herbert Robbins and Sutton Monro, 1951
-
Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
-
Q-learning
Christopher J. C. H. Watkins and Peter Dayan, 1992
-
Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman, 1994
-
Foraging in an Uncertain Environment Using Predictive Hebbian Learning
P. Read Montague, Peter Dayan, and Terrence J. Sejnowski, 1994
-
On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995
-
Improving Elevator Performance Using Reinforcement Learning
Robert H. Crites and Andrew G. Barto, 1996
-
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
-
Dopamine neurons report an error in the temporal prediction of reward during learning
Jeffrey R. Hollerman and Wolfram Schultz, 1998
-
Learning to Drive a Bicycle using Reinforcement Learning and Shaping
Jette Randløv and Preben Alstrøm, 1998
-
On the Complexity of Policy Iteration
Yishay Mansour and Satinder Singh, 1999
-
Evolutionary Algorithms for Reinforcement Learning
David E. Moriarty, Alan C. Schultz, and John J. Grefenstette, 1999
-
Policy invariance under reward transformations: Theory and application to reward shaping
Andrew Y. Ng, Daishi Harada, and Stuart Russell, 1999
-
Algorithms for Inverse Reinforcement Learning
Andrew Y. Ng and Stuart Russell, 2000
-
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári, 2000
-
Rational and Convergent Learning in Stochastic Games
Michael Bowling and Manuela Veloso, 2001
-
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
-
Nash Q-Learning for General-Sum Stochastic Games
Junling Hu and Michael P. Wellman, 2003
-
Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng, 2004
-
Tree-based Batch Mode Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel, 2005
-
Bandit based Monte-Carlo Planning
Levente Kocsis and Csaba Szepesvári, 2006
-
If multi-agent learning is the answer, what is the question?
Yoav Shoham, Rob Powers, and Trond Grenager, 2006
-
Learning Tetris Using the Noisy Cross-Entropy Method
István Szita and András Lőrincz, 2006
-
Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007
-
Batch Reinforcement Learning in a Complex Domain
Shivaram Kalyanakrishnan and Peter Stone, 2007
-
Efficient Selection of Multiple Bandit Arms: Theory and Practice
Shivaram Kalyanakrishnan and Peter Stone, 2010
-
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
-
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
-
Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011
-
Characterizing Reinforcement Learning Methods through Parameterized Learning Problems
Shivaram Kalyanakrishnan and Peter Stone, 2011
-
PAC Subset Selection in Stochastic Multi-armed Bandits
Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone, 2012
-
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012
-
Understanding Machine Learning: From Theory to Algorithms
Shai Shalev-Shwartz and Shai Ben-David, 2014
-
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel
Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas
K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir
Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan
Wierstra, Shane Legg, and Demis Hassabis, 2015
-
Batch-Switching Policy Iteration
Shivaram Kalyanakrishnan, Utkarsh Mall, and Ritish Goyal, 2016a
-
Randomised Procedures for Initialising and Switching Actions in Policy Iteration
Shivaram Kalyanakrishnan, Neeldhara Misra, and Aditya Gopalan, 2016b
-
Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis, 2016
-
Mastering the game of Go without human knowledge
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis, 2017
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will be used for sharing resources for the lectures and
assignments, and also for recording grades.
E-mail is the best means of communicating with the instructor;
students must send e-mail with ``[CS747]'' in the header, with a copy
marked to the TAs.
Schedule
-
July 19: Welcome, Introduction to the course.
-
July 21: Multi-armed Bandits.
Reading: Sections 2, 2.1, 2.2, 2.3, Sutton and Barto (2017).
Summary: Coin-tossing game, definition of stochastic multi-armed
bandit, definition of algorithm, ε-greedy algorithm and
variants.
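As an illustration of the ε-greedy rule covered in this lecture, here is a minimal sketch on a simulated Bernoulli bandit; the arm means, ε value, and horizon are illustrative assumptions rather than values from the lecture.

import random

def epsilon_greedy(true_means, epsilon=0.1, horizon=10000):
    """Run epsilon-greedy on a simulated Bernoulli bandit."""
    n = len(true_means)
    counts = [0] * n          # number of pulls of each arm
    estimates = [0.0] * n     # empirical mean reward of each arm
    total_reward = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(n)                        # explore uniformly at random
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit the current best arm
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean update
        total_reward += reward
    return estimates, total_reward

For example, epsilon_greedy([0.3, 0.5, 0.7]) should concentrate most pulls on the third arm while still spending roughly an ε fraction of its pulls exploring.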
-
July 26: Multi-armed Bandits.
Reading: Sections 1, 2, Auer et al. (2002).
Summary: Graph of E[r^{t}] versus t, definition of regret,
achieving sublinear regret with ε_t-greedy sampling, Lai and
Robbins's lower bound on regret, UCB algorithm.
Reference: Theorem 1, Lai and Robbins (1985).
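For concreteness, a sketch of the UCB algorithm of Auer et al. (2002) on the same kind of simulated Bernoulli bandit, assuming rewards bounded in [0, 1]; the exploration term below follows the UCB1 index.

import math
import random

def ucb1(true_means, horizon=10000):
    """UCB1: after pulling each arm once, always pull the arm with the highest index."""
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n

    def pull(arm):
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    for arm in range(n):                 # initialise by pulling each arm once
        pull(arm)
    for t in range(n, horizon):
        index = [estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(n)]
        pull(max(range(n), key=lambda a: index[a]))
    return estimates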
-
July 28: Multi-armed Bandits.
Reading: Wikipedia page
on Hoeffding's
Inequality, Proof of Theorem 1, Auer et al. (2002).
Summary: Hoeffding's Inequality, crux of proof of regret bound for UCB (bounding the
probability of pulling a sub-optimal arm beyond a threshold number of pulls).
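As a quick sanity check of Hoeffding's Inequality, the simulation sketched below (an illustrative aside, not part of the assigned reading) compares the empirical frequency of the event {sample mean − μ ≥ ε} with the bound exp(−2nε²) for i.i.d. Bernoulli(μ) samples.

import math
import random

def hoeffding_check(mu=0.5, n=100, eps=0.1, trials=10000):
    """Estimate P(sample mean - mu >= eps) and compare with exp(-2 n eps^2)."""
    exceed = 0
    for _ in range(trials):
        sample_mean = sum(1.0 if random.random() < mu else 0.0 for _ in range(n)) / n
        if sample_mean - mu >= eps:
            exceed += 1
    return exceed / trials, math.exp(-2.0 * n * eps * eps)

The empirical frequency (first return value) should come out well below the Hoeffding bound (second return value, about 0.135 for these parameters).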
-
August 2: Multi-armed Bandits.
Reading: Sections 1–3, Garivier and Cappé (2011),
Chapelle and Li (2011).
Summary: Second part of proof of regret bound for UCB (bounding the
expectation of the number of pulls of a sub-optimal arm), KL-UCB,
Thompson Sampling, discussion of bandit variations (adversarial setting, pure
exploration, modeling dependence between arms, nonstationarity).
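A sketch of Thompson Sampling for Bernoulli arms with Beta(1, 1) priors, in the spirit of Chapelle and Li (2011); the simulated arm means and horizon are again illustrative assumptions.

import random

def thompson_sampling(true_means, horizon=10000):
    """Thompson Sampling: sample a mean from each arm's Beta posterior; play the best sample."""
    n = len(true_means)
    successes = [1] * n      # alpha parameters of the Beta posteriors
    failures = [1] * n       # beta parameters of the Beta posteriors
    for _ in range(horizon):
        samples = [random.betavariate(successes[a], failures[a]) for a in range(n)]
        arm = max(range(n), key=lambda a: samples[a])
        if random.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures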
-
August 4: MDP Planning.
Reading: Section 1, Kalyanakrishnan et al. (2016b).
Summary: Definition of Markov Decision Problem, policy, and value
function, representing MDPs through state transition diagrams, reward
models (infinite discounted, total, average, finite horizon),
existence of optimal policy.
Reference: Section 2.2, Mahadevan (1996).
-
August 9: MDP Planning.
Reading: Chapter 3, Sutton and Barto (2017).
Summary: Design of states, actions, and rewards; illustrative examples (soccer,
elevator control), derivation of Bellman's Equations, solution of
Bellman's Equations by variable elimination and by iteration,
Bellman's Optimality Equations.
Reference: Sections 1 and 2, Crites and Barto (1996).
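To make the iterative solution concrete, here is a sketch of evaluating a fixed policy by repeatedly applying its Bellman operator; the tabular interface (T[s][a][s'] for transition probabilities, R[s][a][s'] for rewards) is an assumption made purely for illustration.

def evaluate_policy(T, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: sweep states, applying the Bellman equation for the policy."""
    num_states = len(T)
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            a = policy[s]
            new_v = sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in range(num_states))
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V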
-
August 11: MDP Planning.
Reading: Section 4.1, Littman, Dean, and Kaelbling (1995), Section
2.3, Szepesvári (2009).
Summary: Action value function, Value Iteration, Linear Programming
formulation for MDP planning, Bellman operator and its fixed point,
Bellman Optimality operator and its fixed point.
Reference: Wikipedia page
on Linear
Programming.
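A companion sketch of Value Iteration, which repeatedly applies the Bellman Optimality operator until the value function stops changing; it assumes the same illustrative T[s][a][s'] / R[s][a][s'] interface as the policy-evaluation sketch above.

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Value Iteration: iterate V(s) <- max_a sum_s' T(s,a,s') (R(s,a,s') + gamma V(s'))."""
    num_states, num_actions = len(T), len(T[0])
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            q = [sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(num_states))
                 for a in range(num_actions)]
            delta = max(delta, abs(max(q) - V[s]))
            V[s] = max(q)
        if delta < tol:
            break
    # Read off a greedy policy from the converged value function.
    policy = [max(range(num_actions),
                  key=lambda a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                    for s2 in range(num_states)))
              for s in range(num_states)]
    return V, policy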
-
August 16: MDP Planning.
Reading: Section 2, Kalyanakrishnan et al. (2016b), Sections 1 and 2 of slides.
Summary: Strong bounds, Policy Iteration, proof of correctness.
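A sketch of Howard's Policy Iteration under the same illustrative tabular interface: evaluate the current policy, switch every improvable state to a locally better action, and stop when no state can be improved.

def policy_iteration(T, R, gamma=0.9, tol=1e-8):
    """Howard's Policy Iteration on a tabular MDP given by T[s][a][s'] and R[s][a][s']."""
    num_states, num_actions = len(T), len(T[0])
    policy = [0] * num_states

    def q_value(V, s, a):
        return sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(num_states))

    while True:
        # Policy evaluation (iterative, to tolerance tol).
        V = [0.0] * num_states
        while True:
            delta = 0.0
            for s in range(num_states):
                new_v = q_value(V, s, policy[s])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                break
        # Policy improvement: switch all improvable states.
        stable = True
        for s in range(num_states):
            best = max(range(num_actions), key=lambda a: q_value(V, s, a))
            if q_value(V, s, best) > q_value(V, s, policy[s]) + tol:
                policy[s] = best
                stable = False
        if stable:
            return policy, V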
-
August 18: MDP Planning.
Reading: Sections 3 and 4 of slides.
Summary: Complexity analysis of Howard's Policy Iteration, Mansour and
Singh's Randomised Policy Iteration, and Batch-switching Policy
Iteration on 2-action MDPs.
References: Mansour and Singh (1999), Kalyanakrishnan et al. (2016a).
-
August 23: Invited talk by Manjesh Kumar Hanawal on adversarial bandits.
Reading: Lecture notes.
Reference: Chapter 21, Shalev-Shwartz and Ben-David (2014).
-
August 30: Class canceled.
-
September 1: Reinforcement learning.
Reading: Class Note 1.
Summary: The Reinforcement Learning problem, Ergodic MDPs, model-based
learning algorithm for acting optimally in the limit.
Reference: Section 10.3, Sutton and Barto (2017); Section 2.2, Singh et al. (2000).
-
September 6: Reinforcement learning.
Reading: Sections 5, 5.1, 5.2, 5.3, 5.4, Sutton and Barto (2017), Class Note 2.
Summary: Monte Carlo policy evaluation, first-visit and every-visit Monte Carlo
methods, PAC policy evaluation.
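A sketch of first-visit Monte Carlo policy evaluation; sample_episode is an assumed, user-supplied function that rolls out one episode under the policy being evaluated and returns a list of (state, reward) pairs, where the reward is the one received on leaving that state.

def first_visit_mc(sample_episode, num_states, num_episodes=10000, gamma=1.0):
    """First-visit Monte Carlo: average the returns observed from each state's first visit."""
    returns_sum = [0.0] * num_states
    returns_count = [0] * num_states
    for _ in range(num_episodes):
        episode = sample_episode()
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Compute the return from every time step by walking the episode backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_count[s] += 1
    return [returns_sum[s] / returns_count[s] if returns_count[s] else 0.0
            for s in range(num_states)]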
-
September 8: Reinforcement learning.
Summary: Maximum likelihood estimates and least squares estimates,
on-line implementation of Monte Carlo policy evaluation,
bootstrapping, TD(0) algorithm.
Reading: Sections 6, 6.1, 6.2, 6.3, Sutton and Barto (2017).
Reference: Robbins and Monro (1951).
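A sketch of the tabular TD(0) update; sample_transition is an assumed, user-supplied function that takes one step under the policy being evaluated and returns (reward, next state, done), and episodes are restarted from an arbitrary state for simplicity.

import random

def td0(sample_transition, num_states, alpha=0.1, gamma=1.0, steps=100000):
    """TD(0): move V(s) towards the bootstrapped target r + gamma * V(s')."""
    V = [0.0] * num_states
    s = random.randrange(num_states)
    for _ in range(steps):
        r, s2, done = sample_transition(s)
        target = r if done else r + gamma * V[s2]
        V[s] += alpha * (target - V[s])
        s = random.randrange(num_states) if done else s2
    return V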
-
September 8: Invited talk by Vijay Nadkarni (Visteon
Corporation): The Artificial Intelligence Revolution in Autonomous
Driving and the Re-invention of the Automotive Industry. 5.30
p.m. – 6.30 p.m., F. C. Kohli Auditorium, KReSIT Building.
-
September 12: Mid-semester examination.
-
September 20: Reinforcement learning.
Reading: Sections 7, 7.1, 7.2, 7.3, 7.4, Sutton and Barto (1998).
Summary: TD learning in animals, TD(λ) algorithm.
References: Wikipedia
page on Classical Conditioning; Montague, Dayan, and Sejnowski
(1994); Hollerman and Schultz (1998).
-
September 22: Reinforcement learning.
Reading: Sections 6.4, 6.5, 6.6, Sutton and Barto (2017).
Summary: Model-free control: Sarsa, Q-learning, Expected Sarsa; Need
for generalisation in practice; soccer as illustrative example.
References: Watkins and Dayan (1992), Singh et al. (2000), Section 2.1, Kalyanakrishnan et al. (2007).
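A sketch of tabular Q-learning with an ε-greedy behaviour policy; env_step is an assumed, user-supplied function returning (reward, next state, done) for a state-action pair, and the task is assumed episodic.

import random

def q_learning(env_step, num_states, num_actions, start_state,
               alpha=0.1, gamma=0.99, epsilon=0.1, episodes=5000):
    """Q-learning: update Q(s,a) towards r + gamma * max_a' Q(s',a') (off-policy)."""
    Q = [[0.0] * num_actions for _ in range(num_states)]
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(num_actions)                   # explore
            else:
                a = max(range(num_actions), key=lambda x: Q[s][x])  # act greedily
            r, s2, done = env_step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Replacing the max over Q[s2] with the Q-value of the action actually chosen in s2 gives Sarsa, and replacing it with the ε-greedy expectation over Q[s2] gives Expected Sarsa.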
-
September 27: Reinforcement learning, Invited talk.
Reading: Sections 9, 9.1, 9.2, 9.3, 9.4, 11.4, Sutton and Barto
(2017).
Summary: Linear TD(λ), Geometric view of linear function
approximation, Invited talk by Harshad Khadilkar: Current
industrial applications of reinforcement learning.
Reference: Slides.
-
September 29: Reinforcement learning.
Reading: Sections 9.5, 10, 10.1, 10.2, 11, 11.1, 11.2, 11.3, Sutton and
Barto (2017).
Summary: Control with function approximation, Tsitsiklis and Van Roy's
counter-example, divergence of off-policy bootstrapping with linear
function approximation, Tile coding.
-
October 4: Reinforcement learning.
Reading: Sections 1, 2, 3, 4, 4.1, 4.2, 4.3, 4.4, 4.5, Kalyanakrishnan
and Stone (2011).
Summary: Tile coding (continued), Effect of representation on learning
methods, Comparison of Value Function-based methods and Policy Search
methods on Parameterised Learning Problems.
-
October 6: Reinforcement learning.
Reading: Chapter 4, Kalyanakrishnan (2011).
Summary: Illustration of the effect of representation on Value
Function-based methods and Policy Search methods in the game of
Tetris; Evolutionary algorithms.
References: Moriarty, Schultz, and Grefenstette (1999), Szita and Lőrincz (2006).
-
October 6: Reinforcement learning.
Reading: Sections 13, 13.1, 13.2, 13.3, Sutton and Barto (2017).
Summary: Brief review of tile coding, Policy gradient methods, Policy gradient theorem, REINFORCE.
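A sketch of REINFORCE with a tabular softmax policy; sample_episode_with_policy is an assumed, user-supplied rollout function, and the γ^t weighting of the gradient is omitted, as is common in practice.

import math
import random

def reinforce(sample_episode_with_policy, num_states, num_actions,
              alpha=0.01, gamma=1.0, episodes=20000):
    """REINFORCE: adjust softmax policy parameters along G_t * grad log pi(a_t | s_t)."""
    theta = [[0.0] * num_actions for _ in range(num_states)]

    def probs(s):
        z = [math.exp(theta[s][a]) for a in range(num_actions)]
        total = sum(z)
        return [x / total for x in z]

    def pi(s):
        return random.choices(range(num_actions), weights=probs(s))[0]

    for _ in range(episodes):
        # The rollout function must return a list of (state, action, reward) triples.
        episode = sample_episode_with_policy(pi)
        G = 0.0
        for s, a, r in reversed(episode):       # walk backwards so G is the return from each step
            G = r + gamma * G
            p = probs(s)
            for b in range(num_actions):
                grad = (1.0 if b == a else 0.0) - p[b]   # d log pi(a|s) / d theta[s][b]
                theta[s][b] += alpha * G * grad
    return theta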
-
October 13: Reinforcement learning.
Reading: Sections 13.4, 13.5, Sutton and Barto (2017), Kalyanakrishnan and Stone (2007).
Summary: REINFORCE with a baseline, Actor-critic methods, Spectrum of RL methods, Batch RL, Experience replay.
Reference: Appendices C, D, Kalyanakrishnan (2011).
-
October 18: Reinforcement learning.
Reading: Chapter 8, Sutton and Barto (2017).
Summary: Fitted Q Iteration, Model-based RL, Dyna-Q, Trajectory sampling, Prioritised sweeping.
Reference: Ernst, Geurts, and Wehenkel (2005).
-
October 20: Reinforcement learning.
Reading: Kocsis and Szepesvári (2006).
Summary: Full and sample backups, Real-time Dynamic Programming, Monte Carlo Tree Search, UCT.
-
October 25: Reinforcement learning.
Reading: Mnih et al. (2015), Silver et al. (2016).
Summary: DQN on ATARI video games, AlphaGo.
Reference: Silver et al. (2017).
-
October 27: Multi-agent learning.
Reading: Littman (1994).
Summary: Matrix games, Minimax-optimality, Markov games, Minimax-Q algorithm.
References: Bowling and Veloso (2001), Hu and Wellman (2003), Shoham et al. (2006).
-
November 1: Inverse reinforcement learning.
Reading: Ng and Russell (2000).
Summary: Inverse reinforcement learning problem, non-uniqueness of reward function, LP formulation.
Reference: Abbeel and Ng (2004).
-
November 3: Reward shaping.
Reading: Randløv and Alstrøm (1998), Ng et al. (1999).
Summary: Purpose of shaping rewards, bicycle control as example,
necessary and sufficient conditions for shaping rewards.
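A minimal sketch of potential-based shaping in the sense of Ng et al. (1999): adding F(s, s') = γφ(s') − φ(s) to the environment reward leaves the set of optimal policies unchanged. The potential function below is an assumed, user-supplied argument.

def shaped_reward(r, s, s_next, potential, gamma=0.99, done=False):
    """Return the environment reward r plus the shaping term gamma * phi(s') - phi(s)."""
    phi_next = 0.0 if done else potential(s_next)   # take the potential of terminal states as 0
    return r + gamma * phi_next - potential(s)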
-
November 8: Pure exploration in multi-armed bandits.
Reading: Slides.
Summary: Pure exploration, PAC subset selection, LUCB algorithm.
References: Sections 1, 2, 3, Kalyanakrishnan and Stone (2010), Kalyanakrishnan et al. (2012).
-
November 22: End-semester examination.
Assignments
-
Programming Assignment 1 (10 marks).
Announced August 4, 2017; due 11.55 p.m., August 14, 2017.
-
Programming Assignment 2 (10 marks).
Announced August 23, 2017; due 11.55 p.m., September 5, 2017.
-
Programming Assignment 3 (10 marks).
Announced September 29, 2017; due 11.55 p.m., October 10, 2017.
-
Project Proposal (5 marks).
Announced October 5, 2017; due 11.55 p.m., October 17, 2017.
-
Project (15 marks).
Announced October 23, 2017; due 11.55 p.m., November 23, 2017.
-
Programming Assignment 4 (10 marks).
Announced November 1, 2017; due 11.55 p.m., November 10, 2017.
Examination Papers from Previous Offerings