CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2015)
Instructor
Shivaram Kalyanakrishnan
Office: SIA-204
Phone: 7716
E-mail: shivaram@cse.iitb.ac.in
Teaching Assistants
Apoorv Aggarwal
Office: SIC-212 (Info Lab)
E-mail: apag@cse.iitb.ac.in
Anand Babu N. B.
Office: SIA-311
E-mail: anandb@cse.iitb.ac.in
Class
Lectures will be held in SIC-301 during Slot 6: 11.05 a.m. –
12.30 p.m. on Wednesdays and Fridays.
The instructor will hold office hours 4.00 p.m. – 6.00 p.m. on
Thursdays in SIA-204; meetings can also be arranged by
appointment.
Course Description
Today's computing systems are becoming increasingly adaptive and
autonomous: they are akin to intelligent, decision-making
``agents''. With its roots in artificial intelligence and machine
learning, this course covers the foundational principles of designing
such agents. Topics covered include: (1) agency, intelligence, and
learning; (2) exploration and multi-armed bandits; (3) Markov Decision
Problems and planning; (4) reinforcement learning; (5) search; (6)
multi-agent systems and multi-agent learning; and (7) case
studies.
The course will adopt a ``hands-on'' approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies, as well as invited talks from experts, will
offer an ``end-to-end'' view of deployed agents. It is hoped that
students will be able to apply what they learn in this course to their
pursuits in various areas of computer science and related fields.
Prerequisites
The course does not formally have other courses as
prerequisites. However, class lectures and assignments will assume
that the student is comfortable with probability and algorithms. The
course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in a language of his/her choice. The
student must be prepared to spend a significant amount of time on each
programming assignment.
Evaluation
Grades will be decided based on four programming assignments, each
worth 15 marks; a mid-semester examination worth 15 marks; and an
end-semester examination worth 25 marks (100 marks in total).
Programming assignments must be turned in through Moodle.
Students auditing the course must score 50 or more marks in the
course to be awarded an ``AU'' grade.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Acts such as copying in the
examinations and sharing code for the programming assignments will be
dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Texts and References
Reinforcement Learning: An Introduction, Richard S. Sutton and
Andrew G. Barto, MIT Press, 1998. On-line version.
Artificial Intelligence: A Modern Approach, Stuart J. Russell and
Peter Norvig, 3rd edition, Prentice-Hall, 2009.
Algorithms for Reinforcement Learning, Csaba Szepesvári,
Morgan & Claypool, 2009. On-line version.
Dynamic Programming and Optimal Control, Volume II, Dimitri
P. Bertsekas, 4th edition, Athena Scientific, 2012.
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit
Problems, Sébastien Bubeck and Nicolò Cesa-Bianchi, Foundations and
Trends in Machine Learning, Volume 5, Number 1, 2012. On-line version.
Selected research papers:
-
A Formal Basis for the Heuristic Determination of Minimum Cost Paths
Peter E. Hart, Nils J. Nilsson, and Bertram Raphael, 1968
-
Modified Policy Iteration Algorithms for Discounted Markov Decision Problems
Martin L. Puterman and Moon Chirl Shin, 1978
-
Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
-
Learning to Predict by the Methods of Temporal Differences
Richard S. Sutton, 1988
-
Programming Robots Using Reinforcement Learning and Teaching
Long-ji Lin, 1991
-
Q-learning
Christopher J. C. H. Watkins and Peter Dayan, 1992
-
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
Ronald J. Williams and Leemon C. Baird, III, 1993
-
Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman, 1994
-
An Upper Bound on the Loss from Approximate Optimal-Value Functions
Satinder P. Singh and Richard C. Yee, 1994
-
On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995
-
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
-
On the Complexity of Policy Iteration
Yishay Mansour and Satinder Singh, 1999
-
Policy invariance under reward transformations: Theory and application to reward shaping
Andrew Y. Ng, Daishi Harada, and Stuart Russell, 1999
-
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári, 2000
-
Policy Gradient Methods for Reinforcement Learning with Function Approximation
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour, 2000
-
Rational and Convergent Learning in Stochastic Games
Michael Bowling and Manuela Veloso, 2001
-
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
-
Nash Q-Learning for General-Sum Stochastic Games
Junling Hu and Michael P. Wellman, 2003
-
The Nonstochastic Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire, 2002a
-
Variable Resolution Discretization in Optimal Control
Rémi Munos and Andrew Moore, 2002
-
Function Approximation via Tile Coding: Automating Parameter Choice
Alexander A. Sherstov and Peter Stone, 2005
-
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
Eyal Even-Dar, Shie Mannor, and Yishay Mansour, 2006
-
If multi-agent learning is the answer, what is the question?
Yoav Shoham, Rob Powers, and Trond Grenager, 2006
-
Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007
-
Stochastic Linear Optimization under Bandit Feedback
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade, 2008
-
Policy Search for Motor Primitives in Robotics
Jens Kober and Jan Peters, 2008
-
Exponential Lower Bounds For Policy Iteration
John Fearnley, 2010
-
Near-optimal Regret Bounds for Reinforcement Learning
Thomas Jaksch, Ronald Ortner, and Peter Auer, 2010
-
Learning Complementary Multiagent Behaviors: A Case Study
Shivaram Kalyanakrishnan and Peter Stone, 2010
-
A contextual-bandit approach to personalized news article recommendation
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, 2010
-
Toward Off-Policy Learning Control With Function Approximation
Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton, 2010
-
Artificial Intelligence: Foundations of Computational Agents
David Poole and Alan Mackworth, 2010
-
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
-
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
-
Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011
-
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will be used for sharing resources for the lectures and
assignments, and also for recording grades.
E-mail is the best means of communicating with the instructor;
students must include ``[CS747]'' in the subject line of their e-mail,
with a copy marked to the TAs.
Schedule
-
July 22: Welcome, Introduction to the course.
Summary: Conception of course, syllabus, policies, administrative matters.
-
July 24: Multi-armed bandits.
Reading: Sections 2 – 2.4, Sutton and Barto (1998).
Summary: Definition of stochastic multi-armed bandit, notion of
algorithm, greedy and ε-greedy algorithms, graph of expected
reward against number of pulls.
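To make the ε-greedy rule named above concrete, here is a minimal, illustrative Python sketch; the three-armed Bernoulli bandit, its means, and the parameter values are made up for illustration and are not from the course material.

import random

def epsilon_greedy(true_means, horizon=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit; returns empirical means and total reward."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # number of pulls of each arm
    estimates = [0.0] * n     # empirical mean reward of each arm
    total_reward = 0.0
    for t in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore uniformly
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploit current best
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# Example: a made-up 3-armed Bernoulli bandit with means 0.3, 0.5, 0.7.
print(epsilon_greedy([0.3, 0.5, 0.7]))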
-
July 29: Multi-armed bandits.
Reading: Auer et al. (2002).
Summary: GLIE variants of ε-greedy algorithms, Softmax
exploration, deterministic exploration algorithms, Lai and Robbins'
lower bound on regret.
References: Singh et al. (2000), Lai and Robbins (1985).
-
July 31: Multi-armed bandits.
Reading: Auer et al. (2002).
Summary: UCB algorithm, Hoeffding's inequality, union bound, steps towards analysing UCB.
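The UCB rule of Auer et al. (2002) can be sketched as below: each arm's index is its empirical mean plus the confidence term sqrt(2 ln t / n_a). The bandit instance is again made up, purely for illustration.

import math
import random

def ucb1(true_means, horizon=10000, seed=0):
    """UCB1 on a Bernoulli bandit: pull the arm with the highest upper confidence bound."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n

    def pull(arm):
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    for arm in range(n):          # pull each arm once to initialise the estimates
        pull(arm)
    for t in range(n, horizon):
        ucb = [estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(n)]
        pull(max(range(n), key=lambda a: ucb[a]))
    return estimates, counts

print(ucb1([0.3, 0.5, 0.7]))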
-
August 5: Multi-armed bandits.
Reading: Chapelle and Li (2011).
Summary: Proof of regret bound for UCB, introduction to KL-UCB and Thompson Sampling.
References: Garivier and Cappé (2011), Kaufmann et al. (2012).
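Thompson Sampling admits a particularly short implementation for Bernoulli rewards with Beta priors, as discussed in Chapelle and Li (2011). The sketch below is illustrative only; the bandit instance is made up.

import random

def thompson_sampling(true_means, horizon=10000, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta(1, 1) priors on each arm's mean."""
    rng = random.Random(seed)
    n = len(true_means)
    successes = [0] * n   # observed 1-rewards per arm
    failures = [0] * n    # observed 0-rewards per arm
    for t in range(horizon):
        # Sample a mean from each arm's posterior and pull the arm with the largest sample.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(n)]
        arm = max(range(n), key=lambda a: samples[a])
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

print(thompson_sampling([0.3, 0.5, 0.7]))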
-
August 7: Multi-armed bandits.
Summary: Preview of Programming Assignment 1, presentation of KL-UCB and Thompson Sampling, discussion of other bandit settings (adversarial, pure exploration, contextual and linear bandits).
References: Auer et al. (2002a), Even-Dar et al. (2006), Li et al. (2010), Dani et al. (2008).
-
August 12: Multi-Armed Bandit Sampling in Nested Simulation for
Financial Portfolio Risk Measurement. Invited talk by Sandeep Juneja.
-
August 14: Some Cool Problems in Some Hot Areas. Invited talk by Krithi Ramamritham.
-
August 19: MDP Planning.
Reading: Chapter 3, Sutton and Barto (1998).
Summary: Agent-environment interface, ``design'' of agent behaviour through specification of rewards, Markov Decision Problem.
-
August 21: MDP Planning.
Summary: Definitions (MDP, policy, value function), optimal policy, Bellman's equations.
Reference: Mahadevan (1996).
-
August 26: MDP Planning.
Reading: Chapter 4, Sutton and
Barto (1998).
Summary: Illustration of policy evaluation through
example, solution of Bellman's equations (both by solving a system of
linear equations, and through iteration), Bellman's Optimality
Equations, MDP planning through Linear Programming.
Reference: Szepesvári (2009, Chapter 2 and Appendix A).
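As an illustration of policy evaluation by iteration, the sketch below repeatedly applies the one-step Bellman backup for a fixed policy until the values stop changing. The two-state MDP and the (probability, next state, reward) encoding are made-up conventions for this sketch, not from the course material.

def evaluate_policy(num_states, transitions, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: sweep the states, replacing each value in place
    by its one-step Bellman backup under the given policy.

    transitions[s][a] is a list of (probability, next_state, reward) triples."""
    values = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            a = policy[s]
            new_v = sum(p * (r + gamma * values[s2]) for (p, s2, r) in transitions[s][a])
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values

# A toy two-state, two-action MDP (made up for illustration).
transitions = [
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
]
print(evaluate_policy(2, transitions, policy=[1, 1]))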
-
August 28: MDP Planning.
Summary: Policy Improvement Theorem, Policy Iteration, Value Iteration.
References: Singh and Yee (1994), Littman et al. (1995), Mansour and Singh (1999), Fearnley (2010).
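A corresponding illustrative sketch of Value Iteration, using the same made-up MDP encoding as in the policy-evaluation sketch above; a greedy policy is read off at the end.

def value_iteration(num_states, num_actions, transitions, gamma=0.9, tol=1e-8):
    """Value Iteration: iterate the Bellman optimality backup, then return a greedy policy.

    transitions[s][a] is a list of (probability, next_state, reward) triples."""
    values = [0.0] * num_states

    def q_value(s, a, v):
        return sum(p * (r + gamma * v[s2]) for (p, s2, r) in transitions[s][a])

    while True:
        new_values = [max(q_value(s, a, values) for a in range(num_actions))
                      for s in range(num_states)]
        converged = max(abs(nv - v) for nv, v in zip(new_values, values)) < tol
        values = new_values
        if converged:
            break
    policy = [max(range(num_actions), key=lambda a: q_value(s, a, values))
              for s in range(num_states)]
    return values, policy

# The same toy two-state MDP as above (made up for illustration).
transitions = [
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
]
print(value_iteration(2, 2, transitions))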
-
September 2: MDP Planning.
Summary: Proof of Policy Improvement Theorem, brief discussions on the
approximation achieved by Value Iteration, Asynchronous Policy Iteration, Generalised Policy Iteration.
References: Williams and Baird (1993), Puterman and Shin (1978).
-
September 4: RoboCup: A Grand Challenge for AI.
References: Kalyanakrishnan et al. (2007), Kalyanakrishnan and Stone (2010).
-
September 12: Mid-semester examination. 8.30 a.m. – 10.30 a.m., LA 301.
-
September 16: Reinforcement Learning.
-
September 18: Reinforcement Learning.
Reading: Chapter 5, Sutton and Barto (1998).
Summary: First-visit and Every-visit Monte Carlo methods for policy evaluation, estimating Q-values.
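A possible sketch of first-visit Monte Carlo policy evaluation from complete episodes; the episode encoding and the toy data below are made up for illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    episodes: list of episodes; each episode is a list of (state, reward) pairs,
    where reward is the reward received on leaving that state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Work backwards so that g is the return from each state onwards;
        # earlier-in-time visits overwrite later ones, leaving the first-visit return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g in first_visit_return.items():
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Two made-up episodes generated by some policy.
print(first_visit_mc([[('A', 1.0), ('B', 0.0)], [('B', 2.0)]]))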
-
September 23: Reinforcement Learning.
Summary: PAC analysis of Policy Iteration with Monte Carlo estimation of Q-values.
-
September 30: Reinforcement Learning.
Reading: Chapter 6, Sutton and Barto (1998).
Summary: TD(0), contrast with Monte Carlo methods.
Reference: Sutton (1988).
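For contrast with the Monte Carlo sketch above, here is tabular TD(0), which bootstraps from the current estimate of the next state's value instead of waiting for the full return. The transition encoding and toy episode are made up for illustration.

from collections import defaultdict

def td0(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation from logged transitions.

    episodes: list of episodes; each episode is a list of (state, reward, next_state)
    transitions, with next_state=None at termination."""
    values = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state in episode:
            target = reward + (gamma * values[next_state] if next_state is not None else 0.0)
            values[state] += alpha * (target - values[state])   # move towards the TD target
    return dict(values)

# A single made-up episode: A -> B -> terminal, with reward 1 on the last step.
print(td0([[('A', 0.0, 'B'), ('B', 1.0, None)]]))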
-
October 7: Reinforcement Learning.
Reading: Chapter 6, Sutton and Barto (1998).
Summary: Off-policy and on-policy control.
References: Watkins and Dayan (1992), Singh et al. (2000), Jaksch et al. (2010).
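The off-policy control method mentioned here, Q-learning (Watkins and Dayan, 1992), can be sketched as follows. The env_reset/env_step interface and the small corridor environment are hypothetical, introduced only for this illustration.

import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning (off-policy control) with epsilon-greedy behaviour.

    env_reset() -> state and env_step(state, action) -> (next_state, reward, done)
    are assumed interfaces for this sketch."""
    rng = random.Random(seed)
    q = defaultdict(float)                                       # q[(state, action)]
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(actions)                     # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = env_step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

# A tiny made-up corridor: states 0..3, actions -1/+1, reward 1 on reaching state 3.
def env_reset():
    return 0

def env_step(state, action):
    next_state = min(3, max(0, state + action))
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done

q = q_learning(env_reset, env_step, actions=[-1, 1])
print(max([-1, 1], key=lambda a: q[(0, a)]))   # the learned greedy action at state 0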
-
October 9: Reinforcement Learning.
Reading: Chapter 9, Sutton and Barto (1998).
Summary: Model-based RL (example: Dyna) and batch RL (example: Experience Replay).
Reference: Lin (1991).
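One way to picture batch RL with Experience Replay: store transitions in a buffer and repeatedly apply Q-learning backups to sampled mini-batches. The sketch below is illustrative only; the buffer contents, chain environment, and parameters are made up.

import random
from collections import defaultdict

def replay_updates(q, buffer, actions, batch_size=4, alpha=0.1, gamma=0.99, rng=None):
    """One round of Q-learning backups on a random mini-batch of stored transitions."""
    rng = rng or random.Random(0)
    for state, action, reward, next_state, done in rng.sample(buffer, min(batch_size, len(buffer))):
        best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

# Made-up (state, action, reward, next_state, done) transitions from a two-state chain;
# replaying the same experience many times propagates the terminal reward backwards.
q = defaultdict(float)
buffer = [(0, 1, 0.0, 1, False), (1, 1, 1.0, None, True), (0, -1, 0.0, 0, False)]
for _ in range(200):
    replay_updates(q, buffer, actions=[-1, 1])
print(dict(q))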
-
October 14: Reinforcement Learning.
Summary: Illustration of the need for generalisation using Half Field Offense as an example.
Reference: Kalyanakrishnan et al. (2007).
-
October 16: Reinforcement Learning.
Reading: Chapter 8, Sutton and Barto (1998).
Summary: Neural networks, linear methods, gradient descent.
-
October 21: Reinforcement Learning.
Summary: Tile coding, gradient descent with linear function
approximation, convergence results, off-policy bootstrapping.
References: Munos and Moore (2002), Sherstov and Stone (2005), Maei et al. (2010).
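A sketch of gradient descent with linear function approximation for policy evaluation, i.e. semi-gradient TD(0) with v(s) = w·x(s). The feature function stands in for a tile coder; the one-hot features and toy episode below are made up for illustration.

def semi_gradient_td0(episodes, features, num_features, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0) with a linear value function v(s) = w . x(s).

    episodes: list of episodes, each a list of (state, reward, next_state)
    transitions with next_state=None at termination.
    features(state) -> list of num_features feature values (e.g. from a tile coder)."""
    w = [0.0] * num_features

    def value(state):
        return sum(wi * xi for wi, xi in zip(w, features(state)))

    for episode in episodes:
        for state, reward, next_state in episode:
            target = reward + (gamma * value(next_state) if next_state is not None else 0.0)
            delta = target - value(state)          # TD error
            x = features(state)
            for i in range(num_features):          # gradient step along the state's features
                w[i] += alpha * delta * x[i]
    return w

# Example with made-up one-hot features over three states (a degenerate "tile coding").
feats = lambda s: [1.0 if s == i else 0.0 for i in range(3)]
print(semi_gradient_td0([[(0, 0.0, 1), (1, 1.0, None)]] * 100, feats, 3))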
-
October 23: Reinforcement Learning.
Reading: Sutton et al. (2000).
Summary: Representations, approximate value functions and policies, policy gradient theorem.
Reference: Kober and Peters (2008).
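The simplest instance of a policy-gradient update is the softmax ("gradient bandit") version on a bandit problem, where the gradient of the log-policy is available in closed form. The sketch below is illustrative only: the arm means are made up, and the REINFORCE-style update with a running-average baseline is a special case rather than the general function-approximation setting of the reading.

import math
import random

def softmax_policy_gradient(true_means, steps=5000, alpha=0.05, seed=0):
    """Policy-gradient updates of a softmax policy over per-arm preferences,
    using the sampled reward minus a running-average baseline."""
    rng = random.Random(seed)
    n = len(true_means)
    prefs = [0.0] * n
    baseline = 0.0
    for t in range(1, steps + 1):
        exps = [math.exp(h) for h in prefs]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Sample an arm from the softmax policy.
        u, acc, arm = rng.random(), 0.0, n - 1
        for a, p in enumerate(probs):
            acc += p
            if u <= acc:
                arm = a
                break
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        baseline += (reward - baseline) / t              # running average reward
        # Gradient of log pi(arm) w.r.t. preference h_a is 1{a == arm} - pi(a).
        for a in range(n):
            grad = (1.0 if a == arm else 0.0) - probs[a]
            prefs[a] += alpha * (reward - baseline) * grad
    return prefs

# Made-up arm means; the preference of the best arm (mean 0.7) should end up largest.
print(softmax_policy_gradient([0.3, 0.5, 0.7]))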
-
October 28: Learning Methods for Sequential Decision Making with Imperfect Representations.
Reference: Kalyanakrishnan (2011).
-
October 30: Multi-agent learning.
Reading: Littman (1994).
Summary: Brief introduction to reward shaping, matrix games, Minimax-optimality, Markov games, Minimax-Q algorithm.
References: Ng et al. (1999), Bowling and Veloso (2001), Hu and Wellman (2003), Shoham et al. (2006).
-
November 4: Overview of CS 748: Advances in Intelligent and Learning Agents.
-
November 6: Search.
Summary: Applications of search, relationship with MDPs, Dijkstra's algorithm for graphs, breadth-first search and depth-first search, A* search.
References: Chapter 3, Poole and Mackworth (2010), Hart, Nilsson, and Raphael (1968).
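An illustrative sketch of A* search: nodes are expanded in increasing order of g(n) + h(n), and with a zero heuristic the procedure reduces to Dijkstra's algorithm. The graph below is made up for illustration.

import heapq

def a_star(start, goal, neighbours, heuristic):
    """A* search: expand nodes in increasing order of g(n) + h(n).

    neighbours(node) -> iterable of (next_node, edge_cost);
    heuristic(node) -> admissible estimate of the remaining cost to the goal."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in neighbours(node):
            new_g = g + cost
            if new_g < best_cost.get(nxt, float('inf')):
                best_cost[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt, path + [nxt]))
    return None, float('inf')

# A made-up weighted graph; with a zero heuristic, A* behaves like Dijkstra's algorithm.
graph = {'A': [('B', 1.0), ('C', 4.0)], 'B': [('C', 1.0)], 'C': []}
print(a_star('A', 'C', lambda n: graph[n], lambda n: 0.0))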
-
November 21: End-semester examination. 9.30 a.m. – 12.30 p.m., LA 301.
Assignments
-
August 19: Programming assignment 1 announced on
Moodle. Submission deadline: 11.59 p.m., September 3, 2015.
-
September 17: Programming assignment 2 announced on
Moodle. Submission deadline: 11.59 p.m., October 3, 2015.
-
October 7: Programming assignment 3 announced on
Moodle. Submission deadline: 11.59 p.m., October 22, 2015.
-
October 25: Programming assignment 4 announced on
Moodle. Submission deadline: 11.59 p.m., November 10, 2015.