CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2015)
Instructor
Shivaram Kalyanakrishnan
Office: SIA-204
Phone: 7716
E-mail: shivaram@cse.iitb.ac.in
Teaching Assistants
Apoorv Aggarwal
Office: SIC-212 (Info Lab)
E-mail: apag@cse.iitb.ac.in
Anand Babu N. B.
Office: SIA-311
E-mail: anandb@cse.iitb.ac.in
Class
Lectures will be held in SIC-301 during Slot 6: 11.05 a.m. –
12.30 p.m. on Wednesdays and Fridays.
The instructor will hold office hours 4.00 p.m. – 6.00 p.m. on
Thursdays in SIA-204; meetings can also be arranged by
appointment.
Course Description
Today's computing systems are becoming increasingly adaptive and
autonomous: they are akin to intelligent, decision-making
``agents''. With its roots in artificial intelligence and machine
learning, this course covers the foundational principles of designing
such agents. Topics covered include: (1) agency, intelligence, and
learning; (2) exploration and multi-armed bandits; (3) Markov Decision
Problems and planning; (4) reinforcement learning; (5) search; (6)
multi-agent systems and multi-agent learning; and (7) case
studies.
The course will adopt a ``hands-on'' approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies, as well as invited talks from experts, will
offer an ``end-to-end'' view of deployed agents. It is hoped that
students will be able to apply what they learn in this course to their
pursuits in various areas of computer science and related fields.
Prerequisites
The course does not formally have other courses as
prerequisites. However, class lectures and assignments will assume
that the student is comfortable with probability and algorithms. The
course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in a language of his/her choice. The
student must be prepared to spend a significant amount of time on each
programming assignment.
Evaluation
Grades will be decided based on four programming assignments, each
worth 15 marks; a mid-semester examination worth 15 marks; and an
end-semester examination worth 25 marks (100 marks in total).
Programming assignments must be turned in through Moodle.
Students auditing the course must score 50 or more marks in the
course to be awarded an ``AU'' grade.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Acts such as copying in the
examinations and sharing code for the programming assignments will be
dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Texts and References
Reinforcement Learning: An Introduction, Richard S. Sutton and
Andrew G. Barto, MIT Press, 1998. On-line version.
Artificial Intelligence: A Modern Approach, Stuart J. Russell and
Peter Norvig, 3rd edition, Prentice-Hall, 2009.
Algorithms for Reinforcement Learning, Csaba Szepesvári,
Morgan & Claypool, 2009. On-line version.
Dynamic Programming and Optimal Control, Volume II, Dimitri
P. Bertsekas, 4th edition, Athena Scientific, 2012.
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit
Problems, Sébastien Bubeck and Nicolò Cesa-Bianchi, Foundations and
Trends in Machine Learning, Volume 5, Number 1, 2012. On-line version.
Selected research papers:
-
A Formal Basis for the Heuristic Determination of Minimum Cost Paths
Peter E. Hart, Nils J. Nilsson, and Bertram Raphael, 1968
-
Modified Policy Iteration Algorithms for Discounted Markov Decision Problems
Martin L. Puterman and Moon Chirl Shin, 1978
-
Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
-
Learning to Predict by the Methods of Temporal Differences
Richard S. Sutton, 1988
-
Programming Robots Using Reinforcement Learning and Teaching
Long-ji Lin, 1991
-
Q-learning
Christopher J. C. H. Watkins and Peter Dayan, 1992
-
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
Ronald J. Williams and Leemon C. Baird, III, 1993
-
Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman, 1994
-
An Upper Bound on the Loss from Approximate Optimal-Value Functions
Satinder P. Singh and Richard C. Yee, 1994
-
On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995
-
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
-
On the Complexity of Policy Iteration
Yishay Mansour and Satinder Singh, 1999
-
Policy invariance under reward transformations: Theory and application to reward shaping
Andrew Y. Ng, Daishi Harada, and Stuart Russell, 1999
-
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári, 2000
-
Policy Gradient Methods for Reinforcement Learning with Function Approximation
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour, 2000
-
Rational and Convergent Learning in Stochastic Games
Michael Bowling and Manuela Veloso, 2001
-
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
-
Nash Q-Learning for General-Sum Stochastic Games
Junling Hu and Michael P. Wellman, 2003
-
The Nonstochastic Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire, 2002a
-
Variable Resolution Discretization in Optimal Control
Rémi Munos and Andrew Moore, 2002
-
Function Approximation via Tile Coding: Automating Parameter Choice
Alexander A. Sherstov and Peter Stone, 2005
-
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
Eyal Even-Dar, Shie Mannor, and Yishay Mansour, 2006
-
If multi-agent learning is the answer, what is the question?
Yoav Shoham, Rob Powers, and Trond Grenager, 2006
-
Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007
-
Stochastic Linear Optimization under Bandit Feedback
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade, 2008
-
Policy Search for Motor Primitives in Robotics
Jens Kober and Jan Peters, 2008
-
Exponential Lower Bounds For Policy Iteration
John Fearnley, 2010
-
Near-optimal Regret Bounds for Reinforcement Learning
Thomas Jaksch, Ronald Ortner, and Peter Auer, 2010
-
Learning Complementary Multiagent Behaviors: A Case Study
Shivaram Kalyanakrishnan and Peter Stone, 2010
-
A contextual-bandit approach to personalized news article recommendation
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, 2010
-
Toward Off-Policy Learning Control With Function Approximation
Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton, 2010
-
Artificial Intelligence: Foundations of Computational Agents
David Poole and Alan Mackworth, 2010
-
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
-
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
-
Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011
-
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will be used for sharing resources for the lectures and
assignments, and also for recording grades.
E-mail is the best means of communicating with the instructor;
students must include ``[CS747]'' in the subject line of their e-mail,
with a copy marked to the TAs.
Schedule
-
July 22: Welcome, Introduction to the course.
Summary: Conception of course, syllabus, policies, administrative matters.
-
July 24: Multi-armed bandits.
Reading: Sections 2 – 2.4, Sutton and Barto (1998).
Summary: Definition of stochastic multi-armed bandit, notion of
algorithm, greedy and ε-greedy algorithms, graph of expected
reward against number of pulls.
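To make the ε-greedy rule named above concrete, here is a minimal, illustrative Python sketch; the three-armed Bernoulli bandit, its means, and the parameter values are made up for illustration and are not from the course material.

import random

def epsilon_greedy(true_means, horizon=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit; returns empirical means and total reward."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # number of pulls of each arm
    estimates = [0.0] * n     # empirical mean reward of each arm
    total_reward = 0.0
    for t in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore uniformly
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploit current best
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# Example: a made-up 3-armed Bernoulli bandit with means 0.3, 0.5, 0.7.
print(epsilon_greedy([0.3, 0.5, 0.7]))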
-
July 29: Multi-armed bandits.
Reading: Auer et al. (2002).
Summary: GLIE variants of ε-greedy algorithms, Softmax
exploration, deterministic exploration algorithms, Lai and Robbins'
lower bound on regret.
References: Singh et al. (2000), Lai and Robbins (1985).
-
July 31: Multi-armed bandits.
Reading: Auer et al. (2002).
Summary: UCB algorithm, Hoeffding's inequality, union bound, steps towards analysing UCB.
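The UCB rule of Auer et al. (2002) can be sketched as below: each arm's index is its empirical mean plus the confidence term sqrt(2 ln t / n_a). The bandit instance is again made up, purely for illustration.

import math
import random

def ucb1(true_means, horizon=10000, seed=0):
    """UCB1 on a Bernoulli bandit: pull the arm with the highest upper confidence bound."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n

    def pull(arm):
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    for arm in range(n):          # pull each arm once to initialise the estimates
        pull(arm)
    for t in range(n, horizon):
        ucb = [estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(n)]
        pull(max(range(n), key=lambda a: ucb[a]))
    return estimates, counts

print(ucb1([0.3, 0.5, 0.7]))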
-
August 5: Multi-armed bandits.
Reading: Chapelle and Li (2011).
Summary: Proof of regret bound for UCB, introduction to KL-UCB and Thompson Sampling.
References: Garivier and Cappé (2011), Kaufmann et al. (2012).
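Thompson Sampling admits a particularly short implementation for Bernoulli rewards with Beta priors, as discussed in Chapelle and Li (2011). The sketch below is illustrative only; the bandit instance is made up.

import random

def thompson_sampling(true_means, horizon=10000, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta(1, 1) priors on each arm's mean."""
    rng = random.Random(seed)
    n = len(true_means)
    successes = [0] * n   # observed 1-rewards per arm
    failures = [0] * n    # observed 0-rewards per arm
    for t in range(horizon):
        # Sample a mean from each arm's posterior and pull the arm with the largest sample.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(n)]
        arm = max(range(n), key=lambda a: samples[a])
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

print(thompson_sampling([0.3, 0.5, 0.7]))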
-
August 7: Multi-armed bandits.
Summary: Preview of Programming Assignment 1, presentation of KL-UCB and Thompson Sampling, discussion of other bandit settings (adversarial, pure exploration, contextual and linear bandits).
References: Auer et al. (2002a), Even-Dar et al. (2006), Li et al. (2010), Dani et al. (2008).
-
August 12: Multi-Armed Bandit Sampling in Nested Simulation for
Financial Portfolio Risk Measurement. Invited talk by Sandeep Juneja.
-
August 14: Some Cool Problems in Some Hot Areas. Invited talk by Krithi Ramamritham.
-
August 19: MDP Planning.
Reading: Chapter 3, Sutton and Barto (1998).
Summary: Agent-environment interface, ``design'' of agent behaviour through specification of rewards, Markov Decision Problem.
-
August 21: MDP Planning.
Summary: Definitions (MDP, policy, value function), optimal policy, Bellman's equations.
Reference: Mahadevan (1996).
-
August 26: MDP Planning.
Reading: Chapter 4, Sutton and
Barto (1998).
Summary: Illustration of policy evaluation through
example, solution of Bellman's equations (both by solving a system of
linear equations, and through iteration), Bellman's Optimality
Equations, MDP planning through Linear Programming.
Reference: Szepesvári (2009, Chapter 2 and Appendix A).
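As an illustration of policy evaluation by iteration, the sketch below repeatedly applies the one-step Bellman backup for a fixed policy until the values stop changing. The two-state MDP and the (probability, next state, reward) encoding are made-up conventions for this sketch, not from the course material.

def evaluate_policy(num_states, transitions, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: sweep the states, replacing each value in place
    by its one-step Bellman backup under the given policy.

    transitions[s][a] is a list of (probability, next_state, reward) triples."""
    values = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            a = policy[s]
            new_v = sum(p * (r + gamma * values[s2]) for (p, s2, r) in transitions[s][a])
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values

# A toy two-state, two-action MDP (made up for illustration).
transitions = [
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
]
print(evaluate_policy(2, transitions, policy=[1, 1]))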
-
August 28: MDP Planning.
Summary: Policy Improvement Theorem, Policy Iteration, Value Iteration.
References: Singh and Yee (1994), Littman et al. (1995), Mansour and Singh (1999), Fearnley (2010).
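A corresponding illustrative sketch of Value Iteration, using the same made-up MDP encoding as in the policy-evaluation sketch above; a greedy policy is read off at the end.

def value_iteration(num_states, num_actions, transitions, gamma=0.9, tol=1e-8):
    """Value Iteration: iterate the Bellman optimality backup, then return a greedy policy.

    transitions[s][a] is a list of (probability, next_state, reward) triples."""
    values = [0.0] * num_states

    def q_value(s, a, v):
        return sum(p * (r + gamma * v[s2]) for (p, s2, r) in transitions[s][a])

    while True:
        new_values = [max(q_value(s, a, values) for a in range(num_actions))
                      for s in range(num_states)]
        converged = max(abs(nv - v) for nv, v in zip(new_values, values)) < tol
        values = new_values
        if converged:
            break
    policy = [max(range(num_actions), key=lambda a: q_value(s, a, values))
              for s in range(num_states)]
    return values, policy

# The same toy two-state MDP as above (made up for illustration).
transitions = [
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
]
print(value_iteration(2, 2, transitions))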
-
September 2: MDP Planning.
Summary: Proof of Policy Improvement Theorem, brief discussions on the
approximation achieved by Value Iteration, Asynchronous Policy Iteration, Generalised Policy Iteration.
References: Williams and Baird (1993), Puterman and Shin (1978).
-
September 4: RoboCup: A Grand Challenge for AI.
References: Kalyanakrishnan et al. (2007), Kalyanakrishnan and Stone (2010).
-
September 12: Mid-semester examination. 8.30 a.m. – 10.30 a.m., LA 301.
-
September 16: Reinforcement Learning.
-
September 18: Reinforcement Learning.
Reading: Chapter 5, Sutton and Barto (1998).
Summary: First-visit and Every-visit Monte Carlo methods for policy evaluation, estimating Q-values.
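A possible sketch of first-visit Monte Carlo policy evaluation from complete episodes; the episode encoding and the toy data below are made up for illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    episodes: list of episodes; each episode is a list of (state, reward) pairs,
    where reward is the reward received on leaving that state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Work backwards so that g is the return from each state onwards;
        # earlier-in-time visits overwrite later ones, leaving the first-visit return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g in first_visit_return.items():
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Two made-up episodes generated by some policy.
print(first_visit_mc([[('A', 1.0), ('B', 0.0)], [('B', 2.0)]]))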
-
September 23: Reinforcement Learning.
Summary: PAC analysis of Policy Iteration with Monte Carlo estimation of Q-values.
-
September 30: Reinforcement Learning.
Reading: Chapter 6, Sutton and Barto (1998).
Summary: TD(0), contrast with Monte Carlo methods.
Reference: Sutton (1988).
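For contrast with the Monte Carlo sketch above, here is tabular TD(0), which bootstraps from the current estimate of the next state's value instead of waiting for the full return. The transition encoding and toy episode are made up for illustration.

from collections import defaultdict

def td0(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation from logged transitions.

    episodes: list of episodes; each episode is a list of (state, reward, next_state)
    transitions, with next_state=None at termination."""
    values = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state in episode:
            target = reward + (gamma * values[next_state] if next_state is not None else 0.0)
            values[state] += alpha * (target - values[state])   # move towards the TD target
    return dict(values)

# A single made-up episode: A -> B -> terminal, with reward 1 on the last step.
print(td0([[('A', 0.0, 'B'), ('B', 1.0, None)]]))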
-
October 7: Reinforcement Learning.
Reading: Chapter 6, Sutton and Barto (1998).
Summary: Off-policy and on-policy control.
References: Watkins and Dayan (1992), Singh et al. (2000), Jaksch et al. (2010).
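The off-policy control method mentioned here, Q-learning (Watkins and Dayan, 1992), can be sketched as follows. The env_reset/env_step interface and the small corridor environment are hypothetical, introduced only for this illustration.

import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning (off-policy control) with epsilon-greedy behaviour.

    env_reset() -> state and env_step(state, action) -> (next_state, reward, done)
    are assumed interfaces for this sketch."""
    rng = random.Random(seed)
    q = defaultdict(float)                                       # q[(state, action)]
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(actions)                     # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = env_step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

# A tiny made-up corridor: states 0..3, actions -1/+1, reward 1 on reaching state 3.
def env_reset():
    return 0

def env_step(state, action):
    next_state = min(3, max(0, state + action))
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done

q = q_learning(env_reset, env_step, actions=[-1, 1])
print(max([-1, 1], key=lambda a: q[(0, a)]))   # the learned greedy action at state 0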
-
October 9: Reinforcement Learning.
Reading: Chapter 9, Sutton and Barto (1998).
Summary: Model-based RL (example: Dyna) and batch RL (example: Experience Replay).
Reference: Lin (1991).
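One way to picture batch RL with Experience Replay: store transitions in a buffer and repeatedly apply Q-learning backups to sampled mini-batches. The sketch below is illustrative only; the buffer contents, chain environment, and parameters are made up.

import random
from collections import defaultdict

def replay_updates(q, buffer, actions, batch_size=4, alpha=0.1, gamma=0.99, rng=None):
    """One round of Q-learning backups on a random mini-batch of stored transitions."""
    rng = rng or random.Random(0)
    for state, action, reward, next_state, done in rng.sample(buffer, min(batch_size, len(buffer))):
        best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

# Made-up (state, action, reward, next_state, done) transitions from a two-state chain;
# replaying the same experience many times propagates the terminal reward backwards.
q = defaultdict(float)
buffer = [(0, 1, 0.0, 1, False), (1, 1, 1.0, None, True), (0, -1, 0.0, 0, False)]
for _ in range(200):
    replay_updates(q, buffer, actions=[-1, 1])
print(dict(q))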
-
October 14: Reinforcement Learning.
Summary: Illustration of the need for generalisation using Half Field Offense as an example.
Reference: Kalyanakrishnan et al. (2007).
-
October 16: Reinforcement Learning.
Reading: Chapter 8, Sutton and Barto (1998).
Summary: Neural networks, linear methods, gradient descent.
-
October 21: Reinforcement Learning.
Summary: Tile coding, gradient descent with linear function
approximation, convergence results, off-policy bootstrapping.
References: Munos and Moore (2002), Sherstov and Stone (2005), Maei et al. (2010).
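A sketch of gradient descent with linear function approximation for policy evaluation, i.e. semi-gradient TD(0) with v(s) = w·x(s). The feature function stands in for a tile coder; the one-hot features and toy episode below are made up for illustration.

def semi_gradient_td0(episodes, features, num_features, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0) with a linear value function v(s) = w . x(s).

    episodes: list of episodes, each a list of (state, reward, next_state)
    transitions with next_state=None at termination.
    features(state) -> list of num_features feature values (e.g. from a tile coder)."""
    w = [0.0] * num_features

    def value(state):
        return sum(wi * xi for wi, xi in zip(w, features(state)))

    for episode in episodes:
        for state, reward, next_state in episode:
            target = reward + (gamma * value(next_state) if next_state is not None else 0.0)
            delta = target - value(state)          # TD error
            x = features(state)
            for i in range(num_features):          # gradient step along the state's features
                w[i] += alpha * delta * x[i]
    return w

# Example with made-up one-hot features over three states (a degenerate "tile coding").
feats = lambda s: [1.0 if s == i else 0.0 for i in range(3)]
print(semi_gradient_td0([[(0, 0.0, 1), (1, 1.0, None)]] * 100, feats, 3))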
-
October 23: Reinforcement Learning.
Reading: Sutton et al. (2000).
Summary: Representations, approximate value functions and policies, policy gradient theorem.
Reference: Kober and Peters (2008).
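The simplest instance of a policy-gradient update is the softmax ("gradient bandit") version on a bandit problem, where the gradient of the log-policy is available in closed form. The sketch below is illustrative only: the arm means are made up, and the REINFORCE-style update with a running-average baseline is a special case rather than the general function-approximation setting of the reading.

import math
import random

def softmax_policy_gradient(true_means, steps=5000, alpha=0.05, seed=0):
    """Policy-gradient updates of a softmax policy over per-arm preferences,
    using the sampled reward minus a running-average baseline."""
    rng = random.Random(seed)
    n = len(true_means)
    prefs = [0.0] * n
    baseline = 0.0
    for t in range(1, steps + 1):
        exps = [math.exp(h) for h in prefs]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Sample an arm from the softmax policy.
        u, acc, arm = rng.random(), 0.0, n - 1
        for a, p in enumerate(probs):
            acc += p
            if u <= acc:
                arm = a
                break
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        baseline += (reward - baseline) / t              # running average reward
        # Gradient of log pi(arm) w.r.t. preference h_a is 1{a == arm} - pi(a).
        for a in range(n):
            grad = (1.0 if a == arm else 0.0) - probs[a]
            prefs[a] += alpha * (reward - baseline) * grad
    return prefs

# Made-up arm means; the preference of the best arm (mean 0.7) should end up largest.
print(softmax_policy_gradient([0.3, 0.5, 0.7]))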
-
October 28: Learning Methods for Sequential Decision Making with Imperfect Representations.
Reference: Kalyanakrishnan (2011).
-
October 30: Multi-agent learning.
Reading: Littman (1994).
Summary: Brief introduction to reward shaping, matrix games, Minimax-optimality, Markov games, Minimax-Q algorithm.
References: Ng et al. (1999), Bowling and Veloso (2001), Hu and Wellman (2003), Shoham et al. (2006).
-
November 4: Overview of CS 748: Advances in Intelligent and Learning Agents.
-
November 6: Search.
Summary: Applications of search, relationship with MDPs, Dijkstra's algorithm for graphs, breadth-first search and depth-first search, A* search.
References: Chapter 3, Poole and Mackworth (2010), Hart, Nilsson, and Raphael (1968).
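An illustrative sketch of A* search: nodes are expanded in increasing order of g(n) + h(n), and with a zero heuristic the procedure reduces to Dijkstra's algorithm. The graph below is made up for illustration.

import heapq

def a_star(start, goal, neighbours, heuristic):
    """A* search: expand nodes in increasing order of g(n) + h(n).

    neighbours(node) -> iterable of (next_node, edge_cost);
    heuristic(node) -> admissible estimate of the remaining cost to the goal."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in neighbours(node):
            new_g = g + cost
            if new_g < best_cost.get(nxt, float('inf')):
                best_cost[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt, path + [nxt]))
    return None, float('inf')

# A made-up weighted graph; with a zero heuristic, A* behaves like Dijkstra's algorithm.
graph = {'A': [('B', 1.0), ('C', 4.0)], 'B': [('C', 1.0)], 'C': []}
print(a_star('A', 'C', lambda n: graph[n], lambda n: 0.0))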
-
November 21: End-semester examination. 9.30 a.m. – 12.30 p.m., LA 301.
Assignments
-
August 19: Programming assignment 1 announced on
Moodle. Submission deadline: 11.59 p.m., September 3, 2015.
-
September 17: Programming assignment 2 announced on
Moodle. Submission deadline: 11.59 p.m., October 3, 2015.
-
October 7: Programming assignment 3 announced on
Moodle. Submission deadline: 11.59 p.m., October 22, 2015.
-
October 25: Programming assignment 4 announced on
Moodle. Submission deadline: 11.59 p.m., November 10, 2015.