CS 747: Foundations of Intelligent and Learning Agents
(Spring 2025)

(Picture source: https://www.pexels.com/photo/monarch-butterflies-near-a-pinki-flower-13132850/)


Instructor

  Shivaram Kalyanakrishnan
  Office: 220, CC Building
  Phone: 7704
  E-mail: shivaram@cse.iitb.ac.in

Teaching Assistants

  Sandarbh Yadav
  E-mail: 22d0374@iitb.ac.in

  Anvay Shah
  E-mail: anvay@cse.iitb.ac.in

  Vedang Gupta
  E-mail: 200100166@iitb.ac.in

  Sarvesh Gharat
  E-mail: sarvesh.gharat@iitb.ac.in

  Vedant Goswami
  E-mail: vedantg@cse.iitb.ac.in

  Vivek Kumar
  E-mail: 23m0816@iitb.ac.in

Meetings

Meetings will be held during Slot 12: 5.30 p.m. – 6.55 p.m. Mondays and Thursdays in LA 001. The instructor will be available for consultation immediately following class, up to 7.30 p.m., on both Mondays and Thursdays. He will also hold office hours (220, CC Building) 9.00 a.m. – 10.00 a.m. Tuesdays.

Course Description

Today's computing systems are increasingly adaptive and autonomous: they are akin to intelligent, decision-making "agents". With its roots in artificial intelligence and machine learning, this course covers the foundational principles of designing such agents. Topics covered include: (1) agency, intelligence, and learning; (2) exploration and multi-armed bandits; (3) Markov Decision Problems and planning; (4) reinforcement learning; (5) multi-agent systems and multi-agent learning; and (6) case studies.

The course will adopt a "hands-on" approach, with programming assignments designed to highlight the relationship between theory and practice. Case studies will offer an end-to-end view of deployed agents. It is hoped that students can apply the learnings from this course to the benefit of their respective pursuits in various areas of computer science and related fields.

The Spring 2025 offering of the course will run in the flipped classroom format. After initial meetings to introduce the course, each "week" will begin with video lectures (of length 2–3 hours) being made available for the students to watch. Students are expected to view the lectures and go through reading material that is provided alongside. The first meeting of the week will be a "review and tutorial" session, in which the contents of the lectures will be reviewed and questions from the students addressed. The instructor will also take up problem-solving and/or programming exercises in this session to demonstrate and reinforce the concepts covered in the lectures. The second meeting of the week will be for a test based on the week's lectures.

Eligibility

The course is open to all Ph.D. students, all master's students, and undergraduate/dual-degree students in their fourth or higher year of study. The course is also open to undergraduate/dual-degree students in their third year of study, provided their CPI is 8.50 or higher. The instructor regrets having to deny registration to many interested undergraduate/dual-degree students in their third year; the restriction is necessary to keep the class strength within the capacity of the largest available classroom on campus. Note that the CPI threshold of 8.50 may be increased during the first week of classes in case the registration count exceeds the cap imposed by the institute; if so, third-year undergraduate/dual-degree students who do not meet the updated threshold must drop the course. Note that no exceptions will be made.

Prerequisites

The course does not formally have other courses as prerequisites. However, lectures and assignments will assume that the student is comfortable with probability and algorithms. Introduction to Probability by Grinstead and Snell is an excellent resource on basic probability. Any student who is not comfortable with the contents of chapters 1 through 7 (and is unable to solve the exercises) is advised against taking CS 747.

The course has an intensive programming component: based on ideas discussed in class, the student must be able to independently design, implement, and evaluate programs in Python. The student must be prepared to spend a significant amount of time on the programming component of the course.

Students who are unsure about their preparedness for taking the course are strongly advised to watch the lectures from weeks 1, 2, and 3 of the Autumn 2020 offering, to attempt the quizzes from those weeks, and also to go through Programming Assignment 1. If they are unable to get a reasonable grasp of the material or to negotiate the quizzes and programming assignment, they are advised against taking CS 747.

Evaluation

Grades will be based on (1) weekly tests, together contributing 24 marks; (2) three programming assignments, each worth 12 marks; (3) a mid-semester examination worth 15 marks; and (4) an end-semester examination worth 25 marks. All assessments will be based on individual work.

There will be 10 or more weekly tests, each for 3 marks. The 8 best scores from these tests will count towards the aggregate of 24.
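
As a rough illustration of how the marks described above add up, here is a minimal Python sketch. The function names and the sample scores are hypothetical and used only for illustration; this is not the official grading script, only a reading of the best-8 rule and the 24 + 36 + 15 + 25 split stated above.

```python
# Hypothetical helper names; the mark split (24 + 3*12 + 15 + 25 = 100)
# and the best-8 rule are taken from the Evaluation text.

def weekly_test_aggregate(scores, best_n=8):
    """Sum of the best_n highest weekly test scores (each test is out of 3)."""
    return sum(sorted(scores, reverse=True)[:best_n])


def course_total(weekly_scores, assignment_marks, midsem, endsem):
    """Aggregate out of 100: weekly tests (24), assignments (36), exams (15 + 25)."""
    return (weekly_test_aggregate(weekly_scores)
            + sum(assignment_marks) + midsem + endsem)


# Made-up scores for 11 weekly tests; only the best 8 count.
weekly = [3, 2, 1.5, 3, 0, 2.5, 3, 2, 1, 3, 2]
print(weekly_test_aggregate(weekly))              # 20.5
print(course_total(weekly, [10, 12, 9], 11, 20))  # 82.5
```

In this example the three lowest weekly scores (1.5, 1, and 0) are dropped before the aggregate is computed.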

The programming assignments must be turned in through Moodle. Late submissions will not be evaluated; they will receive no marks.

Students auditing the course must score 50 or more marks in the course to be awarded an "AU" grade.

Evaluation will be contingent on the student agreeing to comply with the course policies on academic honesty and submissions.

Make-up Test

Students who encounter any medical issues during the course must write to the instructor as soon as possible with an official record of their sickness. If they are unable to appear in either of the exams, or to submit any of the three programming assignments, due to sickness, they may request to be re-evaluated.

Weekly tests that are missed due to medical issues will not be compensated, unless a student has fewer than 8 tests for which they were medically fit. Hence, for example, if 11 weekly tests were conducted, of which the student was medically fit for 9, then they will not have a make-up for the weekly tests. On the other hand, a student who was medically fit for only 5 tests can make up for 3 tests.
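
The rule above can be read as a one-line calculation. The sketch below is one interpretation of the stated policy, with a hypothetical function name; the instructor's decision in any actual case takes precedence.

```python
def makeup_tests_allowed(tests_fit_for, required=8):
    """Number of weekly tests a student may make up.

    tests_fit_for: weekly tests for which the student was medically fit.
    required: number of weekly tests that count towards the aggregate (8).
    """
    return max(0, required - tests_fit_for)


print(makeup_tests_allowed(9))  # 0: fit for 9 tests, so no make-up
print(makeup_tests_allowed(5))  # 3: fit for only 5 tests, so 3 can be made up
```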

A single make-up test will be given to deal with all re-evaluation requests. Questions will be drawn from the entire syllabus, rather than the specific portion(s) a student has missed. This test will be held after the end-semester exams are completed for the semester, and on or before the date marked "Last date for showing evaluated answer scripts" in the academic calendar. Students who wish to take this test must plan/arrange to be physically present until the "Last date for showing evaluated answer scripts".

Moodle

Moodle will be the primary course management system. Marks for the assessments will be maintained on the class Moodle page; discussion fora will also be hosted on Moodle. Students who do not have an account on Moodle for the course must send TA Sarvesh Gharat a request by e-mail, specifying the roll number/employee number for account creation.

Academic Honesty

Students are expected to adhere to the highest standards of integrity and academic honesty. Academic violations, as detailed below, will be dealt with strictly, in accordance with the institute's procedures and disciplinary actions for academic malpractice.

Students are expected to work alone on all the programming assignments and the examinations. While they are free to discuss the material presented in class with their peers, they must not discuss the contents of the programming assignments (neither the questions, nor the solutions) with classmates (or anybody other than the instructor and TAs). They must not share code, even if it only pertains to operations that are perceived not to be relevant to the core logic of the assessment (for example, file-handling and plotting). They also may not look at solutions to the given quiz/assignment or related ones on the Internet. Violations will be considered acts of dishonesty.

Students are allowed to use resources on the Internet for programming (say to understand a particular command or a data structure), and also to understand concepts (so a Wikipedia page or someone's lecture notes or a textbook can certainly be consulted). It is also okay to use libraries or code snippets for portions unrelated to the core logic of the assignment—typically for operations such as moving data, network communication, etc. Querying LLMs for code snippets is discouraged, but acceptable for portions unrelated to the core logic of the assignment, as illustrated above. However, students must cite every resource consulted or used, whatever be the reason, in a file named references.txt, which must be included in the submission. If LLMs have been queried, each query must be reported verbatim, along with a link to the LLM user interface. Failure to list any resource or record LLM usage as detailed above will be considered an academic violation.

Copying or consulting any external sources during the examination will be treated as cheating.

If in any doubt as to what is legitimate collaboration and what is not, students must ask the instructor.

Texts and References

Artificial Intelligence: Foundations of Computational Agents, David L. Poole and Alan K. Mackworth, 3rd edition, Cambridge University Press, 2023. On-line version.

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd edition, MIT Press, 2018. On-line version.

Algorithms for Reinforcement Learning, Csaba Szepesvári, Morgan & Claypool, 2009. On-line version.

Selected research papers.

On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples
William R. Thompson, 1933
Some Studies in Machine Learning Using the Game of Checkers
Arthur L. Samuel, 1959
Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching
Long-ji Lin, 1992
Practical Issues in Temporal Difference Learning
Gerald Tesauro, 1992
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Ronald J. Williams, 1992
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
Reinforcement Learning with Replacing Eligibility Traces
Satinder P. Singh and Richard S. Sutton, 1996
Elevator Group Control Using Multiple Reinforcement Learning Agents
Robert H. Crites and Andrew G. Barto, 1998
Learning to Trade via Direct Reinforcement
John Moody and Matthew Saffell, 2001
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
Autonomous helicopter flight via reinforcement learning
Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry, 2003
Tree-based Batch Mode Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel, 2005
Bandit based Monte-Carlo Planning
Levente Kocsis and Csaba Szepesvári, 2006
Batch Reinforcement Learning in a Complex Domain
Shivaram Kalyanakrishnan and Peter Stone, 2007
Adaptive Treatment of Epilepsy via Batch-mode Reinforcement Learning
Arthur Guez, Robert D. Vincent, Massimo Avoli, and Joelle Pineau, 2008
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
Engin İpek, Onur Mutlu, José F. Martínez, and Rich Caruana, 2008
Reinforcement learning of motor skills with policy gradients
Jan Peters and Stefan Schaal, 2008
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011
On Optimizing Interdependent Skills: A Case Study in Simulated 3D Humanoid Robot Soccer
Daniel Urieli, Patrick MacAlpine, Shivaram Kalyanakrishnan, Yinon Bentor, and Peter Stone, 2011
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, 2015
Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis, 2016
Spatial interactions and optimal forest management on a fire-threatened landscape
Christopher J. Lauer, Claire A. Montgomery, and Thomas G. Dietterich, 2017
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis, 2018
Optimising a Real-time Scheduler for Railway Lines using Policy Search
Rohit Prasad, Harshad Khadilkar, and Shivaram Kalyanakrishnan, 2020
Deep Reinforcement Learning for Autonomous Driving: A Survey
B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez, 2021

Communication

This page will serve as the primary source of information regarding the course, the schedule, and related announcements. The Moodle page for the course will primarily be used for recording grades.

E-mail to the instructor must contain "[CS747]" in the header.

Course videos are available through CDEEP; IIT Bombay students are encouraged to log in and watch them through this site.

Schedule

Week 0

Monday, January 6: Welcome; Introduction to the course.
Lecture 0: Slides.

Thursday, January 9: Problem-solving related to probability, algorithms, programming.
Code: bulls-eye.py.

Week 1

Lecture 1: Video; Slides.
Summary: Coin-tossing game; Definition of stochastic multi-armed bandit; Definition of algorithm, ε-first and ε-greedy algorithms.
Reading: Sections 2, 2.1, 2.2, 2.3, Sutton and Barto (2018).
Code and data: coins.py, biases.txt, eg.py (two bugs in the code demo-ed in the lecture are fixed in this version).
Practice question: Question 1 in 2016 mid-semester paper.

Lecture 2: Video; Slides.
Summary: Graph of E[r^t] versus t; Definition of regret; Achieving sublinear regret with GLIE sampling; Lai and Robbins's lower bound on regret.
References: Class Note 1; Theorem 1, Lai and Robbins (1985).
Practice question: Question 1 in 2018 mid-semester paper.

Test

Week 2

Lecture 1: Video; Slides.
Summary: UCB, KL-UCB, Thompson Sampling algorithms. (Section on concentration bounds in slides not covered in video.)
Reading: Section 1, Figure 1, Theorem 1, Auer et al. (2002); Sections 1–3, Garivier and Cappé (2011); Chapelle and Li (2011).
References: Thompson (1933), Kaufmann et al. (2012).
Practice question: Part a from Week 3 question in 2021 weekly quizzes.

Lecture 2: Video; Slides. (See only Section 1 on concentration bounds. Skip the proof of UCB's regret upper bound.)
Summary: Hoeffding's Inequality, "KL" Inequality.
Reading: Wikipedia page on Hoeffding's Inequality.
References: Hoeffding (1963); Mulzer (2019).
Practice question: Question 1 in 2017 mid-semester paper.

Monday, January 20: Review and tutorial.

Thursday, January 23: Test (graded by Vedang Gupta).

Week 3

Lecture 1: Video; Slides.
Summary: Proof of upper bound on the regret of UCB. (See only Section 2 on the proof of UCB's regret upper bound.)
Reading: Proof of Theorem 1, Auer et al. (2002).

Lecture 2: Video; Slides.
Summary: Interpretation of Thompson Sampling; Survey of bandit formulations.
Reference: Wikipedia page on Bayesian inference.
Practice question: Week 3 question in 2020 weekly quizzes.

Monday, January 27: Review and tutorial.

Thursday, January 30: Test (graded by Sarvesh Gharat).

Week 4

Lecture 1: Video; Slides.
Summary: Definition of Markov Decision Problem, policy, and value function; Existence of optimal policy; MDP planning problem; Bellman equations.
Reading: Chapter 3, Sutton and Barto (2018).
Practice question: Question 3 in 2015 mid-semester paper.

Lecture 2: Video; Slides.
Summary: Continuing and episodic tasks; Infinite-discounted, total, finite-horizon, and average reward structures; Applications of MDPs.
References: Section 2.2, Mahadevan (1996); Ng et al. (2003); Lauer et al. (2017).
Practice question: Question 7 in 2018 end-semester paper.

Monday, February 3: Review and tutorial.
Code: mdp-simulate.py.

Thursday, February 6: Test (graded by Sandarbh Yadav).

Week 5

Lecture 1: Video; Slides.
Summary: Banach's Fixed-point Theorem; Bellman optimality operator; Proof of contraction under max norm; Value iteration.
Reading: Appendix A, Szepesvári (2009).
Practice question: Week 5 question in 2021 weekly quizzes.

Lecture 2: Video; Slides. There is a typo in constraint C2 of the LP example. See Slide 4 in the 2023 slides for the correct version.
Summary: Linear programming and its application to MDP planning.
Reference: Littman et al. (1995).
Practice question: Question 4 in 2020 end-semester paper.

Monday, February 10: Review and tutorial.
Code: vi.py.

Thursday, February 13: Test (graded by Anvay Shah).

Week 6

Lecture 1: Video; Slides.
Summary: Action value function; Policy improvement; Bellman operator; Proof of Policy improvement theorem; Policy Iteration family of algorithms; Effect of history and stochasticity.
Reading: Sections 1 and 2, Kalyanakrishnan et al. (2016).
Reference: Class Note 2.
Practice question: Week 6 question in 2020 weekly quizzes.

Lecture 2: Video; Slides.
Summary: Complexity bounds for MDP planning; Analysis of Howard's PI and Batch-switching PI.
References: Howard (1960); Melekopoglou and Condon (1994); Mansour and Singh (1999); Hollanders (2012); Hansen et al. (2014); Hollanders et al. (2014); Gerencsér et al. (2015); Kalyanakrishnan et al. (2016); Kalyanakrishnan et al. (2016a); Gupta and Kalyanakrishnan (2017); Taraviya and Kalyanakrishnan (2019); Ashutosh et al. (2020).

Monday, February 17: Review and tutorial.

Thursday, February 20: Test (graded by Vedang Gupta).

Mid-semester examination

6.00 p.m. – 8.00 p.m., Thursday, February 27. LA 001 and LA 002.

Week 7

Monday, March 3: Lecture 1 (in class). Slides.
Summary: The Reinforcement Learning problem; Upcoming topics; Applications.
References: Tesauro (1992); Silver et al. (2018); Ng et al. (2003); Mnih et al. (2015); İpek et al. (2008); Guez et al. (2008); Moody and Saffell (2001).

Thursday, March 6: Lecture 2 (in class). Slides.
Summary: Prediction and control problems; Ergodic MDPs; Model-based algorithm for acting optimally in the limit.
Reading: Class Note 3.
Reference: Wikipedia page on Ergodic Markov chains.
Practice question: Question 3d in 2015 mid-semester paper.

Week 8

Monday, March 10: Lecture 1 (in class). Slides.
Reading: Sections 5, 5.1, Sutton and Barto (2018).
References: Robbins and Monro (1951), Singh and Sutton (1996).
Code: montecarlo.py.
Practice question: Question 2 in 2015 end-semester paper.