CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2025)
(Picture source: https://www.flickr.com/photos/wbaiv/26563345952/in/photostream/ shared under CC BY 4.0.)
Instructor
Shivaram Kalyanakrishnan
Office: 220, CC Building
Phone: 7704
E-mail: shivaram@cse.iitb.ac.in
Teaching Assistants
Sandarbh Yadav
E-mail: 22d0374@iitb.ac.in
Anvay Shah
E-mail: anvay@cse.iitb.ac.in
Ishaan Chandra
E-mail: ishaan@cse.iitb.ac.in
Ankish Chandresh
E-mail: ankish.chandresh@iitb.ac.in
Aditya Singh
E-mail: 22b1844@iitb.ac.in
N Gurupoorna
E-mail: 22b1831@iitb.ac.in
Urvi Gupta
E-mail: 22b1006@iitb.ac.in
Aryan Gupta
E-mail: 22b2255@iitb.ac.in
Dheeraj Kurukunda
E-mail: 22b0935@iitb.ac.in
Sai Sriram Sambaraju
E-mail: 22b0940@iitb.ac.in
Meetings
Meetings will be held during Slot 14: 5.30 p.m. – 6.55
p.m. Tuesdays and Fridays in LA 001. The instructor will be
available for consultation immediately following class, up to 7.30
p.m., on both Tuesdays and Fridays. He will also hold office hours
(220, CC Building) 9.15 a.m. – 10.15 a.m. Wednesdays.
Course Description
Today's computing systems are increasingly adaptive and autonomous:
they are akin to intelligent, decision-making "agents". With its roots
in artificial intelligence and machine learning, this course covers
the foundational principles of designing such agents. Topics covered
include: (1) agency, intelligence, and learning; (2) exploration and
multi-armed bandits; (3) Markov Decision Problems and planning; (4)
reinforcement learning; (5) multi-agent systems and multi-agent
learning; and (6) case studies.
The course will adopt a "hands-on" approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies will offer an end-to-end view of deployed
agents. It is hoped that students can apply the learnings from this
course to the benefit of their respective pursuits in various areas of
computer science and related fields.
The Autumn 2025 offering of the course will run in the flipped
classroom format. After initial meetings to introduce the
course, each "week" will begin with video lectures (of length
2–3 hours) being made available for the students to
watch. Students are expected to view the lectures and go through
reading material that is provided alongside. The first meeting of
the week will be a "review and tutorial" session, in which the
contents of the lectures will be reviewed and questions from the
students addressed. The instructor will also take up
problem-solving and/or programming exercises in this session to
demonstrate and reinforce the concepts covered in the lectures. The
second meeting of the week will be for a test based on the week's
lectures.
Eligibility
The course is open to all Ph.D. students, all masters students, and
undergraduate/dual-degree students in their fourth or higher year of
study. The course is also open to undergraduate/dual-degree students
in their third year of study, although an institute-mandated limit on
the class size often means some undergraduate/dual-degree students in
their third year may not be able to register. Registration and
waitlists are managed by ASC; the instructor will not intervene to
make special allowances for any student.
Prerequisites
The course does not formally have other courses as
prerequisites. However, lectures and assignments will assume that the
student is comfortable with probability and
algorithms. Introduction
to Probability by Grinstead and Snell is an excellent resource on
basic probability. Any student who is not comfortable with the
contents of chapters 1 through 7 (and is unable to solve the
exercises) is advised against taking CS 747.
The course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in Python. The student
must be prepared to spend a significant amount of time on the
programming component of the course.
Students who are unsure about their preparedness for taking the
course are strongly advised to watch the lectures from weeks 1, 2, and
3 of
the Autumn
2020 offering, to attempt the quizzes from those weeks, and also
to go through Programming Assignment 1. If they are unable to get a
reasonable grasp of the material or to negotiate the quizzes and
programming assignment, they are advised against taking CS 747.
Evaluation
Grades will be based on (1) weekly tests, together contributing 24
marks; (2) three programming assignments, each worth 12 marks; (3) a
mid-semester examination worth 15 marks; and (4) an end-semester
examination worth 25 marks (for a total of 100 marks). All assessments will be based on
individual work.
There will be 10 or more weekly tests, each for 3 marks. The 8 best
scores from these tests will count towards the aggregate of 24.
The programming assignments must be turned in through Moodle. Late
submissions will not be evaluated; they will receive no marks.
Students auditing the course must score 50 or more marks in the
course to be awarded an "AU" grade.
Evaluation will be contingent on the student agreeing to comply
with the course policies on academic honesty and submissions.
Make-up Test
Students who encounter any medical issues during the course must
write to the instructor as soon as possible with an official record
of their sickness. If they are unable to appear in either of the
exams, or to submit any of the three programming assignments, due to
sickness, they may request to be re-evaluated.
Weekly tests that are missed due to medical issues will not be
compensated unless the student was medically fit for fewer than 8 of
the tests. Hence, for example, if 11 weekly tests were
conducted, of which the student was medically fit for 9, then they
will not have a make-up for the weekly tests. On the other hand, a
student who was medically fit for only 5 tests can make up for 3
tests.
A single make-up test will be given to deal with all re-evaluation
requests. Questions will be drawn from the entire syllabus, rather
than the specific portion(s) a student has missed. This test will be
held after the end-semester exams are completed for the semester,
and on or before the date marked "Last date for showing evaluated
answer scripts" in the academic calendar. Students who wish to take
this test must plan/arrange to be physically present until the "Last
date for showing evaluated answer scripts".
Moodle
Moodle will be the primary course management system. Marks for the
assessments will be maintained on the class Moodle page; discussion
fora will also be hosted on Moodle. Students who do not have an
account on Moodle for the course must send TA Ishaan Chandra a
request by e-mail, specifying the roll number/employee number for
account creation.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Academic violations, as detailed
below, will be dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Students are expected to work alone on all the programming
assignments and the examinations. While they are free to discuss the
material presented in class with their peers, they must not discuss
the contents of the programming assignments (neither the questions nor the
solutions) with classmates (or anybody other than the instructor and
TAs). They must not share code, even if it only pertains to
operations that are perceived not to be relevant to the core
logic of the assessment (for example, file-handling and
plotting). They also may not look at solutions to the given
quiz/assignment or related ones on the Internet. Violations will be
considered acts of dishonesty.
Students are allowed to use resources on the Internet for
programming (say to understand a particular command or a data
structure), and also to understand concepts (so a Wikipedia page or
someone's lecture notes or a textbook can certainly be
consulted). It is also okay to use libraries or code snippets for
portions unrelated to the core logic of the
assignment—typically for operations such as moving data,
network communication, etc. Querying LLMs for code snippets is
discouraged, but acceptable for portions unrelated to the core logic
of the assignment, as illustrated above. However, students must
cite every resource consulted or used, whatever the
reason, in a file named references.txt, which must be
included in the submission. If LLMs have been queried, each query
must be reported verbatim, along with a link to the LLM user
interface. Failure to list any resource or record LLM usage as
detailed above will be considered an academic violation.
Copying or consulting any external sources during the examination
will be treated as cheating.
If in any doubt as to what is legitimate collaboration and what is
not, students must ask the instructor.
Texts and References
Artificial Intelligence: Foundations of Computational Agents, David
L. Poole and Alan K. Mackworth, 3rd edition, Cambridge
University Press,
2023. On-line
version.
Reinforcement Learning: An Introduction, Richard S. Sutton and
Andrew G. Barto, 2nd edition, MIT Press,
2018. On-line
version.
Algorithms for Reinforcement Learning, Csaba Szepesvári,
Morgan & Claypool,
2009. On-line
version.
Selected research papers.
-
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples
William R. Thompson, 1933
-
A
Stochastic Approximation Method
Herbert Robbins and Sutton Monro, 1951
-
Probability inequalities for sums of bounded random variables
Wassily Hoeffding, 1963
-
Asymptotically
Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985
-
Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching
Long-Ji Lin, 1992
-
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Ronald J. Williams, 1992
-
Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman, 1994
-
On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995
-
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996
-
Elevator Group Control Using Multiple Reinforcement Learning Agents
Robert H. Crites and Andrew G. Barto, 1998
-
Rational and Convergent Learning in Stochastic Games
Michael Bowling and Manuela Veloso, 2001
-
Finite-time
Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002
-
Tree-based Batch Mode Reinforcement Learning
Damien Ernst, Pierre Geurts, and Louis Wehenkel, 2005
-
Bandit based Monte-Carlo Planning
Levente Kocsis and Csaba Szepesvári, 2006
-
If multi-agent learning is the answer, what is the question?
Yoav Shoham, Rob Powers, and Trond Grenager, 2006
-
Half
Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning
Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007
-
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011
-
Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011
-
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011
-
On Optimizing Interdependent Skills: A Case Study in Simulated 3D Humanoid Robot Soccer
Daniel Urieli, Patrick MacAlpine, Shivaram Kalyanakrishnan, Yinon Bentor, and Peter Stone, 2011
-
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012
-
Deterministic Policy Gradient Algorithms
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller, 2014
-
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel
Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas
K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir
Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan
Wierstra, Shane Legg, and Demis Hassabis, 2015
-
Randomised Procedures for Initialising and Switching Actions in Policy Iteration
Shivaram Kalyanakrishnan, Neeldhara Misra, and Aditya Gopalan, 2016
-
Spatial interactions and optimal forest management on a fire-threatened landscape
Christopher J. Lauer, Claire A. Montgomery, and Thomas G. Dietterich, 2017
-
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017
-
Five Proofs of Chernoff's Bound with Applications
Wolfgang Mulzer, 2019
-
Optimising a Real-time Scheduler for Railway Lines by Policy Search
Rohit Prasad, Harshad Khadilkar, and Shivaram Kalyanakrishnan, 2020
-
FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance
Xiao-Yang Liu, Hongyang Yang, Qian Chen, Runjia Zhang, Liuqing Yang, Bowen Xiao, and Christina Dan Wang, 2022
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will primarily be used for recording grades.
E-mail to the instructor must contain "[CS747]" in the subject line.
Course videos are available through CDEEP; IIT Bombay students are
encouraged to log in and watch them
through this site.
Schedule
-
Week 0
Tuesday, July 29: Welcome; Introduction to the course.
Lecture 0: Slides.
Friday, August 1: Problem-solving related to probability, programming.
Code: throws.py.
-
Week 1
Lecture 1: Video; Slides.
Summary: Coin-tossing game; Definition of stochastic multi-armed bandit; Definition of algorithm, ε-first and ε-greedy algorithms. A minimal illustrative ε-greedy sketch appears at the end of this week's entry.
Reading: Sections 2, 2.1, 2.2, 2.3, Sutton and Barto (2018).
Code and data: coins.py, biases.txt, eg.py (two bugs in the code demonstrated in the lecture are fixed in this version).
Practice question: Question 1 in 2016 mid-semester paper.
Lecture 2: Video; Slides.
Summary: Graph of E[r^t] versus t; Definition of regret; Achieving sublinear regret with GLIE sampling; Lai and Robbins's lower bound on regret.
References: Class Note 1; Theorem 1, Lai and Robbins (1985).
Practice question: Question 1 in 2018 mid-semester paper.
Tuesday, August 5: Review and tutorial.
Friday, August 8: Test (graded by Ishaan Chandra).
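The following minimal sketch (not the course's eg.py) illustrates the ε-greedy algorithm from Lecture 1 on Bernoulli arms; the arm means, horizon, and ε below are made-up values for illustration.

    # Minimal sketch of epsilon-greedy sampling on Bernoulli arms.
    # Arm means, horizon, and epsilon are illustrative assumptions.
    import random

    def epsilon_greedy(means, horizon, epsilon=0.1):
        n = len(means)
        counts = [0] * n          # number of pulls of each arm
        values = [0.0] * n        # empirical mean reward of each arm
        total_reward = 0
        for t in range(horizon):
            if random.random() < epsilon:
                arm = random.randrange(n)                        # explore
            else:
                arm = max(range(n), key=lambda a: values[a])     # exploit
            reward = 1 if random.random() < means[arm] else 0    # Bernoulli reward
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
            total_reward += reward
        return total_reward

    print(epsilon_greedy([0.2, 0.5, 0.8], horizon=10000))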
-
Week 2
Lecture 1: Video; Slides.
Summary: UCB, KL-UCB, Thompson Sampling algorithms. (Section on concentration bounds in slides not covered in video.) A minimal illustrative UCB sketch appears at the end of this week's entry.
Reading: Section 1, Figure 1, Theorem 1, Auer et al. (2002); Sections 1–3, Garivier and Cappé (2011); Chapelle and Li (2011).
References: Thompson (1933), Kaufmann et al. (2012).
Practice question: Part a from Week 3 question in 2021 weekly quizzes.
Lecture 2: Video; Slides. (See only Section 1 on concentration bounds. Skip the proof of UCB's regret upper bound.)
Summary: Hoeffding's Inequality, "KL" Inequality.
Reading: Wikipedia page on Hoeffding's Inequality.
References: Hoeffding (1963); Mulzer (2019).
Practice question: Question 1 in 2017 mid-semester paper.
Tuesday, August 12: Review and tutorial.
(Friday, August 15 is a holiday. There will be no meeting.)
Tuesday, August 19: Test (graded by N Gurupoorna).
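The following minimal sketch illustrates the UCB algorithm of Auer et al. (2002) on Bernoulli arms; the arm means and horizon are made-up values for illustration.

    # Minimal sketch of UCB (Auer et al., 2002) on Bernoulli arms.
    import math, random

    def ucb(means, horizon):
        n = len(means)
        counts = [0] * n
        values = [0.0] * n
        total_reward = 0
        for t in range(horizon):
            if t < n:
                arm = t        # pull each arm once to initialise
            else:
                # pick the arm maximising the upper confidence bound
                arm = max(range(n), key=lambda a:
                          values[a] + math.sqrt(2 * math.log(t) / counts[a]))
            reward = 1 if random.random() < means[arm] else 0
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]
            total_reward += reward
        return total_reward

    print(ucb([0.2, 0.5, 0.8], horizon=10000))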
-
Week 3
Lecture 1: Video; Slides.
Summary: Proof of upper bound on the regret of UCB. (See only Section 2 on the proof of UCB's regret upper bound.)
Reading: Proof of Theorem 1, Auer et al. (2002).
Lecture 2: Video; Slides.
Summary: Interpretation of Thompson Sampling; Survey of bandit formulations.
Reference: Wikipedia page on Bayesian inference.
Practice question: Week 3 question in 2020 weekly quizzes.
Friday, August 22: Review and tutorial.
Code: prob-ub.py.
Tuesday, August 26: Test (graded by Sai Sriram Sambaraju).
-
Week 4
Lecture 1: Video; Slides.
Summary: Definition of Markov Decision Problem, policy, and value function; Existence of optimal policy; MDP planning problem; Bellman equations. An illustrative sketch of solving the Bellman equations for a fixed policy appears at the end of this week's entry.
Reading: Chapter 3, Sutton and Barto (2018).
Practice question: Question 3 in 2015 mid-semester paper.
Lecture 2: Video; Slides.
Summary: Continuing and episodic tasks; Infinite-discounted, total,
finite-horizon, and average reward structures; Applications of MDPs.
References: Section 2.2, Mahadevan (1996); Ng et al. (2003); Lauer et al. (2017).
Practice question: Question 7 in 2018 end-semester paper.
Friday, August 29: Review and tutorial.
Tuesday, September 2: Test (graded by Urvi Gupta).
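The following sketch illustrates that, for a fixed policy, the Bellman equations form a linear system V = R + γPV that can be solved directly; the two-state MDP below is a made-up example, not one from the lectures.

    # Minimal sketch: policy evaluation by solving the Bellman equations
    # V = R + gamma * P V as a linear system.  P and R are made-up data.
    import numpy as np

    gamma = 0.9
    P = np.array([[0.7, 0.3],    # P[s][s']: transition probabilities under the policy
                  [0.4, 0.6]])
    R = np.array([1.0, 0.0])     # expected one-step reward from each state

    # (I - gamma P) V = R
    V = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(V)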
-
Weeks 5 and 6 (combined)
Lecture 1: Video; Slides.
Summary: Banach's Fixed-point Theorem; Bellman optimality operator; Proof of contraction under max norm; Value iteration. An illustrative value iteration sketch appears at the end of this entry.
Reading: Appendix A, Szepesvári (2009).
Practice question: Week 5 question in 2021 weekly quizzes.
Lecture 2: Video; Slides. There is a typo in constraint C2 of the LP example. See Slide 4 in the 2023 slides for the correct version.
Summary: Linear programming and its application to MDP planning.
Reference: Littman et al. (1995).
Practice question: Question 4 in 2020 end-semester paper.
Lecture 3: Video; Slides.
Summary: Action value function; Policy improvement; Bellman operator; Proof of Policy improvement theorem; Policy Iteration family of algorithms; Effect of history and stochasticity.
Reading: Sections 1 and 2, Kalyanakrishnan et al. (2016).
Reference: Class Note 2.
Practice question: Week 6 question in 2020 weekly quizzes.
(Friday, September 5 is a holiday. There will be no meeting.)
Tuesday, September 9: Review and tutorial (lectures 1 and 2).
Code: bellman-optimality-operator.py.
Friday, September 12: Review and tutorial (Lecture 3).
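The following sketch (not the course's bellman-optimality-operator.py) illustrates value iteration from Lecture 1: repeatedly apply the Bellman optimality operator until successive value functions are close under the max norm. The two-state, two-action MDP below is a made-up example.

    # Minimal sketch of value iteration on a made-up 2-state, 2-action MDP.
    import numpy as np

    gamma = 0.9
    # T[s][a] is a list of (probability, next_state, reward) triples -- assumed data.
    T = {
        0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
    }

    V = np.zeros(len(T))
    while True:
        # apply the Bellman optimality operator to V
        V_new = np.array([max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                              for a in T[s])
                          for s in T])
        if np.max(np.abs(V_new - V)) < 1e-8:   # max-norm stopping criterion
            break
        V = V_new
    print(V)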
-
Mid-semester examination
6.30 p.m. – 8.30 p.m., Sunday, September 14. LA 001 and LA 002.
A single 3-mark test for weeks 5 and 6 combined will be included in the mid-semester paper.
-
Week 7
Tuesday, September 23: Lecture 1 (in class).
Lecture 1: Slides.
Summary: The Reinforcement Learning problem; Upcoming topics; Applications.
References: Tesauro (1992); Silver et al. (2018); Ng et al. (2003); Mnih et al. (2015); İpek et al. (2008); Guez et al. (2008); Moody and Saffell (2001).
Lecture 2: Video, Slides.
Summary: Prediction and control problems; Ergodic MDPs; Model-based algorithm for acting optimally in the limit; Monte Carlo methods for prediction.
Reading: Class Note 3; Sections 5, 5.1, Sutton and Barto (2018).
Reference: Wikipedia page on Ergodic Markov chains.
Friday, September 26: Review and tutorial (Lecture 2).
Tuesday, September 30: Test (graded by Dheeraj Kurukunda).
-
Week 8
Lecture 1: Video, Slides.
Summary: Maximum likelihood estimates and least squares estimates; On-line implementation of Monte Carlo policy evaluation; Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected
Sarsa. A minimal Q-learning sketch appears at the end of this week's entry.
Reading: Sections 6, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, Sutton and
Barto (2018).
Reference: Robbins and Monro (1951).
Friday, October 3: Review and tutorial.
Code: control.py.
Tuesday, October 7: Test (graded by Aryan Gupta).
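The following sketch (not the course's control.py) illustrates tabular Q-learning with ε-greedy exploration. The chain environment, step size, and episode count below are made-up values for illustration.

    # Minimal sketch of tabular Q-learning on a made-up chain environment.
    import random

    N = 5                          # states 0 .. N-1 on a chain; state N-1 is terminal
    ACTIONS = [-1, +1]             # move left or move right
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    Q = [[0.0, 0.0] for _ in range(N)]

    def step(s, a):
        s2 = min(max(s + a, 0), N - 1)
        reward = 1.0 if s2 == N - 1 else 0.0
        return s2, reward, s2 == N - 1

    def greedy(s):
        best = max(Q[s])
        return random.choice([a for a in range(2) if Q[s][a] == best])  # break ties randomly

    for episode in range(2000):
        s, done = 0, False
        while not done:
            a = random.randrange(2) if random.random() < epsilon else greedy(s)
            s2, r, done = step(s, ACTIONS[a])
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])    # Q-learning update
            s = s2

    print(Q)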
-
Week 9
Lecture 1: Video, Slides.
Summary: n-step returns; TD(λ) algorithm; Need for generalisation in practice; Soccer as illustrative example; Linear function approximation and geometric view; Linear TD(λ). A minimal linear TD(λ) sketch appears at the end of this week's entry.
Reading: Sections 7, 7.1, 9, 9.1, 9.2, 9.3, 9.4, 12, 12.1, 12.2,
Sutton and Barto (2018).
Reference: Kalyanakrishnan et al. (2007).
Friday, October 10: Review and tutorial.
References: Crites and Barto (1998), Liu et al. (2022).
Tuesday, October 14: Test (graded by Ankish Chandresh).
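The following sketch illustrates linear TD(λ) with accumulating eligibility traces, evaluating the uniformly random policy on a small random walk; the environment, one-hot features, step size, and λ below are made-up choices for illustration.

    # Minimal sketch of linear TD(lambda) with accumulating traces on a
    # made-up random walk: exit right for reward 1, exit left for reward 0.
    import random
    import numpy as np

    N = 5                                  # non-terminal states 0 .. 4
    def features(s):
        x = np.zeros(N); x[s] = 1.0        # one-hot features (tabular special case)
        return x

    alpha, gamma, lam = 0.1, 1.0, 0.8
    w = np.zeros(N)                        # weights; V(s) is approximated by w . features(s)

    for episode in range(2000):
        s = N // 2                         # start in the middle state
        z = np.zeros(N)                    # eligibility trace
        while True:
            s2 = s + random.choice([-1, 1])      # uniformly random policy
            r = 1.0 if s2 == N else 0.0          # reward 1 on exiting to the right
            done = s2 < 0 or s2 == N
            v_s = w @ features(s)
            v_s2 = 0.0 if done else w @ features(s2)
            delta = r + gamma * v_s2 - v_s       # TD error
            z = gamma * lam * z + features(s)    # accumulate the trace
            w += alpha * delta * z               # TD(lambda) update
            if done:
                break
            s = s2

    print(w)   # estimated state values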
-
Week 10
Lecture 1: Video, Slides.
Summary: Tile coding; Control with function approximation;
Tsitsiklis and Van Roy's counterexample; Policy search; Case
studies: humanoid robot soccer, railway scheduling. A minimal tile-coding sketch appears at the end of this week's entry.
Reading: Sections 9.5, 9.6, 9.7, 11, 11.1, 11.2, 11.3, Sutton and
Barto (2018).
References: Urieli et al. (2011), Prasad et al. (2020).
Friday, October 17: Review and tutorial.
Reference: Chapter 4, Kalyanakrishnan (2011).
Tuesday, October 21: Test (graded by Aditya Singh).
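The following sketch illustrates tile coding for a one-dimensional input in [0, 1): several tilings partition the interval into tiles, each tiling offset from the others, and the binary feature vector has exactly one active tile per tiling. The numbers of tilings and tiles below are made-up values.

    # Minimal sketch of tile coding for a scalar input in [0, 1).
    import numpy as np

    def tile_features(x, n_tilings=4, n_tiles=8):
        phi = np.zeros(n_tilings * n_tiles)
        for k in range(n_tilings):
            offset = k / (n_tilings * n_tiles)        # shift each tiling slightly
            idx = int((x + offset) * n_tiles) % n_tiles
            phi[k * n_tiles + idx] = 1.0              # one active tile per tiling
        return phi

    print(tile_features(0.37))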
-
Week 11
Lecture 1: Video, Slides.
Summary: Policy gradient methods; Policy gradient theorem; REINFORCE; REINFORCE with a baseline; Actor-critic methods; Batch RL; Experience replay; Fitted Q Iteration. A minimal REINFORCE sketch appears at the end of this week's entry.
Reading: Sections 13, 13.1, 13.2, 13.3, 13.4, 13.5, Sutton and Barto (2018), Kalyanakrishnan and Stone (2007).
References: Williams (1992), Lin (1992), Ernst et al. (2005).
Friday, October 24: Review and tutorial.
References: Silver et al. (2014), Mnih et al. (2015), Schulman et al. (2017).
Tuesday, October 28: Test (graded by Ishaan Chandra).
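The following sketch illustrates REINFORCE (without a baseline) on a one-step episodic task: a two-armed bandit with a softmax policy over action preferences. The reward means and step size below are made-up values for illustration.

    # Minimal sketch of REINFORCE with a softmax policy on a one-step task.
    import numpy as np

    rng = np.random.default_rng(0)
    means = np.array([0.2, 0.8])     # expected reward of each action (assumed)
    theta = np.zeros(2)              # action preferences; policy is softmax(theta)
    alpha = 0.1

    def policy(theta):
        p = np.exp(theta - theta.max())
        return p / p.sum()

    for episode in range(5000):
        pi = policy(theta)
        a = rng.choice(2, p=pi)
        G = float(rng.binomial(1, means[a]))   # return of this one-step episode
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0                  # gradient of log pi(a) w.r.t. theta
        theta += alpha * G * grad_log_pi       # REINFORCE update (no baseline)

    print(policy(theta))                       # should now favour the better action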
-
Week 12
Lecture 1: Video, Slides.
Summary: Uninformed search strategies, A* search, Game trees, Decision-time planning, UCT algorithm.
Reading: Chapter 3, Poole and Mackworth (2017); Sections 8, 8.1, 8.8, 8.9, 8.10, 8.11, Sutton and Barto (2018).
Reference: Kocsis and Szepesvári (2006).
Friday, October 31: Review and tutorial.
Tuesday, November 4: Test.
-
Week 13
Lecture 1: Video, Slides.
Summary: Multi-agent systems and multi-agent learning; Markov games as a framework for multi-agent reinforcement learning.
Reading: Littman, 1994.
References: Bowling and Veloso (2001); Shoham, Powers, and Grenager (2006).
Friday, November 7: Review and tutorial.
Assignments