This assignment tests your understanding of the regret minimisation algorithms discussed in class, and your ability to extend them to different scenarios. There are 3 tasks, which add up to 12 marks. To begin, in Task 1, you will implement UCB, KL-UCB, and Thompson Sampling, more or less identical to the versions discussed in class. Task 2 involves a variation of the multi-armed bandit problem in an interactive environment with rewards drawn from Poisson distributions. Task 3 involves optimising the computations of the KL-UCB algorithm.
All the code you write for this assignment must be in Python 3.9.6 or greater. You can install Python 3.9.6 for your system from here.
Your code will be tested using a python virtual environment. To set up the virtual environment, follow the instructions provided in virtual-env.txt, which is included in the compressed directory linked below.
This compressed directory has all the python files required. bernoulli_bandit.py defines the BernoulliBandit which, strictly speaking, you do not need to worry about. It is, however, advised that you read through the code to understand how the pull and other functions work. We have also provided simulator.py to run simulations and generate plots, which you'll have to submit as described later. Finally, there's autograder.py, which evaluates your algorithms on a fixed few bandit instances and outputs the score you would have received if we were evaluating your algorithms on those instances. The only files you need to edit are task1.py, task2.py, and task3.py. Do not edit any other files. You are allowed to comment/uncomment the final few lines of simulator.py, which you can use to generate the plots for Task 1 and Task 3. If you have implemented the bonus part, you can run the simulator for it as well and check the plots. For Task 2, simulator.py can be used to test your algorithm on custom testcases.
For evaluation, we will use another set of bandit instances in the autograder, and its score will be used as is for approximately 75% of the evaluation. If your code produces an error, it will directly receive a 0 score in this part. It will also receive 0 marks on any testcase for which the autograder takes over 5 minutes to run. The remaining part of the evaluation will be based on your report, which includes plots and an explanation of your algorithms. See the exact details below.
For task 1 and task 3, you can expect that the number of arms in the bandit instances used in our undisclosed test cases will be at most 40, and similarly the horizon at most 20,000. (Note that longer horizons are used in your plots, which might take some time to generate). For task 2 the means of the doors have positive values less than 3, and the number of doors is at most 30.
To test your implementation against the given testcases, run the autograder as follows: python3 autograder.py --task TASK. Here TASK can be one of 1, 2, 3, or all. If TASK is 1, then you also need to provide another argument: --algo ALGO, where ALGO can be one of ucb, kl_ucb, thompson, or all.
Your code will be evaluated using the python virtual environment. Testing your code using the activated python virtual environment on your machine should be sufficient for your code to work during evaluation.
During evaluation we will be running your submitted files with our autograder framework which is a superset of the one we have provided you. Hence, you must make sure that your submitted files are compatible with the original framework.
In this first task, you will implement the sampling algorithms: (1) UCB, (2) KL-UCB, and (3) Thompson Sampling. This task is straightforward based on the class lectures. The instructions below tell you about the code you are expected to write.
Read task1.py. It contains a sample implementation of epsilon-greedy for you to understand the Algorithm class. You have to edit the __init__ function to create any state variables your algorithm needs, and implement give_pull and get_reward. give_pull is called whenever it's the algorithm's decision to pick an arm, and it must return the index of the arm your algorithm wishes to pull (lying in 0, 1, ..., self.num_arms-1). get_reward is called whenever the environment wants to give feedback to the algorithm. It will be provided the arm_index of the arm and the reward seen (0/1). Note that the arm_index will be the same as the one returned by the give_pull function called earlier. For more clarity, refer to the single_sim function in simulator.py.
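To make the interface concrete, here is a minimal sketch of a hypothetical Algorithm subclass that simply plays the arm with the best empirical mean (it is not one of the required algorithms). The class name EmpiricalMeanGreedy and the assumed constructor signature (num_arms, horizon) are illustrative; mirror the epsilon-greedy sample in task1.py for the exact base-class interface.

```python
import numpy as np
from task1 import Algorithm  # base class defined in task1.py


class EmpiricalMeanGreedy(Algorithm):
    """Hypothetical example, not one of the required algorithms."""

    def __init__(self, num_arms, horizon):
        # assumed constructor signature; copy whatever the sample in task1.py uses
        super().__init__(num_arms, horizon)
        self.counts = np.zeros(num_arms)   # number of pulls per arm
        self.values = np.zeros(num_arms)   # empirical mean reward per arm

    def give_pull(self):
        # must return an arm index in 0, 1, ..., self.num_arms - 1
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))   # pull each arm at least once
        return int(np.argmax(self.values))       # then exploit the best empirical mean

    def get_reward(self, arm_index, reward):
        # arm_index matches the value returned by the preceding give_pull; reward is 0/1
        self.counts[arm_index] += 1
        self.values[arm_index] += (reward - self.values[arm_index]) / self.counts[arm_index]
```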
Once done with the implementations, you can run simulator.py to see the regrets over different horizons. Save the generated plots and add them to your report, with apt captioning. In your report, also describe any details which are relevant to your particular implementation, such as parameter values you have chosen/tuned, or a certain way in which you have coded some function. You may also run autograder.py to evaluate your algorithms on the provided test instances.
In this task you will interact with a stochastic environment, illustrated in the figure below.
You are in a room with multiple doors, and you must break through at least one door in order to move to the other side. Each door begins with a strength of 100. At every step, you may choose one door to strike. The damage dealt to a door is drawn from a Poisson distribution, with the mean value specific to that door. The door's remaining strength is then reduced by the realised damage.
For example, if the current strength of a door is \(H_i\) and its associated mean is \(\mu_i\), and you choose to strike this door, then the new strength is \(H_i - \alpha\), where \(\alpha \sim \text{Poisson}(\mu_i)\). The game terminates immediately once the strength of any door drops below zero.
A problem instance for this task is defined by an array of \(n\) positive real numbers, where \(n\) is the number of doors and each element represents the mean of the Poisson distribution for that door. The means of the doors have values in (0, 3) (that is, positive values less than 3), and the number of doors is at most 30. Your objective is to design a strategy that breaks through a door in the minimum expected number of strikes.
Develop a selection policy that minimises the expected number of strikes (steps) required for the episode to terminate. In other words, your policy should efficiently identify and focus on doors with higher expected damage, while accounting for the fact that the game ends as soon as any door's strength falls below zero.
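To make the dynamics concrete, here is a minimal sketch of one episode of the door environment, assuming numpy for Poisson sampling. The names (simulate_episode, mus, h0, policy) are purely illustrative and are not the interface used by task2.py or the autograder.

```python
import numpy as np


def simulate_episode(mus, policy, h0=100, rng=None):
    """Run one episode: strike doors until some door's strength drops below zero.

    mus    : list of Poisson means, one per door
    policy : callable taking (strengths, history) and returning a door index
    """
    rng = rng or np.random.default_rng()
    strengths = np.full(len(mus), float(h0))   # every door starts at strength 100
    history = []                                # (door, damage) pairs seen so far
    strikes = 0
    while np.all(strengths >= 0):               # game ends once any door goes below zero
        door = policy(strengths, history)
        damage = rng.poisson(mus[door])         # damage ~ Poisson(mu of the chosen door)
        strengths[door] -= damage
        history.append((door, damage))
        strikes += 1
    return strikes
```

For example, simulate_episode([0.5, 2.5], lambda strengths, history: 0) would repeatedly strike the first door and return the number of strikes taken. The actual environment and the StudentPolicy interface live in task2.py; this sketch is only meant to clarify the reward process.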
- task2.py: Contains the implementation of the Poisson environment. You must implement your algorithm by modifying the StudentPolicy class. By default, it contains a simple epsilon-greedy implementation.
- door_game.py: An interactive simulator where you can play the game manually or run your algorithm to visualise its behaviour. This file is provided only for exploration and analysis; it is not used for grading.
- simulator.py: Runs simulations of your policy in the environment. Need not be modified. To test on custom testcases, make changes in the main function.

The interactive simulator (door_game.py)
The simulator allows you to visualise your policy. It displays the doors, their remaining strength, average losses, and the total number of strikes. You can either play manually or run your StudentPolicy implementation to observe its decisions in real time.
- Make sure pygame is installed in your Python environment.
- Run python door_game.py to open the simulator window.
- The game starts with the mus and H0 values defined in the file. You may modify these to test different scenarios.
- Code Mode runs your StudentPolicy automatically and allows pausing and single-stepping through decisions.

| Key / Mouse | Action | Notes |
|---|---|---|
| C | Switch to Code Mode | Runs the game using your StudentPolicy. |
| M | Switch to Manual Mode | Allows you to select doors manually. |
| Space | Play / Pause (Code Mode) | Toggles continuous play when in Code Mode. |
| N | Single Step (Code Mode, paused) | Executes exactly one policy decision. |
| R | Reset Environment | Resets the current instance with the same parameters. |
| P | Load Sample Instance | Loads a predefined test case for quick experimentation. |
| Mouse Click | Strike a door (Manual Mode) | Applies a Poisson-distributed hit to the selected door. |
| Window Close | Quit | Closes the simulator. |
- To run your own algorithm, implement it in StudentPolicy and press C to switch to Code Mode. Start the simulation by pressing Space.
- To change the instance, edit the mus array in main. The length of this array determines the number of doors.

Your algorithm will be evaluated using the autograder. Each test case specifies a threshold for the maximum number of strikes allowed. For evaluation, the average number of strikes over 200 episodes is measured. Your policy must achieve an average below the threshold to pass the test case.
All test cases must complete execution within 5 minutes. This limit is intentionally generous; it ensures that algorithms do not run excessively long.
For Task 2, you only need to edit task2.py. The door_game.py file is provided exclusively for visualisation and debugging of your policy.
Read this page on Poisson distribution for insights that may help in designing your algorithm. You are also encouraged to explore the academic literature. Be sure to cite any references you have consulted in references.txt (see under "Submission").
In this task you will optimise the computations of the KL-UCB algorithm for multi-armed bandits. The goal is to reduce the computational overhead of KL-UCB while maintaining good regret performance. You will implement an optimised version of KL-UCB that achieves significant speedup.
For this task, you must edit the file task3.py; you need to implement one algorithm class there.
Key Requirements:
- Your optimised algorithm must stay within the regret threshold defined for the testcases in the testcases folder.
- Your optimised algorithm must achieve the required speedup, as defined for the testcases in the testcases folder, over the standard KL_UCB.

Hint For Optimisation: In a typical run of KL-UCB, there will be long sequences in which the same arm gets pulled before switching to another one. On a given decision-making step, can you decide to give an arm many pulls (say m >= 2) without consulting the data and re-calculating UCBs while performing these m pulls?
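As a sketch of this batching idea only: one can keep a queue of pre-committed pulls, so that give_pull returns the same arm several times before the indices are recomputed. The class name, the batch-size rule, and the exact form of the KL-UCB index below are all illustrative assumptions, and this naive version does not by itself answer the seed-preservation question that follows.

```python
import math
from collections import deque


def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


class BatchedKLUCBSketch:
    """Illustrative only; see task3.py for the actual class you must implement."""

    def __init__(self, num_arms, horizon):
        self.num_arms = num_arms
        self.horizon = horizon
        self.counts = [0] * num_arms
        self.successes = [0] * num_arms
        self.total_pulls = 0
        self.queue = deque()   # arm indices already committed to, pending their pulls

    def _kl_ucb_index(self, arm):
        # largest q with counts * KL(p_hat, q) <= ln(t), found by bisection (one standard form)
        if self.counts[arm] == 0:
            return float("inf")
        p_hat = self.successes[arm] / self.counts[arm]
        target = math.log(max(self.total_pulls, 2)) / self.counts[arm]
        lo, hi = p_hat, 1.0
        for _ in range(30):
            mid = (lo + hi) / 2
            if kl_bernoulli(p_hat, mid) > target:
                hi = mid
            else:
                lo = mid
        return lo

    def give_pull(self):
        if self.queue:
            return self.queue.popleft()          # reuse a committed decision; no recomputation
        indices = [self._kl_ucb_index(a) for a in range(self.num_arms)]
        best = max(range(self.num_arms), key=lambda a: indices[a])
        m = 1 + self.counts[best] // 10          # illustrative batch size; choosing m well is the task
        self.queue.extend([best] * (m - 1))      # commit to m pulls of this arm in total
        return best

    def get_reward(self, arm_index, reward):
        self.total_pulls += 1
        self.counts[arm_index] += 1
        self.successes[arm_index] += reward
```

How large the batch can be made while keeping the regret behaviour (and, for the bonus, the exact action-reward sequence) unchanged is precisely what you need to work out.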
Theoretical Question and Bonus Problem: Suppose we run a deterministic algorithm and also fix the random seed used to generate rewards from the bandit instance. Then on each run, we will get the same action-reward sequence a0, r0, a1, r1, a2, r2, .... Can you think of a computational optimisation of KL-UCB such that both algorithms (original and optimised) are guaranteed to produce the same action-reward sequence when the seed is fixed? In other words, if the run of KL-UCB picks arms 1, 3, 4, 2, 1 in the first five pulls and gets rewards of 0, 1, 1, 0, 1, so must your optimised variant of KL-UCB, which will typically run faster. In turn, this will mean that both algorithms have the same cumulative regret vs. horizon graph (except possibly for numerical differences on account of floating-point operations). How would you implement such an algorithm? Answer this question in your report; it is worth 1 mark. A bonus problem is also given in task3.py: if you have an answer to this optimisation question, you can implement it in the KL_UCB_Optimized_Bonus class in task3.py. The bonus is optional and will not affect your marks for Task 3 if you do not implement it.
You only need to implement the algorithm classes in task3.py and make changes to the code in task3.py only. For simulation, you can use simulator.py.
For Task 1, your report needs to have all the plots (UCB, KL-UCB, Thompson) that simulator.py generates (uncomment the final lines). There are 3 plots in total for Task 1. You may, of course, include any additional plots you generate. Your plots should be neatly labelled and captioned for the report. Also explain your code for the three algorithms.
For Task 2, your report should present a clear explanation of your overall approach for the problem with Poisson Doors, describe in detail the policy you implemented and the reasoning behind its design, and provide a well-argued justification of why your method achieves good performance in this setting, both in terms of theoretical soundness and practical effectiveness.
For Task 3, you need to explain your optimisation approach, provide a performance analysis comparing the standard and optimised algorithms, and discuss the techniques used to achieve speedup while maintaining regret optimality. You need to include in the report the plots generated after running simulator.py for Task 3.
You have to submit one tar.gz file with the name (your_roll_number).tar.gz. Upon extracting, it must produce a folder with your roll number as its name. It must contain a report.pdf (the report as explained above) and three code files: task1.py, task2.py, task3.py. You must also include a references.txt file if you have referred to any resources while working on this assignment (see the section on Academic Honesty on the course web page).
The assignment is worth 12 marks. For Task 1: 3 marks are for the code and will be evaluated by the autograder, and 1 mark is for the report. For each of Tasks 2 and 3: 3 marks are for the code and will be evaluated by the autograder, and 1 mark is for the report and the explanation of your algorithm.
For Task 1, we will use the following marking system: "FAILED" testcases contribute 0 marks, and "PASSED" testcases contribute 1 mark each; the total will then be normalised to the task's assigned marks depending on the number of testcases. The visible testcases will contribute 25% of the marks, and the remaining 75% will come from the hidden testcases that we will use once you have submitted the assignment.
Task 2 marks distribution:
Task 3 is worth 4 marks distributed as follows:
The bonus is worth 1 mark, for implementing an optimised KL-UCB algorithm that achieves the required speedup threshold and stays within the regret threshold as defined in the testcases. However, the maximum you can get in this assignment is 12 marks: even if you get full marks in all the tasks and also earn the bonus mark, you will still receive only 12 marks.
The autograder used for evaluation will use a different set of test instances from the ones provided. Note that runtime errors in the script will lead to 0 marks for that test instance. You will also be given 0 marks for a test instance if your code takes more than 5 minutes to run on it.
Your submission is due by 11.59 p.m., Thursday, September 11, 2025. Finish working on your submission well in advance, keeping enough time to test your code, generate your plots, compile the results, and upload to Moodle.
Your submission will not be evaluated (and will be given a score of zero) if it is not uploaded to Moodle by the deadline. Do not send your code to the instructor or TAs through any other channel. Requests to evaluate late submissions will not be entertained.
Your submission will receive a score of zero if your code does not execute using the given python virtual environment. To make sure you have uploaded the right version, download it and check after submitting (but well before the deadline, so you can handle any contingencies before the deadline lapses).
You are expected to comply with the rules laid out in the "Academic Honesty" section on the course web page, failing which you are liable to be reported for academic malpractice.