This assignment tests your understanding of the regret minimisation algorithms discussed in class, and your ability to extend them to different scenarios. There are three tasks, each worth 5 marks. To begin, in Task 1, you will implement UCB, KL-UCB, and Thompson Sampling, more or less identical to the versions discussed in class. Task 2 involves coming up with an algorithm for batched sampling. The idea is that at every decision-making step, the algorithm must specify an entire batch of arms to pull (for example, if the batch size is 100, it could be split as perhaps 25 pulls for arm 1, 55 pulls for arm 2, and 20 pulls for arm 3). All these pulls are performed and the results returned to the algorithm in aggregate before its next batch of pulls. In Task 3, you need to come up with an algorithm for the case when the horizon is equal to the number of arms, but it is given that the arm means are distributed uniformly (so if the horizon is 100, the arm means are a permutation of [0, 0.01, 0.02, ..., 0.99]). The theory developed in class applies meaningfully only when the horizon is large; how would you deal with shorter horizons? The fact that the bandit instance comes from a restricted family could possibly help.
All the code you write for this assignment must be in Python 3.8.10. The only libraries you may use are Numpy v1.21.0 (to work with vectors and matrices) and Matplotlib (for generating plots). All of these come installed with the docker image that has been shared for the course, and are already imported in the files you need to complete.
This compressed directory has all the files required. bernoulli_bandit.py defines the BernoulliBandit class, which, strictly speaking, you do not need to worry about. It is, however, advisable to read through the code to understand how the pull and batch_pull functions work. We have also provided simulator.py to run simulations and generate plots, which you'll have to submit as described later. Finally, there's autograder.py, which evaluates your algorithms on a fixed set of bandit instances and outputs the score you would have received if we were evaluating your algorithms on those instances. The only files you need to edit are task1.py, task2.py, and task3.py. Do not edit any other files. You are allowed to comment/uncomment the final few lines of simulator.py. It is strongly recommended that any code you write that generates random numbers fixes the seed of its generator. The consequence is that your code will produce identical results each time (so long as we call it from an outer loop whose random seed is also fixed). This is good practice when running experiments with programs.
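As a small illustration of the seeding advice above (the seed value and the Bernoulli parameter here are arbitrary choices, not part of the starter code), re-seeding a generator with the same value reproduces the exact same draws:

```python
import numpy as np

# Illustrative only: the seed 42 and p=0.7 are arbitrary choices,
# not values taken from the assignment's starter code.
rng = np.random.default_rng(seed=42)
draws_a = rng.binomial(n=1, p=0.7, size=5)   # five Bernoulli(0.7) samples

rng = np.random.default_rng(seed=42)         # same seed => same stream
draws_b = rng.binomial(n=1, p=0.7, size=5)

print((draws_a == draws_b).all())            # -> True: runs match exactly
```

This is exactly why a fixed seed inside your algorithm, combined with a fixed seed in the outer simulation loop, makes whole experiments reproducible.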
For evaluation, we will use another set of bandit instances in the autograder, and use their scores as is for 80% of the evaluation. For a particular test instance, the autograder's pass/fail criterion is whether your regret lies within 1.5 times the regret of our reference implementation. If your code produces an error, it will directly receive a score of 0 on the autograded tasks. It will also get 0 marks if, for any subtask whatsoever, the autograder takes over 20 minutes to run that subtask. The remaining 20% of the evaluation will be based on your report, which includes plots and explanations of your algorithms. See the exact details below.
In this first task, you will implement the sampling algorithms: (1) UCB, (2) KL-UCB, and (3) Thompson Sampling. This task is straightforward based on the class lectures. The instructions below tell you about the code you are expected to write.
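Of the three, KL-UCB is the one whose index requires a numerical step. As a hedged sketch (the constant c, the tolerance, and the function names are illustrative choices, not prescribed by the assignment), the Bernoulli KL-UCB index can be computed by binary search, using the fact that KL(p, q) is increasing in q on [p, 1]:

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q),
    # with clamping to avoid log(0).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=3, tol=1e-4):
    # Largest q in [mean, 1] with pulls * KL(mean, q) <= ln(t) + c*ln(ln(t)),
    # found by binary search; c=3 is one common (illustrative) choice.
    if t < 2:
        return 1.0
    target = (math.log(t) + c * math.log(math.log(t))) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid   # mid still satisfies the constraint; search higher
        else:
            hi = mid
    return lo
```

Note how the index shrinks toward the empirical mean as the arm accumulates pulls, which is the behaviour the regret analysis relies on.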
Read task1.py. It contains a sample implementation of epsilon-greedy for you to understand the Algorithm class. You have to edit the __init__ function to create any state variables your algorithm needs, and implement give_pull and get_reward. give_pull is called whenever it is the algorithm's turn to pick an arm, and it must return the index of the arm your algorithm wishes to pull (lying in 0, 1, ..., self.num_arms-1). get_reward is called whenever the environment wants to give feedback to the algorithm: your code can use this feedback to update the data structures maintained by your agent. It will be provided the arm_index of the arm and the reward seen (0/1). Note that the arm_index will be the same as the one returned by the earlier call to give_pull. For more clarity, refer to the single_sim function in simulator.py.
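To make the give_pull/get_reward interface concrete, here is a hedged sketch of how UCB could fit it. The base-class wiring is simplified (the real Algorithm class lives in task1.py), so treat this as an illustration under those assumptions rather than a drop-in solution:

```python
import math
import numpy as np

# Sketch only: the real task1.py class should subclass its Algorithm base.
class UCB:
    def __init__(self, num_arms, horizon):
        self.num_arms = num_arms
        self.counts = np.zeros(num_arms)   # pulls per arm
        self.values = np.zeros(num_arms)   # empirical means
        self.t = 0                         # total pulls so far

    def give_pull(self):
        # Round-robin once through all arms before trusting the UCB index.
        if self.t < self.num_arms:
            return self.t
        ucb = self.values + np.sqrt(2 * math.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def get_reward(self, arm_index, reward):
        self.t += 1
        self.counts[arm_index] += 1
        n = self.counts[arm_index]
        # Incremental update of the empirical mean.
        self.values[arm_index] += (reward - self.values[arm_index]) / n
```

KL-UCB and Thompson Sampling fit the same skeleton: only the arm-selection rule inside give_pull changes (KL-UCB index, or a sample from each arm's Beta posterior).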
Once done with the implementations, you can run simulator.py to see the regrets over different horizons. Save the generated plot and add it to your report, with suitable commentary (ideally 4-5 lines describing what you see, what issues you faced, and any surprising patterns). You may also run autograder.py to evaluate your algorithms on the provided test instances.
We now describe in full the problem statement that was briefly discussed above. The algorithm is given the number of arms of the Bernoulli bandit num_arms, a horizon, and a batch_size. The give_pull function will be called horizon/batch_size times (this ratio can be assumed to be an integer). In every call, it must return the next batch_size pulls the algorithm wishes to make. The function must do so in a specific format: it has to return two lists, one containing the arm indices that it wishes to pull, and the other containing the number of times each of those indices must be pulled. For example, in a 10-armed bandit instance with batch_size 20, a possible return value of the give_pull function could be ([2, 4, 9], [10, 4, 6]). Note that your function should generalize to arbitrary batch sizes, as long as the given batch_size is a factor of the horizon (the batch_size could be just 1, or it could be of the order of the horizon). The autograder/simulator will proceed to pull these arms according to their respective counts, and then provide feedback to the get_reward function. The feedback is provided as a dict whose keys are the arm indices and whose values are numpy arrays of the 0s and 1s that were seen. So a possible input (arm_rewards) to the get_reward function for the above batch pull could be {2: np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1]), 4: np.array([1, 1, 0, 0]), 9: np.array([0, 1, 0, 1, 0, 0])}. Again, you can read single_batch_sim for more clarity. The regret is calculated over all the pulls over the horizon.
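To illustrate the batched interface (not the reference solution — the allocation policy below is a deliberately naive assumption), here is a sketch that spreads the first batch evenly and then commits each subsequent batch to the arm with the highest UCB index:

```python
import numpy as np

# Sketch only: a naive batched UCB variant, to show the I/O format of
# give_pull (two lists) and get_reward (dict of arm -> reward array).
class BatchedUCB:
    def __init__(self, num_arms, horizon, batch_size):
        self.num_arms = num_arms
        self.batch_size = batch_size
        self.counts = np.zeros(num_arms)
        self.values = np.zeros(num_arms)
        self.t = 0

    def give_pull(self):
        if self.t == 0:
            # First batch: spread pulls as evenly as possible over all arms.
            base = self.batch_size // self.num_arms
            alloc = [base] * self.num_arms
            for i in range(self.batch_size - base * self.num_arms):
                alloc[i] += 1
            arms = [i for i in range(self.num_arms) if alloc[i] > 0]
            return arms, [alloc[i] for i in arms]
        # Later batches: give the whole batch to the current UCB leader.
        bonus = np.sqrt(2 * np.log(self.t) / np.maximum(self.counts, 1))
        best = int(np.argmax(self.values + bonus))
        return [best], [self.batch_size]

    def get_reward(self, arm_rewards):
        # arm_rewards: {arm_index: np.array of 0/1 rewards for that arm}
        for arm, rewards in arm_rewards.items():
            n_old = self.counts[arm]
            n_new = n_old + len(rewards)
            self.values[arm] = (self.values[arm] * n_old + rewards.sum()) / n_new
            self.counts[arm] = n_new
            self.t += len(rewards)
```

Putting all batch_size pulls on one arm wastes information when batches are large; your own allocation within a batch (and its justification) is precisely what the report asks you to discuss.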
Once done with your implementation, you can run simulator.py to see the regrets for a fixed horizon over different batch sizes. Save the generated plot and add it to your report, with apt captioning. Take 4-5 lines (or more) to explain the trend you see, and justify your choice of the distribution of pulls within a batch. You may also run autograder.py to evaluate your algorithms on the provided test instances.
This task involves dealing with a bandit instance where the horizon is equal to the number of arms. So, for example, if there are 100 arms, then you are only allowed to pull 100 times. However, you are given that the arm means are distributed regularly (in arithmetic progression) between 0 and (1 - 1/numArms).
You need to come up with an algorithm to handle this situation effectively: can you do better than sampling each arm once? Implement your algorithm by editing the task3.py file. The APIs you need to implement are essentially the same as in task1.py.
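As one illustration of the kind of idea that beats sampling each arm once (a heuristic sketch under assumed names, not the reference solution): since the arm means form a known grid, an arm that keeps returning 1 is likely to have a high mean, so committing to a streaking arm early can outperform spreading pulls uniformly:

```python
import numpy as np

# "Stay with a winner, abandon on a loss" — a deliberately simple heuristic
# for the horizon == num_arms setting. Class and attribute names here are
# illustrative assumptions, not taken from the task3.py starter code.
class FollowWinner:
    def __init__(self, num_arms, horizon):
        self.num_arms = num_arms
        self.rng = np.random.default_rng(0)   # fixed seed for reproducibility
        self.current = int(self.rng.integers(num_arms))

    def give_pull(self):
        return self.current

    def get_reward(self, arm_index, reward):
        if reward == 0:
            # Abandon this arm and jump to a fresh random one (which may
            # occasionally revisit an arm; kept simple for illustration).
            self.current = int(self.rng.integers(self.num_arms))
```

Whether and how to use the exact grid structure of the means (e.g. to rule out arms whose empirical record is inconsistent with a high mean) is the interesting part of this task.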
Once again, you can use simulator.py to see regrets as a function of horizon. You may also run autograder.py to evaluate your algorithm. Note that even if your algorithm passes the autograder tests, it might fail on the undisclosed tests used for evaluation, so you must not hardcode your method to work only on the given test instances. For this task, you will again plot regret against horizon (which is the same as the number of arms).
Your report needs to include all the plots that simulator.py generates. There are 5 plots in total (3 for task 1 and 1 each for tasks 2 and 3). You do not need to include the epsilon-greedy plot in your report. In addition, you need to explain your method for each task. For task 1, explain your code for the three algorithms; for tasks 2 and 3, explain your approach to the problem and justify the trend seen in the plots.
You have to submit one tar.gz file with the name (roll_number).tar.gz. Upon extracting, it must produce a folder with your roll number as its name. It must contain report.pdf (the report as explained above) and three code files: task1.py, task2.py, and task3.py. You must also include a references.txt file if you have referred to any resources while working on this assignment (see the section on Academic Honesty on the course web page).
The assignment is worth 15 marks. For every task, 4 marks are for the code and will be evaluated by the autograder, and 1 mark is for the report. The autograder used for evaluation will use a different set of test instances from the ones provided. Note that a runtime error on a test instance will lead to 0 marks for that instance, as will taking more than 20 minutes to run on it.
For tasks 1 and 2, you can expect that the number of arms in the bandit instances used in our undisclosed test cases will be at most 30, and similarly the horizon at most 10,000. (Note that longer horizons are used in your plots, which might take some time to generate). For Task 2, the batch size could be any legal number (that is, which perfectly divides the horizon) between 1 and the horizon. In Task 3, the horizon (equal to the number of arms) will be at most 10,000.
To test your implementation against the given test cases, run the autograder as follows: python3 autograder.py --task TASK. Here TASK can be one of 1, 2, 3, or all. If TASK is 1, you also need to provide another argument: --algo ALGO, where ALGO can be one of ucb, kl_ucb, thompson, or all.
Your submission is due by 11.55 p.m., Saturday, September 10. Finish working on your submission well in advance, keeping enough time to test your code, generate your data, compile the results, and upload to Moodle.
Your submission will not be evaluated (and will be given a score of zero) if it is not uploaded to Moodle by the deadline. Do not send your code to the instructor or TAs through any other channel. Requests to evaluate late submissions will not be entertained.
Your submission will receive a score of zero if your code does not execute on the cs747 docker container. To make sure you have uploaded the right version, download it and check after submitting (but well before the deadline, so you can handle any contingencies before the deadline lapses).
You are expected to comply with the rules laid out in the "Academic Honesty" section on the course web page, failing which you are liable to be reported for academic malpractice.