In this assignment, you will implement and compare different algorithms for sampling the arms of a stochastic multi-armed bandit. Each arm provides i.i.d. rewards from a fixed distribution. The objective is to minimise regret. The algorithms you will implement are epsilon-greedy exploration, UCB, KL-UCB, and Thompson Sampling.
This is a relatively straightforward assignment, which essentially puts to practice the algorithms we have discussed in class. The only departure from the class discussion is in the reward distributions of the arms. In class we assumed these distributions to be Bernoulli: that is, only generating 0's and 1's as rewards. In this assignment, the rewards can be arbitrary real numbers that lie in the interval [0, 1]. This is not a very significant difference, but one that you must handle in your algorithms anyway.
The server
directory comprises the code to
simulate a bandit, in other words the "environment". The server waits
for a client ("agent") to connect with it and start pulling arms in
sequence. For each pull, the server generates a reward in [0, 1] based
on the true mean of the arm pulled (known to the server, but unknown
to the client), which is communicated back to the agent. Each
communication from the server also contains the total number of pulls
performed so far. Thus, each server message is of the form
"reward,pulls". Agent-server communication happens through a TCP
connection. The server is coded in C++, and compiles into a binary
called bandit-environment
on running
make
. It can be started with a shell script
called startserver.sh
.
The data
directory contains six bandit instances,
each with either 5 or 25 arms, and each with the reward
distributions of its arms being bernoulli
,
histogram
, or
betaDistribution
(all arms in a bandit instance have
the same family of distributions). The data provided in the
instance files, along with the code that consumes them in the
server, should make it clear exactly how stochastic rewards are
generated in each of these cases. Do create and experiment with
other instances for testing your algorithms.
The client
directory is provided to you as an
example of what your submission must achieve. The client is the agent
that implements sampling algorithms. The agent provided to you merely
samples the arms in a round-robin fashion: you will have to implement
more efficient sampling algorithms. Observe that the command line
parameters to the client include --algorithm
(which can
take
values epsilon-greedy
, UCB
, KL-UCB
,
and Thompson-Sampling
(all case-sensitive)),
and --epsilon
(which will be a real number in [0,
1]). The agent provided is coded in C++: it compiles
using make
into a binary
called bandit-agent
.
Run startexperiment.sh
in the top level directory to
run an experiment. It will start the server and the client in
sequence. The server writes out a log file based on its interaction
with the client; the per-step reward and regret are available from
this log file. Run with different algorithms, random seeds, bandit
instances, and horizons to get familiar with the space of
experiments.
You will submit two items: (1) working code for your agent, which implements different algorithms, and (2) a report containing graphs of regret measured at different horizons, as well as your observations from the experiments.
You must submit a directory called client
, which contains
all the source and executable files of your agent. The directory must
contain a script named startclient.sh
, which must take
in all the command line arguments that are accepted by the example
client provided. You are free to build upon the provided C++ agent, or
otherwise implement an agent in any programming language of your
choice. The hostname and port number should suffice for setting up a TCP
connection with the server.
Your code will be tested by running an experiment that calls
startclient.sh
in your directory. Before you submit, make
sure you can successfully run startexperiment.sh
on
the sl2
machines (sl2-X.cse.iitb.ac.in
,
where X
is the machine number, for example 88).
For each of the six bandit instances provided in the
data
directory; each algorithm from
{epsilon-greedy
, UCB
, KL-UCB
, Thompson-Sampling
};
each horizon in {1000, 10000}, generate at least 100 runs by varying
the random seed. Run at least 5 versions
of epsilon-greedy
on each instance, with different values
of epsilon
, in order to demonstrate the effect of this
parameter. Create a separate plot for each bandit instance, with the
number of pulls (up to the horizon) on the x axis and the (expected
cumulative) regret on the y axis. Plot a curve corresponding to each
algorithm (thus, a total of 8 curves). Each point must be the average
over all runs for the corresponding setting of instance, horizon, and
algorithm. epsilon-greedy
Place your plots in a single pdf file
called report.pdf
, and along with the plots, note down
your observations about the results obtained. What patterns do they
display? Do your results validate your intuition about the effect of
epsilon? Did you notice anything unexpected? Do you have any ideas to
further improve performance?. The report should mention what changes
you made (if any) to account for non-Bernoulli rewards. It should also
include other relevant details about your implementation of the
algorithms (for example, if you tuned any parameters).
Place all the data generated from your experiments, as well
as report.pdf
, within a directory
called report
, and place this directory
inside client
.
If you have used any external code snippets or libraries in your code, make sure you provide references in your report. It is okay to use public code for parts of your agent such as the network communication module, or, say, libraries for random number generation and sorting. However, the logic used for sampling the arms must entirely be code that you have written.
In summary: you must submit your client
directory (compressed as
client.tar.gz
) through Moodle. The directory must contain
startclient.sh
, along with all the sources and
executables for the agent. The client
directory must also
contain a directory called report
, in which you provide
all the data generated by your experiments, and a file
called report.pdf
.
Your algorithms will be tested on different bandit instances and horizons to verify that they perform as expected. Their performance on the two provided bandit instances, as well as your accompanying report, will carry 6 marks. The performance of the algorithms on other unseen bandit instances will carry 4 marks. For each such (algorithm, instance, horizon) configuration, a large number of runs (say 100) will be conducted by varying the random seed passed to the server and agent. The average regret over these runs will be recorded. The number of bandit arms will be between 2 and 50; the horizon will be between 1 and 100,000.
The TAs and instructor may look at your source code and notes to corroborate the results obtained by your agent, and may also call you to a face-to-face session to explain your code.
Your submission is due by 11.55 p.m., Sunday, August 19. Finish working on your submission well in advance, keeping enough time to upload it to Moodle. Your submission will not be evaluated (and will be given a score of zero) if it is not uploaded to Moodle by the deadline.
Before submission, make sure that your code runs for a variety of
experimental conditions. Test your code on the sl2
machines even while you are developing it: do not postpone this step
to the last minute. If your code requires any special libraries to
run, it is your responsibility to get those libraries working
on the sl2
machines (go through
the CSE bug tracking
system to make a request to the system administrators). Make sure
that you upload the intended version of your code to Moodle (after
uploading, download your submission and test it on
the sl2
machines to make sure it is the correct version).
You will not be allowed to alter your code in any way after the
submission deadline. In short: your grade will be completely
determined by your submission on Moodle at the time of the
deadline. Play safe by having it uploaded and tested at least a few
hours in advance.
You must work alone on this assignment. Do not share any code (whether yours or code you have found on the Internet) with your classmates. Do not discuss the design of your agent with anybody else. Do not see anybody else's code or report.