In this assignment you will implement Howard's Policy
Iteration and Linear Programming to solve for an
optimal policy and the optimal value function of a
Markov Decision Problem (MDP). The first part of the
assignment requires you to implement the algorithms
and evaluate them on a set of test MDPs. In the second
part, you will use the algorithms you implemented to
solve a given MDP modelling problem. Lastly, you will
be required to write a report detailing your approach
and results.
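To make the two algorithms concrete, here is a minimal sketch of Howard's Policy Iteration, assuming (purely for illustration) that the MDP has been loaded into dense numpy arrays T and R of shape (S, A, S); for the larger instances (up to 10,000 states) you may prefer a sparse representation, and end states in episodic tasks may need special handling.

```python
import numpy as np

def evaluate_policy(T, R, gamma, pi):
    """Value of a fixed policy pi: solve the linear system (I - gamma * T_pi) V = R_pi."""
    S = T.shape[0]
    idx = np.arange(S)
    T_pi = T[idx, pi]                             # (S, S): row s is T[s, pi[s], :]
    R_pi = np.sum(T_pi * R[idx, pi], axis=1)      # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

def howard_policy_iteration(T, R, gamma):
    """Repeatedly evaluate the current policy and switch every improvable state."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)
    while True:
        V = evaluate_policy(T, R, gamma, pi)
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)   # (S, A) action values
        best = np.argmax(Q, axis=1)
        if np.allclose(Q[np.arange(S), best], Q[np.arange(S), pi]):
            return V, pi          # no state can be improved, so pi is optimal
        pi = best
```

For the LP mode, one standard formulation minimises the sum of the values subject to V(s) being at least the Bellman backup for every state-action pair. The sketch below uses the PuLP library; whether PuLP (or some other LP solver) is available in the provided virtual environment is an assumption you should check against virtual-env.txt.

```python
import numpy as np
import pulp

def lp_solve(T, R, gamma):
    """Minimise sum_s V(s) subject to V(s) >= sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    S, A, _ = T.shape
    prob = pulp.LpProblem("mdp_planning", pulp.LpMinimize)
    V = [pulp.LpVariable(f"V_{s}") for s in range(S)]
    prob += pulp.lpSum(V)
    for s in range(S):
        for a in range(A):
            prob += V[s] >= pulp.lpSum(
                float(T[s, a, s2]) * (float(R[s, a, s2]) + gamma * V[s2])
                for s2 in range(S) if T[s, a, s2] > 0
            )
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    V_star = np.array([v.value() for v in V])
    # Recover an optimal policy by acting greedily with respect to V*.
    Q = np.sum(T * (R + gamma * V_star[None, None, :]), axis=2)
    return V_star, np.argmax(Q, axis=1)
```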
This compressed directory
contains a data directory with sample data for both parts and
helper functions to visualise and test your code. Your code
will also be evaluated on instances other than the ones
provided. All the code you write for this assignment must be
in Python 3.9.6 (same version used in Programming Assignment
1). If necessary, you can install Python 3.9.6 for your system
from here. Your
code will be tested using a python virtual environment. To set
up the virtual environment, follow the instructions provided
in virtual-env.txt which is included in the compressed
directory linked above.
Given an MDP, your program must compute the optimal value function V* and an optimal policy π* by applying the algorithm that is specified through the command line. Create a python file called planner.py which accepts the following command-line arguments.
--mdp followed by a path to the input MDP file.
--algorithm followed by one of hpi and lp. You must assign a default value out of hpi and lp to allow the code to run without this argument.
--policy followed by a policy file, for which the value function V^π is to be evaluated.
The policy file has one line for each state, containing only a single integer giving the action.
python planner.py --mdp /home/user/temp/data/mdp-7.txt --algorithm hpi
python planner.py --mdp /home/user/mdpfiles/mdp-5.txt --algorithm lp
python planner.py --mdp /home/user/mdpfiles/mdp-5.txt
python planner.py --mdp /home/user/mdpfiles/mdp-5.txt --policy pol.txt
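A minimal sketch of the corresponding argument handling, assuming (as required above) that hpi acts as the default algorithm; the dispatch at the bottom is only indicated in comments, since it depends on how you organise the rest of your code.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--mdp", required=True, help="path to the input MDP file")
    parser.add_argument("--algorithm", default="hpi", choices=["hpi", "lp"],
                        help="planning algorithm (a default lets the flag be omitted)")
    parser.add_argument("--policy", default=None,
                        help="optional policy file; if given, evaluate V^pi for this policy")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # Sketch of the dispatch: load the MDP from args.mdp, then either evaluate the
    # policy in args.policy (if provided) or run the algorithm named in args.algorithm,
    # and finally print the S output lines to standard output.
```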
In the mdp folder of the data directory, you are given eight MDP instances (four each
for continuing and episodic tasks).
A correct solution for each MDP is also given in the same folder, which you can use to test your code.
Each MDP is provided as a text file in the following format.
numStates S
numActions A
end ed1 ed2 ... edn
transition s1 ac s2 r p
transition s1 ac s2 r p
. . .
. . .
. . .
transition s1 ac s2 r p
mdptype mdptype
discount gamma
The number of states S and the number of actions A will be integers greater than 1. There will be at
most 10,000 states, and at most 100 actions.
Assume that the states are numbered 0, 1, ..., S - 1, and the actions are numbered 0, 1, ..., A - 1.
Each line that begins with "transition"
gives the reward and transition probability corresponding to a transition, where R(s1, ac, s2) = r and
T(s1, ac, s2) = p. Rewards can be positive,
negative, or zero. Transitions with zero probabilities are not specified. You can assume that there will
be at most 350,000 transitions with
non-zero probabilities in the input MDP. mdptype will be one of continuing and
episodic. The discount factor gamma
is a real number between 0 (included) and 1 (included). Recall that gamma is a part of the MDP:
you must not change it inside your solver!
Also recall that it is okay to use gamma = 1 in episodic tasks that guarantee termination; you
will find such an example among the ones given.
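A minimal parsing sketch for this format, storing only the non-zero transitions (the function name and the dictionary layout are just one possible choice):

```python
from collections import defaultdict

def parse_mdp(path):
    """Parse the MDP text format described above into plain Python containers."""
    mdp = {"end": [], "transitions": defaultdict(list)}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "numStates":
                mdp["S"] = int(parts[1])
            elif parts[0] == "numActions":
                mdp["A"] = int(parts[1])
            elif parts[0] == "end":
                mdp["end"] = [int(x) for x in parts[1:]]   # end states (episodic tasks)
            elif parts[0] == "transition":
                s1, ac, s2 = int(parts[1]), int(parts[2]), int(parts[3])
                r, p = float(parts[4]), float(parts[5])
                mdp["transitions"][(s1, ac)].append((s2, r, p))
            elif parts[0] == "mdptype":
                mdp["mdptype"] = parts[1]
            elif parts[0] == "discount":
                mdp["gamma"] = float(parts[1])
    return mdp
```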
To get familiar with the MDP file format, you can view and run generateMDP.py (provided in
the base directory), which is a python script
used to generate random MDPs. Specify the number of states and actions, the discount factor, type of MDP
(episodic or continuing), and the
random seed as command-line arguments to this file. Two examples of how this script can be invoked are
given below.
python generateMDP.py --S 2 --A 2 --gamma 0.90 --mdptype episodic --rseed 0
python generateMDP.py --S 50 --A 20 --gamma 0.20 --mdptype continuing --rseed 0
The output of your planner must be in the following format and written to standard output, for both
modes of operation. Note that, for policy
evaluation, V* will be replaced by V^π, and π* will be replaced by π.
V*(0) π*(0)
V*(1) π*(1)
.
.
.
V*(S - 1) π*(S - 1)
In the data/mdp directory provided, you will find output files corresponding to the MDP files, which
have solutions in the format above. Since your
output will be checked automatically, make sure you have nothing printed to stdout other than the S
lines as above in sequence. If the testing code
is unable to parse your output, you will not receive any marks.
Note:
Your output has to be written to the standard output, not to any file.
For values, print at least 6 places after the decimal point. Print more if you'd like, but 6
(xxx.123456) will suffice.
If your code produces output that resembles the solution files: that is, S lines of the form
value + "\t" + action + "\n"
or even
value + " " + action + "\n"
you should be okay. Make sure you don't print anything else.
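For instance, a small printing helper along these lines satisfies the format (the function name is illustrative):

```python
def print_solution(V, pi):
    # One line per state: value with 6 places after the decimal point, a tab, then the action.
    for v, a in zip(V, pi):
        print(f"{v:.6f}\t{a}")
```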
If there are multiple optimal policies, feel free to print any one of them. You are given a python
script to verify the correctness of your
submission format and solution: autograder.py. The following are a few examples that can
help you understand how to invoke this script.
python3 autograder.py --task 1 --> Tests the default algorithm set in planner.py on all
the MDP instances given to you in the data/mdp directory.
python3 autograder.py --task 1 --algorithm all --> Tests all the algorithms, as well as the default
algorithm, on all the MDP instances given to you in the data/mdp directory.
python3 autograder.py --task 1 --algorithm hpi --> Tests only the policy iteration algorithm on
all the MDP instances given to you in the data/mdp directory.
The script assumes that the data directory is located in the same directory as the script itself. Run the script to
check the correctness of your submission format.
Your code should pass all the checks written in the script. You will be penalised if your code does not
pass all the checks.
MDPs are general enough to model various sequential decision-making tasks. If we can solve MDPs, we can
thereby find optimal solutions to other tasks.
Here is one interesting example — think of it as a strategic twist on Blackjack. The goal is to collect
cards to get the
highest score possible, without letting the sum of your hand go over a fixed 'Threshold'. Go over, and you
bust! At each
step, you'll decide whether to risk drawing a new card, play it safe by swapping one, or lock in your
points and stop
the game. A special sequence of cards can also grant you a big bonus.
The agent's task is described in more detail for you below. Your aim is to first translate the given card
game specification into an MDP.
You will then use your solver from Task 1 to find an optimal policy for this MDP. Finally, the optimal policy is
plugged back into the specified card game to guide the agent to act optimally within it.
Formally, the agent must play a sequential card game using a deck of 26 cards: numbers 1–13, each appearing twice
(once as a ♥ heart and once as a ♦ diamond).
The card game is defined by the following specifications.
Card Game Specifications
For example, a game ends with a score of 38 if the agent chooses to Stop with a hand sum of 28 (which is less than the threshold of 30) and the hand contains the special sequence (5, 6, 7), earning the 10-point bonus.
Further examples (for a different game configuration):
| Player's Hand | Final Score |
|---|---|
| 3♥, 7♥, 7♦ (Bust) | 0 |
| 3♥, 4♦, 5♥, 6♦ (Bust) | 0 |
| 3♥, 6♦, 7♦ (Stop) | 3 + 6 + 7 = 16 |
| 3♦, 4♦, 5♥, 3♦ (Stop) | (3 + 4 + 5 + 3) + 4 (Bonus) = 19 |
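To make the scoring rule concrete, here is a small sketch of how the final score of a hand might be computed once the agent Stops or busts. It assumes, as the examples above suggest, that the bonus is earned whenever all three numbers of the special sequence appear in the hand, regardless of suit; the parameter values in the comments (limit 16, sequence (3, 4, 5), bonus 4) are only inferred from the table, not stated in the specification.

```python
def final_score(hand_values, limit, sequence, bonus):
    """hand_values: the card numbers in the agent's hand (suits do not affect the score)."""
    total = sum(hand_values)
    if total > limit:                                  # bust: the hand went over the threshold
        return 0
    score = total
    if all(x in hand_values for x in sequence):        # special sequence present -> bonus
        score += bonus
    return score

# Rows of the table above, with inferred parameters:
# final_score([3, 7, 7], 16, (3, 4, 5), 4)     -> 0    (bust)
# final_score([3, 6, 7], 16, (3, 4, 5), 4)     -> 16
# final_score([3, 4, 5, 3], 16, (3, 4, 5), 4)  -> 19
```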
encoder.py must take the card game specification file as a command-line argument --game_config and print to standard output an MDP in the format accepted by planner.py from Task 1.
The game specification file will resemble the files present in data/gamespec/.
However, the specification file will not have the agent's hand present within it,
whereas testcases (described below) will have the agent's hand
initialised with a specific set of cards.
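However you choose to formulate the game as an MDP, the encoder ultimately has to emit the Task 1 text format. A minimal sketch of that final printing step is below; how you enumerate states, actions, and transitions from the game configuration is entirely up to your formulation, and the function name and argument layout are only one possible choice.

```python
def write_mdp(num_states, num_actions, end_states, transitions, mdptype, gamma):
    """Print an MDP to standard output in the format read by planner.py.

    transitions: iterable of (s1, ac, s2, r, p) tuples with p > 0.
    """
    print(f"numStates {num_states}")
    print(f"numActions {num_actions}")
    print("end", *end_states)
    for s1, ac, s2, r, p in transitions:
        print(f"transition {s1} {ac} {s2} {r} {p}")
    print(f"mdptype {mdptype}")
    print(f"discount {gamma}")
```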
decoder.py must take two command-line arguments, --value_policy and --testcase.
The value-policy file is the output of the planner (run on the MDP produced by your encoder). The
testcase file follows the same format as the files in data/test/. Each test case begins with the game
configuration, followed by game instances with the agent's hand initialised to some set of cards (see examples in the
compressed directory).
Every testcase instance will only have valid hands, i.e., the sum of the agent's cards will be less than the bust limit.
If a card has already been drawn into the agent’s hand, it will no longer be present in the remaining deck.
The decoder must print integers to
standard output with each one on a new line, where each integer represents the optimal action the agent must
take in the
corresponding instance. If there are multiple optimal actions, print any one. The actions are numbered from
0 to 27.
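A rough sketch of the decoder's control flow under one possible design, in which the decoder reproduces the same hand-to-state indexing that the encoder used; read_testcase_hands and state_index_of are hypothetical helpers whose details depend entirely on your own encoding.

```python
import argparse

def read_value_policy(path):
    """Read the planner output: one 'value action' pair per state, in state order."""
    actions = []
    with open(path) as f:
        for line in f:
            if line.strip():
                actions.append(int(float(line.split()[1])))
    return actions

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--value_policy", required=True)
    parser.add_argument("--testcase", required=True)
    args = parser.parse_args()

    actions = read_value_policy(args.value_policy)
    # For each game instance in the testcase, map the hand to the state index used by
    # your encoder and print the corresponding action, one integer per line:
    # for hand in read_testcase_hands(args.testcase):   # hypothetical helper
    #     print(actions[state_index_of(hand)])          # hypothetical helper
```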
python3 autograder.py --task 2
While grading, your code will be run with a time limit of 2 minutes per testcase.
Sample game configuration files are present in the data/gameconfig/ directory. The file will have the following format:
The agent must select one of the following actions at each time step.
Note that each card game task is encoded into its own MDP; that is, you do not have to create a single MDP that works for all game instances. Think carefully about how to define states, actions, transitions, rewards, and values so that an optimal policy for the MDP will indeed result in the maximum final score.
You also have been given a GUI for the card game that can be run in two modes: Manual Play, where you make the decisions, or Automated Mode, which uses a pre-generated policy to play optimally. (Note: The GUI is for context and demonstration purposes only, and the assignment does not require any code submission related to it.)
To play the game manually, run the GUI script directly from your terminal. You will be prompted to enter the game settings in the window.
python3 gui.py
The automated mode uses the game_setup.py script to orchestrate the entire process. It
internally calls
encoder.py, planner.py, and decoder.py to generate an optimal
policy file for
all valid hands before launching the GUI to visualize the results.
Your primary task is to add a new command-line argument, --automate, to your
decoder.py
script to handle the full policy generation. This argument will accept a file path to a configuration
file, which
uses the same simple format as the one for encoder.py. When the --automate
flag is used,
your script must calculate the optimal action for every possible valid hand—that is,
any
combination of cards whose value sum is less than the limit specified in the config file. The complete,
resulting
policy, consisting of each unique hand state and its corresponding action, must be printed directly to
standard
output.
Required Output Format for --automate is as follows:
Each line of the output must contain a unique hand state, followed by " -> ", and then the corresponding integer action (0-27). The cards within each hand state must be sorted to ensure consistent output.
1H -> 1
1D -> 14
1H 1D -> 0
2H -> 2
2H 2D -> 15
... (and so on for all valid hands)
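One way to enumerate every valid hand for --automate is sketched below. The deck layout follows the game description (numbers 1-13, one Heart and one Diamond each), and the pruned recursion only extends hands whose sum stays below the limit; how a hand is mapped to its action (action_for, in the comment) depends on your own encoder-planner-decoder pipeline, and the sorting convention here simply mirrors the example output above (Heart before Diamond of the same value).

```python
# The 26-card deck: numbers 1-13, once as a Heart ("H") and once as a Diamond ("D").
DECK = [(v, s) for v in range(1, 14) for s in ("H", "D")]

def valid_hands(limit):
    """Yield every non-empty subset of DECK whose card values sum to less than limit."""
    def extend(start, hand, total):
        for i in range(start, len(DECK)):
            v, s = DECK[i]
            if total + v < limit:                 # prune: never grow a hand past the limit
                yield hand + [(v, s)]
                yield from extend(i + 1, hand + [(v, s)], total + v)
    yield from extend(0, [], 0)

def hand_to_str(hand):
    # Sort for a consistent representation; Hearts before Diamonds matches the example above.
    ordered = sorted(hand, key=lambda c: (c[0], 0 if c[1] == "H" else 1))
    return " ".join(f"{v}{s}" for v, s in ordered)

# for hand in valid_hands(limit):
#     print(f"{hand_to_str(hand)} -> {action_for(hand)}")   # action_for: hypothetical helper
```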
For more clarity on how this feature is used in the pipeline, you can refer to the
game_setup.py and
gui.py code.
python3 game_setup.py --limit 20 --bonus 10 --sequence 3 4 5 --initial_state 3H 4D 7H
--limit: The maximum sum your hand can reach before busting.
--bonus: The bonus points awarded for achieving the special sequence.
--sequence: The three consecutive numbers that make up the special sequence.
--initial_state: The specific set of cards the game should start with.
Prepare a short report.pdf file in which you describe your design decisions, assumptions, and
observations about the algorithms (if any) for Task 1. Also describe how you formulated the MDP for
Task 2, and note down any notable strategies or insights you discovered about the generated actions in Task 2.
Place all the files in which you have written code in a directory named submission. Tar and Gzip the directory to produce a single compressed file (submission.tar.gz). It must contain the following files.
planner.py
encoder.py
decoder.py
report.pdf
references.txt
5 marks are reserved for Task 1 (2 marks each for Howard's Policy Iteration and Linear Programming, and 1
mark for policy evaluation).
We shall verify the correctness by computing and comparing optimal policies for a large number of unseen
MDPs.
5 marks are allotted for the correctness of Task 2. We will run the encoder-planner-decoder sequence on
randomly generated card games and check whether the agent's actions are optimal in them.
2 marks are reserved for your report.
Your submission is due by 11.59 p.m., Sunday, October 12, 2025. Finish working on your submission well in
advance, keeping enough time to test your code,
compile the results, and upload to Moodle.
Your submission will not be evaluated (and will be given a score of zero) if it is not uploaded to
Moodle by the deadline. Do not send your code
to the instructor or TAs through any other channel. Requests to evaluate late submissions will not be
entertained.
Your submission will receive a score of zero if your code does not execute using the given python
virtual environment. To make sure you have
uploaded the right version, download it and check after submitting (but well before the deadline, so you
can handle any contingencies
before the deadline lapses).
You are expected to comply with the rules laid out in the "Academic Honesty" section on the course web
page, failing which you are liable to
be reported for academic malpractice.