In this assignment you will implement Howard's Policy
Iteration and Linear Programming to solve for an
optimal policy and the optimal value function of a
Markov Decision Problem (MDP). The first part of the
assignment requires you to implement the algorithms
and evaluate them on a set of test MDPs. In the second
part, you will use the algorithms you implemented to
solve a given MDP modelling problem. Lastly, you will
be required to write a report detailing your approach
and results.
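To make the two algorithms concrete, here is a minimal sketch of Howard's Policy Iteration, assuming (purely for illustration) that the MDP has been loaded into dense numpy arrays T and R of shape (S, A, S); for the larger instances (up to 10,000 states) you may prefer a sparse representation, and end states in episodic tasks may need special handling.

```python
import numpy as np

def evaluate_policy(T, R, gamma, pi):
    """Value of a fixed policy pi: solve the linear system (I - gamma * T_pi) V = R_pi."""
    S = T.shape[0]
    idx = np.arange(S)
    T_pi = T[idx, pi]                             # (S, S): row s is T[s, pi[s], :]
    R_pi = np.sum(T_pi * R[idx, pi], axis=1)      # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

def howard_policy_iteration(T, R, gamma):
    """Repeatedly evaluate the current policy and switch every improvable state."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)
    while True:
        V = evaluate_policy(T, R, gamma, pi)
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)   # (S, A) action values
        best = np.argmax(Q, axis=1)
        if np.allclose(Q[np.arange(S), best], Q[np.arange(S), pi]):
            return V, pi          # no state can be improved, so pi is optimal
        pi = best
```

For the LP mode, one standard formulation minimises the sum of the values subject to V(s) being at least the Bellman backup for every state-action pair. The sketch below uses the PuLP library; whether PuLP (or some other LP solver) is available in the provided virtual environment is an assumption you should check against virtual-env.txt.

```python
import numpy as np
import pulp

def lp_solve(T, R, gamma):
    """Minimise sum_s V(s) subject to V(s) >= sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    S, A, _ = T.shape
    prob = pulp.LpProblem("mdp_planning", pulp.LpMinimize)
    V = [pulp.LpVariable(f"V_{s}") for s in range(S)]
    prob += pulp.lpSum(V)
    for s in range(S):
        for a in range(A):
            prob += V[s] >= pulp.lpSum(
                float(T[s, a, s2]) * (float(R[s, a, s2]) + gamma * V[s2])
                for s2 in range(S) if T[s, a, s2] > 0
            )
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    V_star = np.array([v.value() for v in V])
    # Recover an optimal policy by acting greedily with respect to V*.
    Q = np.sum(T * (R + gamma * V_star[None, None, :]), axis=2)
    return V_star, np.argmax(Q, axis=1)
```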
This compressed directory
contains a data directory with sample data for both parts and
helper functions to visualise and test your code. Your code
will also be evaluated on instances other than the ones
provided. All the code you write for this assignment must be
in Python 3.9.6 (same version used in Programming Assignment
1). If necessary, you can install Python 3.9.6 for your system
from here. Your
code will be tested using a python virtual environment. To set
up the virtual environment, follow the instructions provided
in virtual-env.txt which is included in the compressed
directory linked above.
Given an MDP, your program must compute the optimal value function V* and an optimal policy π* by applying the algorithm that is specified through the command line. Create a python file called planner.py which accepts the following command-line arguments.
--mdp followed by a path to the input MDP file.
--algorithm followed by one of hpi and lp. You must assign a default value out of hpi and lp to allow the code to run without this argument.
--policy followed by a policy file, for which the value function V^π is to be evaluated.
The policy file has one line for each state, containing only a single integer giving the action.
python planner.py --mdp /home/user/temp/data/mdp-7.txt --algorithm hpi
python planner.py --mdp /home/user/mdpfiles/mdp-5.txt --algorithm lp
python planner.py --mdp /home/user/mdpfiles/mdp-5.txt
python planner.py --mdp /home/user/mdpfiles/mdp-5.txt --policy pol.txt
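A minimal sketch of the corresponding argument handling, assuming (as required above) that hpi acts as the default algorithm; the dispatch at the bottom is only indicated in comments, since it depends on how you organise the rest of your code.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--mdp", required=True, help="path to the input MDP file")
    parser.add_argument("--algorithm", default="hpi", choices=["hpi", "lp"],
                        help="planning algorithm (a default lets the flag be omitted)")
    parser.add_argument("--policy", default=None,
                        help="optional policy file; if given, evaluate V^pi for this policy")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # Sketch of the dispatch: load the MDP from args.mdp, then either evaluate the
    # policy in args.policy (if provided) or run the algorithm named in args.algorithm,
    # and finally print the S output lines to standard output.
```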
In the mdp folder of the data directory, you are given eight MDP instances (four each
for continuing and episodic tasks).
A correct solution for each MDP is also given in the same folder, which you can use to test your code.
Each MDP is provided as a text file in the following format.
numStates S
numActions A
end ed1 ed2 ... edn
transition s1 ac s2 r p
transition s1 ac s2 r p
. . .
. . .
. . .
transition s1 ac s2 r p
mdptype mdptype
discount gamma
The number of states S and the number of actions A will be integers greater than 1. There will be at
most 10,000 states, and at most 100 actions.
Assume that the states are numbered 0, 1, ..., S - 1, and the actions are numbered 0, 1, ..., A - 1.
Each line that begins with "transition"
gives the reward and transition probability corresponding to a transition, where R(s1, ac, s2) = r and
T(s1, ac, s2) = p. Rewards can be positive,
negative, or zero. Transitions with zero probabilities are not specified. You can assume that there will
be at most 350,000 transitions with
non-zero probabilities in the input MDP. mdptype will be one of continuing and
episodic. The discount factor gamma
is a real number between 0 (included) and 1 (included). Recall that gamma is a part of the MDP:
you must not change it inside your solver!
Also recall that it is okay to use gamma = 1 in episodic tasks that guarantee termination; you
will find such an example among the ones given.
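A minimal parsing sketch for this format, storing only the non-zero transitions (the function name and the dictionary layout are just one possible choice):

```python
from collections import defaultdict

def parse_mdp(path):
    """Parse the MDP text format described above into plain Python containers."""
    mdp = {"end": [], "transitions": defaultdict(list)}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "numStates":
                mdp["S"] = int(parts[1])
            elif parts[0] == "numActions":
                mdp["A"] = int(parts[1])
            elif parts[0] == "end":
                mdp["end"] = [int(x) for x in parts[1:]]   # end states (episodic tasks)
            elif parts[0] == "transition":
                s1, ac, s2 = int(parts[1]), int(parts[2]), int(parts[3])
                r, p = float(parts[4]), float(parts[5])
                mdp["transitions"][(s1, ac)].append((s2, r, p))
            elif parts[0] == "mdptype":
                mdp["mdptype"] = parts[1]
            elif parts[0] == "discount":
                mdp["gamma"] = float(parts[1])
    return mdp
```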
To get familiar with the MDP file format, you can view and run generateMDP.py (provided in
the base directory), which is a python script
used to generate random MDPs. Specify the number of states and actions, the discount factor, type of MDP
(episodic or continuing), and the
random seed as command-line arguments to this file. Two examples of how this script can be invoked are
given below.
python generateMDP.py --S 2 --A 2 --gamma 0.90 --mdptype episodic --rseed 0
python generateMDP.py --S 50 --A 20 --gamma 0.20 --mdptype continuing --rseed 0
The output of your planner must be in the following format and written to standard output, for both
modes of operation. Note that, for policy
evaluation, V* will be replaced by V^π, and π* will be replaced by π.
V*(0) π*(0)
V*(1) π*(1)
.
.
.
V*(S - 1) π*(S - 1)
In the data/mdp directory provided, you will find output files corresponding to the MDP files, which
have solutions in the format above. Since your
output will be checked automatically, make sure you have nothing printed to stdout other than the S
lines as above in sequence. If the testing code
is unable to parse your output, you will not receive any marks.
Note:
Your output has to be written to the standard output, not to any file.
For values, print at least 6 places after the decimal point. Print more if you'd like, but 6
(xxx.123456) will suffice.
If your code produces output that resembles the solution files: that is, S lines of the form
value + "\t" + action + "\n"
or even
value + " " + action + "\n"
you should be okay. Make sure you don't print anything else.
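For instance, a small printing helper along these lines satisfies the format (the function name is illustrative):

```python
def print_solution(V, pi):
    # One line per state: value with 6 places after the decimal point, a tab, then the action.
    for v, a in zip(V, pi):
        print(f"{v:.6f}\t{a}")
```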
If there are multiple optimal policies, feel free to print any one of them. You are given a python
script to verify the correctness of your
submission format and solution: autograder.py. The following are a few examples that can
help you understand how to invoke this script.
python3 autograder.py --task 1 --> Tests the default algorithm set in planner.py on all
the MDP instances given to you in the data/mdp directory.
python3 autograder.py --task 1 --algorithm all --> Tests all the algorithms, as well as the default
algorithm, on all the MDP instances given to you in the data/mdp directory.
python3 autograder.py --task 1 --algorithm hpi --> Tests only the policy iteration algorithm on
all the MDP instances given to you in the data/mdp directory.
The script assumes that the data directory is located in the same directory as the script itself. Run the script to
check the correctness of your submission format.
Your code should pass all the checks written in the script. You will be penalised if your code does not
pass all the checks.
MDPs are general enough to model various sequential decision-making tasks. If we can solve MDPs, we can
thereby find optimal solutions to other tasks.
Here is one interesting example — think of it as a strategic twist on Blackjack. The goal is to collect
cards to get the
highest score possible, without letting the sum of your hand go over a fixed 'Threshold'. Go over, and you
bust! At each
step, you'll decide whether to risk drawing a new card, play it safe by swapping one, or lock in your
points and stop
the game. A special sequence of cards can also grant you a big bonus.
The agent's task is described in more detail for you below. Your aim is to first translate the given card
game specification into an MDP.
You will then use your solver from Task 1 to find an optimal policy for this MDP. Finally, the optimal policy is
plugged back into the specified card game to guide the agent to act optimally within it.
Formally, the agent must play a sequential card game using a deck of 26 cards: numbers 1–13, each appearing twice
(once as a ♥ heart and once as a ♦ diamond).
The card game is defined by the following specifications.
Card Game Specifications
For example, a game ends with a score of 38 if the agent chooses to Stop with a hand sum of 28 (which is less than the threshold of 30) and the hand contains the special sequence (5, 6, 7), earning the 10-point bonus.
Further examples (for a different game configuration):
| Player's Hand | Final Score |
|---|---|
| 3♥, 7♥, 7♦ (Bust) | 0 |
| 3♥, 4♦, 5♥, 6♦ (Bust) | 0 |
| 3♥, 6♦, 7♦ (Stop) | 3 + 6 + 7 = 16 |
| 3♦, 4♦, 5♥, 3♦ (Stop) | (3 + 4 + 5 + 3) + 4 (Bonus) = 19 |
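To make the scoring rule concrete, here is a small sketch of how the final score of a hand might be computed once the agent Stops or busts. It assumes, as the examples above suggest, that the bonus is earned whenever all three numbers of the special sequence appear in the hand, regardless of suit; the parameter values in the comments (limit 16, sequence (3, 4, 5), bonus 4) are only inferred from the table, not stated in the specification.

```python
def final_score(hand_values, limit, sequence, bonus):
    """hand_values: the card numbers in the agent's hand (suits do not affect the score)."""
    total = sum(hand_values)
    if total > limit:                                  # bust: the hand went over the threshold
        return 0
    score = total
    if all(x in hand_values for x in sequence):        # special sequence present -> bonus
        score += bonus
    return score

# Rows of the table above, with inferred parameters:
# final_score([3, 7, 7], 16, (3, 4, 5), 4)     -> 0    (bust)
# final_score([3, 6, 7], 16, (3, 4, 5), 4)     -> 16
# final_score([3, 4, 5, 3], 16, (3, 4, 5), 4)  -> 19
```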
encoder.py must take the card game specification file as a command-line argument --game_config and print to standard output an MDP in the format accepted by planner.py from Task 1.
The game specification file will resemble the files present in data/gamespec/.
However, the specification file will not have the agent's hand present within it,
whereas testcases (described below) will have the agent's hand
initialised with a specific set of cards.
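However you choose to formulate the game as an MDP, the encoder ultimately has to emit the Task 1 text format. A minimal sketch of that final printing step is below; how you enumerate states, actions, and transitions from the game configuration is entirely up to your formulation, and the function name and argument layout are only one possible choice.

```python
def write_mdp(num_states, num_actions, end_states, transitions, mdptype, gamma):
    """Print an MDP to standard output in the format read by planner.py.

    transitions: iterable of (s1, ac, s2, r, p) tuples with p > 0.
    """
    print(f"numStates {num_states}")
    print(f"numActions {num_actions}")
    print("end", *end_states)
    for s1, ac, s2, r, p in transitions:
        print(f"transition {s1} {ac} {s2} {r} {p}")
    print(f"mdptype {mdptype}")
    print(f"discount {gamma}")
```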
decoder.py must take two command-line arguments, --value_policy and --testcase.
The value-policy file is the output of the planner (run on the MDP produced by your encoder). The
testcase file follows the same format as the files in data/test/. Each test case begins with the game
configuration, followed by game instances with the agent's hand initialised to some set of cards (see examples in the
compressed directory).
Every testcase instance will only have valid hands, i.e., the sum of the agent's cards will be less than the bust limit.
If a card has already been drawn into the agent’s hand, it will no longer be present in the remaining deck.
The decoder must print integers to
standard output with each one on a new line, where each integer represents the optimal action the agent must
take in the
corresponding instance. If there are multiple optimal actions, print any one. The actions are numbered from
0 to 27.
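A rough sketch of the decoder's control flow under one possible design, in which the decoder reproduces the same hand-to-state indexing that the encoder used; read_testcase_hands and state_index_of are hypothetical helpers whose details depend entirely on your own encoding.

```python
import argparse

def read_value_policy(path):
    """Read the planner output: one 'value action' pair per state, in state order."""
    actions = []
    with open(path) as f:
        for line in f:
            if line.strip():
                actions.append(int(float(line.split()[1])))
    return actions

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--value_policy", required=True)
    parser.add_argument("--testcase", required=True)
    args = parser.parse_args()

    actions = read_value_policy(args.value_policy)
    # For each game instance in the testcase, map the hand to the state index used by
    # your encoder and print the corresponding action, one integer per line:
    # for hand in read_testcase_hands(args.testcase):   # hypothetical helper
    #     print(actions[state_index_of(hand)])          # hypothetical helper
```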
python3 autograder.py --task 2
While grading, your code will be run with a time limit of 2 minutes per testcase.
Sample game configuration files are present in the data/gameconfig/ directory. The file will have the following format:
The agent must select one of the following actions at each time step.
Note that each card game task is encoded into its own MDP; that is, you do not have to create a single MDP that works for all game instances. Think carefully about how to define states, actions, transitions, rewards, and values so that an optimal policy for the MDP will indeed result in the maximum final score.
You also have been given a GUI for the card game that can be run in two modes: Manual Play, where you make the decisions, or Automated Mode, which uses a pre-generated policy to play optimally. (Note: The GUI is for context and demonstration purposes only, and the assignment does not require any code submission related to it.)
To play the game manually, run the GUI script directly from your terminal. You will be prompted to enter the game settings in the window.
python3 gui.py
The automated mode uses the game_setup.py script to orchestrate the entire process. It
internally calls
encoder.py, planner.py, and decoder.py to generate an optimal
policy file for
all valid hands before launching the GUI to visualize the results.
Your primary task is to add a new command-line argument, --automate, to your
decoder.py
script to handle the full policy generation. This argument will accept a file path to a configuration
file, which
uses the same simple format as the one for encoder.py. When the --automate
flag is used,
your script must calculate the optimal action for every possible valid hand—that is,
any
combination of cards whose value sum is less than the limit specified in the config file. The complete,
resulting
policy, consisting of each unique hand state and its corresponding action, must be printed directly to
standard
output.
Required Output Format for --automate is as follows:
Each line of the output must contain a unique hand state, followed by " -> ", and then the corresponding integer action (0-27). The cards within each hand state must be sorted to ensure consistent output.
1H -> 1
1D -> 14
1H 1D -> 0
2H -> 2
2H 2D -> 15
... (and so on for all valid hands)
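One way to enumerate every valid hand for --automate is sketched below. The deck layout follows the game description (numbers 1-13, one Heart and one Diamond each), and the pruned recursion only extends hands whose sum stays below the limit; how a hand is mapped to its action (action_for, in the comment) depends on your own encoder-planner-decoder pipeline, and the sorting convention here simply mirrors the example output above (Heart before Diamond of the same value).

```python
# The 26-card deck: numbers 1-13, once as a Heart ("H") and once as a Diamond ("D").
DECK = [(v, s) for v in range(1, 14) for s in ("H", "D")]

def valid_hands(limit):
    """Yield every non-empty subset of DECK whose card values sum to less than limit."""
    def extend(start, hand, total):
        for i in range(start, len(DECK)):
            v, s = DECK[i]
            if total + v < limit:                 # prune: never grow a hand past the limit
                yield hand + [(v, s)]
                yield from extend(i + 1, hand + [(v, s)], total + v)
    yield from extend(0, [], 0)

def hand_to_str(hand):
    # Sort for a consistent representation; Hearts before Diamonds matches the example above.
    ordered = sorted(hand, key=lambda c: (c[0], 0 if c[1] == "H" else 1))
    return " ".join(f"{v}{s}" for v, s in ordered)

# for hand in valid_hands(limit):
#     print(f"{hand_to_str(hand)} -> {action_for(hand)}")   # action_for: hypothetical helper
```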
For more clarity on how this feature is used in the pipeline, you can refer to the
game_setup.py and
gui.py code.
python3 game_setup.py --limit 20 --bonus 10 --sequence 3 4 5 --initial_state 3H 4D 7H
--limit: The maximum sum your hand can reach before busting.
--bonus: The bonus points awarded for achieving the special sequence.
--sequence: The three consecutive numbers that make up the special sequence.
--initial_state: The specific set of cards the game should start with.
Prepare a short report.pdf file in which you describe your design decisions, assumptions, and
observations about the algorithms (if any) for Task 1. Also describe how you formulated the MDP for
Task 2, and note down any notable strategies or insights you discovered about the generated actions in Task 2.
Place all the files in which you have written code in a directory named submission. Tar and Gzip the directory to produce a single compressed file (submission.tar.gz). It must contain the following files.
planner.py
encoder.py
decoder.py
report.pdf
references.txt
5 marks are reserved for Task 1 (2 marks each for Howard's Policy Iteration and Linear Programming, and 1
mark for policy evaluation).
We shall verify the correctness by computing and comparing optimal policies for a large number of unseen
MDPs.
5 marks are allotted for the correctness of Task 2. We will run the encoder-planner-decoder sequence on
randomly generated card games and check whether the agent's actions are optimal in them.
2 marks are reserved for your report.
Your submission is due by 11.59 p.m., Sunday, October 12, 2025. Finish working on your submission well in
advance, keeping enough time to test your code,
compile the results, and upload to Moodle.
Your submission will not be evaluated (and will be given a score of zero) if it is not uploaded to
Moodle by the deadline. Do not send your code
to the instructor or TAs through any other channel. Requests to evaluate late submissions will not be
entertained.
Your submission will receive a score of zero if your code does not execute using the given python
virtual environment. To make sure you have
uploaded the right version, download it and check after submitting (but well before the deadline, so you
can handle any contingencies
before the deadline lapses).
You are expected to comply with the rules laid out in the "Academic Honesty" section on the course web
page, failing which you are liable to
be reported for academic malpractice.