CS725 (Autumn 2021): Programming Assignment
This assignment is due by 11.55 pm on September 9, 2021. Late submissions will be allowed until 11.55 pm on September 10, 2021, with a 10% reduction in the overall marks.
Please read the following important instructions before getting started on the assignment.
- This is a group assignment. Please form groups of 3-5 students. One representative from the team can submit the final archive file (mentioned below) on Moodle.
- This assignment is entirely programming-based. The task is hosted on Kaggle. Click here for further instructions on how to access the Kaggle task.
- Go to the Kaggle site.
- Create a new login using your roll number. It is important that you use your roll number and nothing else, so that it appears with the correct identifying information on the Kaggle leaderboard.
- Details of the task are available here.
- Please contact the TAs if you need any help setting this up.
- Your final submission should be a .zip file or .tgz bundle of a directory organized exactly as described here and should be submitted on Moodle. Create the .tgz file using the command: tar -czf submission.tgz submission. The submission directory should be organized as follows:
submission/
+- LR.py
+- report.pdf (contains plots/numbers)
- If you are new to Python and would benefit from some practice, your TAs have put together a few simple questions that make use of the numpy and pandas libraries that you will need for the assignment. Contact Harsh if you have any questions in this section. Feel free to skip this and move directly to the main tasks below if you are already familiar with these Python libraries.
The Effectiveness of Linear Regression Models
Forest fires have become an increasingly common occurrence across the world. Most recently, raging forest fires in Turkey caused huge devastation. In this assignment, you will work with a dataset of fourteen different attributes describing fires in Australia. Given these features, your task is to predict a real-valued frp that refers to "Fire Radiative Power". More details about these data fields are available here.
As mentioned earlier, this task is hosted on Kaggle. The data sets train.csv, dev.csv and test.csv are available for download on the Kaggle page. Please note that for all subsequent questions that ask for losses or scores to be reported, you will lose points if we cannot reproduce the exact number ($\pm \epsilon$) you report when we run your code.
- Solving Linear Regression. Implement the (A) closed form/analytical solution and (B) an iterative solution using minibatch gradient descent for L2-regularized linear regression that minimizes the mean squared error (MSE) function.
For a data set, $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, of $N$ training samples where each $x_i \in \mathbb{R}^d$, the linear regression model is defined as: \[ f_{w}(x) = \mathbf{w}^T x \] and the batch gradient descent weight update rule for the $k^{\text{th}}$ element in $\mathbf{w}$, $w_k$, is: \[ w_k = w_k - \eta \frac{1}{N} \sum_{i=1}^N (f_{w}(x_i) - y_i)x_{i,k} \] where $\eta$ is the learning rate and $x_{i,k}$ is the value of the $k^{\text{th}}$ dimension in $x_i$.
In LR.py, once you complete the function definitions, the print statements on lines 256 and 272 will print the training and development set losses obtained using the analytical solution and the gradient descent solution, respectively. Within submission/report.pdf, report your best MSE losses on the development set obtained using the analytical solution and the gradient descent solution. Lines 263--267 in do_gradient_descent set hyperparameters to default values; you can tune them and set them to any values that you determine to be the best fit for the task. [5+5 points]
- Gradient descent stopping criteria. What stopping criterion did you use for convergence in gradient descent? Write this down in submission/report.pdf. [2 points]
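To make the two required solutions concrete, here is a minimal sketch of both the closed-form ridge solution and minibatch gradient descent on the MSE. The function names, argument names, and the $C \cdot N \cdot I$ regularization scaling are illustrative assumptions, not the exact conventions of LR.py; adapt them to the provided stubs.

```python
import numpy as np

def analytical_solution(X, y, C=0.0):
    """Closed-form L2-regularized least squares: w = (X^T X + C N I)^{-1} X^T y.
    (The C * N * I scaling is one common convention; match LR.py's.)"""
    N, d = X.shape
    return np.linalg.solve(X.T @ X + C * N * np.eye(d), X.T @ y)

def minibatch_gradient_descent(X, y, lr=0.01, C=0.0, batch_size=32, max_steps=1000):
    """Minibatch gradient descent on the (L2-regularized) MSE.
    Hyperparameter defaults here are placeholders to be tuned."""
    N, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)  # fixed seed so reported losses are reproducible
    for _ in range(max_steps):
        idx = rng.choice(N, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # gradient of mean squared error on the minibatch, plus the L2 term
        grad = Xb.T @ (Xb @ w - yb) / batch_size + 2 * C * w
        w -= lr * grad
    return w
```

With $C = 0$ and enough steps, the gradient descent solution should approach the analytical one; comparing the two is a useful sanity check before tuning.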
Extra Credit: Early stopping is a simple form of regularization adopted in gradient descent optimizations. The main idea is as follows: Save the model parameters every time performance on the development set improves. When the training completes, return these parameters instead of the parameters from the last epoch. As an extra credit task, implement early stopping as part of gradient descent. Note that there is a stub for early_stopping inside LR.py. Within submission/report.pdf, report MSE losses on dev.csv instances with and without the use of early stopping. [3 points]
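The idea above can be sketched as follows. The function name, signature, and patience-based termination are illustrative assumptions; adapt the logic to the early_stopping stub in LR.py.

```python
import numpy as np

def early_stopping_gd(X_train, y_train, X_dev, y_dev,
                      lr=0.01, max_epochs=500, patience=20):
    """Full-batch gradient descent with early stopping: keep the weights
    that achieved the best dev MSE, and stop once the dev MSE has not
    improved for `patience` consecutive epochs."""
    N, d = X_train.shape
    w = np.zeros(d)
    best_w, best_loss, since_best = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        grad = X_train.T @ (X_train @ w - y_train) / N
        w -= lr * grad
        dev_loss = np.mean((X_dev @ w - y_dev) ** 2)
        if dev_loss < best_loss:
            # dev performance improved: snapshot the parameters
            best_loss, best_w, since_best = dev_loss, w.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_w, best_loss  # parameters from the best epoch, not the last
```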
- Effect of regularization. Setting C to 0 in do_gradient_descent yields the unregularized least squares solution. Plot MSE on the dev.csv instances for different values of C (including C=0). Include this plot within submission/report.pdf. [4 points]
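One way to produce this plot is to sweep C and record the dev MSE for each value; a sketch using the closed-form solution for speed (you may equally well use your gradient descent solver) is shown below. The regularization scaling and the list of C values are assumptions; match LR.py's convention and pick a range that shows a clear trend.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt

def dev_mse_vs_C(X_tr, y_tr, X_dev, y_dev, Cs):
    """Dev-set MSE of the closed-form ridge solution for each value of C.
    (The C * N * I scaling is one common convention; use whichever
    convention do_gradient_descent assumes.)"""
    N, d = X_tr.shape
    losses = []
    for C in Cs:
        w = np.linalg.solve(X_tr.T @ X_tr + C * N * np.eye(d), X_tr.T @ y_tr)
        losses.append(np.mean((X_dev @ w - y_dev) ** 2))
    return losses

if __name__ == "__main__":
    # Illustrative synthetic data; replace with the real train/dev splits.
    rng = np.random.default_rng(0)
    X_tr = rng.normal(size=(500, 5)); w_true = rng.normal(size=5)
    y_tr = X_tr @ w_true + 0.5 * rng.normal(size=500)
    X_dev = rng.normal(size=(200, 5))
    y_dev = X_dev @ w_true + 0.5 * rng.normal(size=200)
    Cs = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
    losses = dev_mse_vs_C(X_tr, y_tr, X_dev, y_dev, Cs)
    plt.plot(Cs, losses, marker="o")
    plt.xlabel("C"); plt.ylabel("dev MSE")
    plt.savefig("c_vs_devmse.png")
```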
- Basis Functions. Implement two different basis functions that will be applied to your input features with the L2-regularized model and optimized using gradient descent. Add your implementations of the basis functions within get_features. Describe your choice of basis functions in submission/report.pdf and report the MSE on the development set samples using both basis functions. Implement these functions such that we can easily comment out this block of code for Question 1 above, which does not use any basis functions. Use comments to demarcate the block of code that must be commented out to extract features without invoking any basis functions. [4 points]
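Two common choices, sketched below, are polynomial features and Gaussian radial basis features. These particular functions, names, and defaults are illustrative assumptions, not requirements; any two reasonable basis functions are acceptable.

```python
import numpy as np

def polynomial_features(X):
    """Illustrative basis function 1: append element-wise squares,
    turning d features into 2d features."""
    return np.hstack([X, X ** 2])

def gaussian_features(X, centers=None, width=1.0):
    """Illustrative basis function 2: Gaussian radial basis values
    around a few reference points (here, the first rows of X)."""
    if centers is None:
        centers = X[:5]
    # squared distance from every sample to every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))
```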
- Training Plots. Create subsets of the training data containing the first 5000, 10000, 15000, 20000 instances in train.csv. Create a plot where the X-axis is the size of the training set (5000, 10000, 15000, 20000, full) and the Y-axis is the mean-square error on the dev.csv instances. A placeholder function plot_trainsize_losses has been defined for you to fill out for this question. You can use existing libraries (like matplotlib) to create this plot and add it to submission/report.pdf. [3 points]
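A possible shape for plot_trainsize_losses is sketched below; the signature and the unregularized least-squares fit are assumptions (the actual placeholder in LR.py may take no arguments and should reuse your own training code).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt

def plot_trainsize_losses(X_tr, y_tr, X_dev, y_dev,
                          sizes=(5000, 10000, 15000, 20000, None),
                          out="trainsize_vs_devmse.png"):
    """Train on the first n rows for each n in sizes (None = full set)
    and plot dev MSE against training-set size."""
    losses, xticks = [], []
    for n in sizes:
        Xn = X_tr if n is None else X_tr[:n]
        yn = y_tr if n is None else y_tr[:n]
        w = np.linalg.lstsq(Xn, yn, rcond=None)[0]  # unregularized fit for brevity
        losses.append(np.mean((X_dev @ w - y_dev) ** 2))
        xticks.append(len(Xn))
    plt.plot(xticks, losses, marker="o")
    plt.xlabel("training set size"); plt.ylabel("dev MSE")
    plt.savefig(out)
    return losses
```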
- Feature Importance. Of the fourteen features, which would you consider the most important and the least important for the frp prediction? Describe how you identified these two features in submission/report.pdf. [2 points]
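One simple heuristic you might use (there are others, such as ablating one feature at a time) is to fit on standardized features so that weight magnitudes are comparable, then rank features by |w|. This is a proxy for importance, not a definitive measure, and the function below is an illustrative sketch rather than part of LR.py.

```python
import numpy as np

def feature_importance_order(X, y):
    """Rank features by the magnitude of their least-squares weight
    after standardizing each column to zero mean and unit variance."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    w = np.linalg.lstsq(Xs, y, rcond=None)[0]
    # indices sorted from most to least important under this heuristic
    return np.argsort(-np.abs(w))
```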
- Climb the Leaderboard. The objective for this last part is to build the best possible linear regression model using any enhancements you like. Submit your target predictions for the instances in test.csv to Kaggle so that your roll number appears on the "Public Leaderboard" (and, eventually, on the "Private Leaderboard" after the assignment concludes). In your output file, make sure that you maintain the same order of the samples as they appear in test.csv; check the Kaggle page for a sample submission file. Top-scoring performers on the "Private Leaderboard" (with a suitable threshold determined after the submission date) will be awarded extra credit points. Describe your main innovations in submission/report.pdf. (The exact breakdown of the five points for this question will be announced soon.) [5 points]
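A minimal sketch of writing a prediction file that preserves test.csv's row order is shown below. The column names "id" and "frp" are assumptions; copy the exact header from the sample submission file on the Kaggle page.

```python
import pandas as pd

def write_submission(test_csv, preds, out="submission.csv"):
    """Write one prediction per row of test_csv, in the same order.
    Column names here are placeholders -- match the sample submission."""
    test = pd.read_csv(test_csv)
    assert len(preds) == len(test), "need exactly one prediction per test row"
    sub = pd.DataFrame({"id": range(len(test)), "frp": preds})
    sub.to_csv(out, index=False)
```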