In this assignment, you will perform Ridge and Lasso regression using the old-school method of gradient descent.
Standard Gradient Descent Algorithm
(Image source: https://cdn-images-1.medium.com/max/800/1*HrFZV7pKPcc5dzLaWvngtQ.png)
You are given the standard House Price Prediction Dataset. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the challenge is to predict the final price of each home. There are 1461 data points available, split into training data containing 1200 points and test data containing 261 points.
The first part of this assignment entails coding up the gradient descent algorithm from scratch and applying it to ridge regression. In the second part, you will learn a new type of gradient descent and apply it to Lasso regression.
House Pricing Dataset
(Image source: https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png)
The base code for this assignment is available in this compressed file. Below is the list of files present in the Lab5_base directory.
File Name | Description |
tasks.py | This file contains all the required functions that you need to complete. |
utils.py | This file contains all helper functions that you can use. |
autograder.py | This is used for testing your tasks 1, 2, 3, and 4. |
The data set is provided in the dataset directory.
You will only have to write code inside the following functions.
preprocess: This method takes as input the raw data extracted from the CSV file (strings, integers, floats, ...), as output by the read_data function, and returns the processed data (containing only numerical values). For categorical string data, use the one_hot_encode function; for numerical data, use the following normalisation.
$$ X_{\text{new},i} = \frac{X_{\text{old},i} - \mu(X_{\text{old}})}{\sigma(X_{\text{old}})} $$
ridge_grad_descent: This method takes as input the training data X and labels Y, along with hyperparameters such as the learning rate, max_iterations, and $\lambda$, and returns the trained weight vector as a numpy array. You can initialize the weights in any way. You have to call the function grad_ridge to get the error gradient and update the weight vector according to the ridge loss function. R(w) is the ridge objective function w.r.t. w, where w is the weight vector, x_i is the i-th input vector, y_i is the i-th output, and n is the total number of input points.
grad_ridge: This method takes as input the training input X, training output Y, $\lambda$, and the weight vector W, and returns $\frac{dR(w)}{dw}$.
k_fold_cross_validation: This method takes as input the training data X and training labels Y, the number of splits k, a list of lambdas, and the algorithm for which we want to tune the hyperparameter lambda to get the best possible test SSE. In this function, you will first split the data into k equal parts, then perform cross-validation to get the SSE score (on validation data) for each lambda. Plot this score against the lambdas and choose the lambda that minimizes it.
coord_grad_descent: This method takes as input the training data X and training output Y, along with hyperparameters such as max_iterations and lambda, and returns the trained weight vector as a numpy array, optimizing the Lasso loss function. L(w) is the lasso objective function w.r.t. w.
Differences between Ridge and Lasso regression
(Image source: https://qph.fs.quoracdn.net/main-qimg-2a88e2acc009fa4de3edeb51e683ca02.)
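For reference, a commonly used form of the two objectives R(w) and L(w) is sketched here. This is an assumption and not necessarily the exact form expected by the autograder; the scaling of each term (for example a factor of 1/n or 1/2) may differ in the base code.
$$ R(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_2^2 $$
$$ L(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_1 $$
Under this assumed form, the ridge gradient would be
$$ \frac{dR(w)}{dw} = -2 \sum_{i=1}^{n} (y_i - w^\top x_i)\, x_i + 2 \lambda w $$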
In Task 1, you will complete the preprocess function in the task.py file. To test your code, run the following command.
python3 autograder.py 1
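The normalisation and one-hot steps described for preprocess can be sketched as follows. This is only an illustration under stated assumptions: numeric columns are assumed to be already parsed as numbers with no missing values, and standardize_column, one_hot_sketch, and preprocess_sketch are hypothetical names (one_hot_sketch merely stands in for the one_hot_encode helper, whose actual signature is not shown here).

```python
import numpy as np

def standardize_column(col):
    """Z-score a numeric column: (x - mean) / std."""
    col = np.asarray(col, dtype=float)
    sigma = col.std()
    if sigma == 0:
        # A constant column carries no information; avoid dividing by zero.
        return np.zeros_like(col)
    return (col - col.mean()) / sigma

def one_hot_sketch(col):
    """Hypothetical stand-in for one_hot_encode: one 0/1 column per category."""
    labels = sorted(set(col))
    return np.array([[1.0 if v == lab else 0.0 for lab in labels] for v in col])

def preprocess_sketch(raw):
    """raw: 2-D object array whose columns are either all strings or all numbers."""
    blocks = []
    for j in range(raw.shape[1]):
        col = raw[:, j]
        if isinstance(col[0], str):
            blocks.append(one_hot_sketch(col))
        else:
            blocks.append(standardize_column(col).reshape(-1, 1))
    return np.hstack(blocks)

# Tiny made-up example: one categorical column and one numeric column.
raw = np.array([["A", 1.0], ["B", 2.0], ["A", 3.0]], dtype=object)
processed = preprocess_sketch(raw)
```

Here processed has one 0/1 column per category followed by the standardized numeric column, so every entry is numerical.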
In Task 2, you will complete the ridge_grad_descent and grad_ridge functions in the task.py file. You are free to change the default parameters such as learning rate, epsilon, and max_iter as you wish (as long as your code passes the autograder). To test your code, run the following command.
python3 autograder.py 2
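A minimal sketch of the gradient and the descent loop, assuming the ridge objective is the sum of squared errors plus $\lambda \lVert w \rVert_2^2$ (the scaling in the base code may differ) and that Y is a 1-D array. The names ridge_gradient_sketch and ridge_descent_sketch, the default values, and the stopping rule are illustrative and are not the signatures in the base code.

```python
import numpy as np

def ridge_gradient_sketch(X, Y, lam, w):
    """Gradient of sum((Y - Xw)^2) + lam * ||w||^2 with respect to w."""
    residual = Y - X @ w
    return -2.0 * X.T @ residual + 2.0 * lam * w

def ridge_descent_sketch(X, Y, lam=1.0, lr=1e-3, max_iter=5000, epsilon=1e-8):
    """Plain gradient descent; stops early once the gradient norm is tiny."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = ridge_gradient_sketch(X, Y, lam, w)
        w = w - lr * grad
        if np.linalg.norm(grad) < epsilon:
            break
    return w

# Synthetic data (made up for illustration, not the housing dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_gd = ridge_descent_sketch(X, Y, lam=1.0)
# Closed-form minimiser of the same assumed objective, used as a sanity check.
w_closed = np.linalg.solve(X.T @ X + 1.0 * np.eye(3), X.T @ Y)
```

Comparing w_gd with w_closed is a quick way to check a learning rate: if the step is too large the iterates diverge, and if it is too small they stop far from the closed-form solution.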
In Task 3, you will complete the k_fold_cross_validation function in the task.py file. To test your code, run the following command.
python3 autograder.py 3
Note: This is just a sanity check test (1 mark). You will get full marks for tuning lambda appropriately for both ridge and lasso regression. Report the hyperparameters used in observations.txt. Also, report the final SSE obtained on the test data (261 points). Use the plot_kfold function to plot SSE vs. lambda for the ridge and lasso parts, and save the plots as ridge_kfoldcv.png and lasso_kfoldcv.png. Also, explain in observations.txt how this plot helps you tune lambda.
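The cross-validation loop can be sketched roughly as below. Several things here are assumptions: the validation SSE is averaged over the k folds (the base code may sum it instead), the folds are taken in order without shuffling, and a closed-form ridge solver is used as the training routine purely to keep the example short. The names k_fold_sketch, sse, and closed_form_ridge are hypothetical.

```python
import numpy as np

def sse(X, Y, w):
    """Sum of squared errors of the linear model w on (X, Y)."""
    r = Y - X @ w
    return float(r @ r)

def k_fold_sketch(X, Y, k, lambdas, fit):
    """Return one validation score per lambda; fit(X, Y, lam) returns weights."""
    folds = np.array_split(np.arange(X.shape[0]), k)
    scores = []
    for lam in lambdas:
        total = 0.0
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            w = fit(X[train_idx], Y[train_idx], lam)
            total += sse(X[val_idx], Y[val_idx], w)
        scores.append(total / k)
    return scores

def closed_form_ridge(X, Y, lam):
    """Stand-in trainer: minimiser of sum((Y - Xw)^2) + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

# Synthetic data (made up for illustration, not the housing dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
Y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)

lambdas = [0.01, 1.0, 100.0, 10000.0]
scores = k_fold_sketch(X, Y, k=5, lambdas=lambdas, fit=closed_form_ridge)
best_lambda = lambdas[int(np.argmin(scores))]
```

Passing the training routine as an argument lets the same loop be reused for both the ridge and the lasso parts.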
In Task 4, you will complete the coord_grad_descent function in the task.py file. Make sure to write efficient code for this task (i.e., 2000 iterations should take less than a minute) so that cross-validation runs quickly. You are free to change the default parameter max_iter as you wish (as long as your code passes the autograder). To test your code, run the following command.
python3 autograder.py 4
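One common way to optimise a lasso objective is cyclic coordinate descent with soft-thresholding, sketched below. It assumes the objective is the sum of squared errors plus $\lambda \lVert w \rVert_1$ (so the threshold is $\lambda / 2$; a different scaling in the base code would change it) and that Y is a 1-D float array. The names soft_threshold and lasso_coordinate_sketch are hypothetical, and this is not necessarily the variant the assignment intends.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t, clipping at zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_sketch(X, Y, lam=1.0, max_iter=200):
    """Cyclic coordinate descent for sum((Y - Xw)^2) + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)   # squared norm of each column, computed once
    residual = Y - X @ w            # kept up to date instead of recomputed
    for _ in range(max_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue            # an all-zero column cannot affect the fit
            # Correlation of column j with the residual that excludes w[j].
            rho = X[:, j] @ residual + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

def lasso_objective(X, Y, lam, w):
    r = Y - X @ w
    return float(r @ r + lam * np.abs(w).sum())

# Synthetic data (made up for illustration, not the housing dataset).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Y = X @ np.array([3.0, 0.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)
w_lasso = lasso_coordinate_sketch(X, Y, lam=20.0)
```

Keeping the residual updated in place makes each coordinate update cost O(n) rather than O(nd), which matters when many iterations are run inside cross-validation.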
In Task 5, answer the following questions.
Is there something unusual about the Lasso solution compared to the Ridge solution? Explain why this would happen. Is using lasso advantageous compared to ridge? If so, how?
You are expected to work on this assignment by yourself. You may
not consult with your classmates or anybody else about their
solutions. You are also not to look at solutions to this assignment or
related ones on the Internet. You are allowed to use resources on the
Internet for programming (say to understand a particular command or a
data structure), and also to understand concepts (so a Wikipedia page
or someone's lecture notes or a textbook can certainly be
consulted). However, you must list every resource you have
consulted or used in a file named references.txt,
explaining exactly how the resource was used. Failure to list all
your sources will be considered an academic violation.
Your submission should contain the following four files.
task.py: Fill out the functions for all the tasks specified.
observations.txt: Report the parameters (learning rate, max_iteration, lambda) used for both lasso and ridge to obtain your minimum SSE on the test data. Also explain the plots, and make sure to answer Task 5 completely here.
ridge_kfoldcv.png and lasso_kfoldcv.png: The SSE vs. lambda plots for the ridge and lasso parts.
Remember to test your solution before submission using the autograder provided.
python3 autograder.py [task_num]
task_num can be one of {1, 2, 3, 4, all}.
Place these four files (and additionally references.txt) in a directory (la-5-rollno), and compress it to la5-rollno.tar.gz, where rollno is your roll number (for example, if your roll number is 12345678, the final compressed file would be la5-12345678.tar.gz). Upload it on Moodle under Lab Assignment 5.