The focus of this lab is k-means clustering. We will look at the vanilla algorithm, assess its performance, and develop some better variants. Finally, we will use clustering to classify the MNIST data set.
Some starter code and data sets have been provided for you here: kmeans.tar.gz. Uncompress this directory to find the following files.
kmeans.py
This file implements the k-means clustering algorithm. This is the only file in which you need to write code. You need to implement the following 7 functions in the file, under different tasks (described below).
distance_euclidean(p1, p2)
distance_manhattan(p1, p2)
initialization_kmeansplusplus(data, distance, k)
iteration_one(data, means, distance)
hasconverged(old_means, new_means, epsilon=1e-1)
iteration_many(data, means, distance, maxiter, epsilon=1e-1)
performance_SSE(data, means, distance)
Tip: Each function is labeled with the task number. If you are working on Task 1, search the file for task1.
After implementing the functions, you can execute your code as follows; the command-line options are explained below.
python kmeans.py --input datasets/flower.csv --iter 100 --epsilon 1e-3 --init forgy --dist euclidean --k 8
Argument | Type | Description
---|---|---
--seed | int | The RNG seed to be used.
--input | str | Data set filename.
--output | str | Output filename.
--iter | int | Maximum number of iterations of the k-means algorithm to perform (may stop earlier if convergence is achieved).
--epsilon | float | Minimum distance the cluster centroids must move between two consecutive iterations for the algorithm to continue.
--init | str | The initialisation algorithm to be used, from among {forgy, kmeans++}.
--dist | str | The distance metric to be used, from among {euclidean, manhattan}.
--k | int | The number of clusters to use.
--numexperiments | int | The number of experiments to run (each run uses a different random seed).
autograder.py
Execute this script as python autograder.py [task-number]. Running python autograder.py all will give you an estimate of your final grade. We will use our own test cases and judgement before finalising the grade.
mnistplot.py
This is a tool to visualise your MNIST cluster means, which you will need in Task 4. Execute it as python mnistplot.py mnist.txt.
datasets/*
This directory contains the data sets you will be using during this lab. Each line in the files is a comma-separated list of numbers, representing a point in d-dimensional Euclidean space.
100.csv, 1000.csv, and 10000.csv were sampled from multivariate Gaussians with diagonal covariance. They have data in , , and , respectively.
Submission details are at the bottom.
k-means clustering is an unsupervised learning algorithm, which aims to partition the given data set into k clusters.
Formally, the clustering problem it tries to solve can be stated as: given a set of observations $(x_1, x_2, \ldots, x_n)$, partition the observations into $k$ sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimise the within-cluster sum of squares (WCSS): the sum of squared distances from each point in a cluster to its cluster centre. In other words, its objective is to find

$$\operatorname*{arg\,min}_{S} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the mean of the points in $S_i$.
Finding the optimal solution to this problem is NP-hard for general n, d, and k. k-means clustering is an iterative procedure for this clustering problem that provably converges to a local optimum.
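One iteration of this procedure alternates an assignment step and an update step. The sketch below illustrates the idea; the helper name and the tuple-based point representation are illustrative assumptions, not the starter code's exact interface.

```python
def one_iteration(data, means, distance):
    """One Lloyd iteration: assign each point to its nearest mean,
    then recompute each mean as the average of its assigned points."""
    k = len(means)
    clusters = [[] for _ in range(k)]
    for point in data:
        # Assignment step: index of the nearest current mean.
        nearest = min(range(k), key=lambda i: distance(point, means[i]))
        clusters[nearest].append(point)
    new_means = []
    for i, cluster in enumerate(clusters):
        if cluster:
            # Update step: coordinate-wise average of the cluster.
            dim = len(cluster[0])
            new_means.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                   for d in range(dim)))
        else:
            # Empty cluster: keep the old mean unchanged.
            new_means.append(means[i])
    return new_means
```

Repeating this step until the means stop moving (or a maximum iteration count is hit) yields the full algorithm.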
Forgy initialisation is one of the simplest ways to initialise the k-means algorithm. The Forgy method randomly chooses k points from the data set and uses these as the initial means, so the initial means tend to be spread across the data. The code for this initialisation is already provided to you in the initialization_forgy function.
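A minimal sketch of Forgy initialisation, for intuition only (the function name here is hypothetical; the provided initialization_forgy may differ in detail):

```python
import random

def forgy_init(data, k, seed=None):
    """Forgy initialisation: sample k distinct data points uniformly
    at random and use them as the initial cluster means."""
    rng = random.Random(seed)          # seeded RNG for reproducibility
    return rng.sample(data, k)         # k distinct points from the data
```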
k-means clustering tries to locally minimise the sum of squared errors (SSE), where the error associated with each data point is its Euclidean distance from its cluster centre.
Implement all of the following 6 functions in kmeans.py.
distance_euclidean(p1, p2)
distance_manhattan(p1, p2)
iteration_one(data, means, distance)
hasconverged(old_means, new_means, epsilon)
iteration_many(data, means, distance, numiter, epsilon)
performance_SSE(data, means, distance)
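The distance metrics, the convergence test, and the SSE measure can be sketched as below. These are illustrative implementations under the assumption that points and means are tuples of floats; match them to the starter code's actual conventions.

```python
def euclidean(p1, p2):
    # Euclidean (L2) distance between two points of equal dimension.
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5

def manhattan(p1, p2):
    # Manhattan (L1) distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p1, p2))

def converged(old_means, new_means, epsilon=1e-1):
    # Converged when no centroid has moved more than epsilon.
    return all(euclidean(o, n) <= epsilon
               for o, n in zip(old_means, new_means))

def sse(data, means, distance):
    # SSE: squared distance from each point to its nearest mean.
    return sum(min(distance(p, m) for m in means) ** 2 for p in data)
```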
Test your code by running this command.
python kmeans.py --input datasets/flower.csv --iter 100 --epsilon 1e-3 --init forgy --dist euclidean --k 8 --seed $RANDOM
Try different values of k, and try both Euclidean and Manhattan distances.
Evaluation: Each correctly implemented function will fetch you marks. Test with python autograder.py 1.
Test your code on the following data sets.
datasets/100.csv: Use , numexperiments
datasets/1000.csv: Use , numexperiments
datasets/10000.csv: Use , numexperiments
Use epsilon 1e-2 and the Euclidean distance metric for every experiment. Here is an example.
python kmeans.py --epsilon 1e-2 --init forgy --dist euclidean --input datasets/100.csv --k 2 --numexperiments 100
Answer the following 2 questions in a file named solutions.txt.

1. Run the algorithm on datasets/garden.csv, with different values of k. Look at the performance plots and answer whether the SSE of the k-means clustering algorithm ever increases as the iterations are performed. [1 mark]

2. Open 3lines.png and mouse.png. Manually draw cluster boundaries around the 3 clusters visible in each file (no need to submit the hand-drawn clustering). Test the k-means algorithm with different random seeds on the data sets datasets/3lines.csv and datasets/mouse.csv. How does the algorithm's clustering compare with the clustering you did by hand? Why do you think this happens? [1 mark]

Evaluation: The text questions carry marks as specified. Make sure to write clean, succinct answers.
It is worth noting that k-means can sometimes perform poorly! Test your algorithm on the datasets/rectangle.csv data set several times, using . Depending on the initialisation, k-means can converge to poor clusterings.
The clustering performance of the k-means algorithm is sensitive to the initial cluster means. A poor initialisation can lead to arbitrarily poor clustering. The k-means++ initialisation algorithm addresses this problem; standard k-means clustering is performed after the initialisation step. With the k-means++ initialisation, the algorithm is guaranteed to find a solution that is competitive to the optimal k-means solution in expectation.
The intuition behind this approach is that spreading out the k initial cluster centres is a good idea: the first cluster centre is chosen uniformly at random from the data points, after which each subsequent cluster centre is chosen from the remaining data points with a probability proportional to the point's squared distance to the point's closest existing cluster centre.
The algorithm is as follows.
1. Choose the first cluster centre uniformly at random from the data points.
2. For each data point x, compute D(x), the distance from x to the nearest centre chosen so far.
3. Choose the next cluster centre from the data points, with the probability of choosing x proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centres have been chosen.
This seeding method yields considerable improvement in both the speed of convergence, and the final error of k-means.
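The seeding steps above can be sketched as follows. This is an illustrative implementation (function name and the roulette-wheel sampling details are assumptions), not a substitute for your own commented version in kmeans.py.

```python
import random

def kmeanspp_init(data, distance, k, seed=None):
    """k-means++ seeding: first centre uniform at random; each later
    centre sampled with probability proportional to D(x)^2, the squared
    distance from x to its nearest already-chosen centre."""
    rng = random.Random(seed)
    means = [rng.choice(data)]
    while len(means) < k:
        # D(x)^2 for every point.
        d2 = [min(distance(p, m) for m in means) ** 2 for p in data]
        total = sum(d2)
        if total == 0:
            # Degenerate case: every point coincides with a centre.
            means.append(rng.choice(data))
            continue
        # Roulette-wheel sampling proportional to D(x)^2.
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(data, d2):
            acc += w
            if acc > r:
                means.append(p)
                break
    return means
```

Note that a point already chosen as a centre has D(x)^2 = 0, so it is never selected again.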
Implement the following function in kmeans.py
.
initialization_kmeansplusplus(data, distance, k)
Note: You are expected to provide elaborate comments along with your code (for the function). Your marks depend on whether the TAs are able to understand your code and establish its correctness.
Test with python autograder.py 3.
Check your code by running the following command (this is for you to verify that the code works; it is not used for evaluation).
python kmeans.py --input datasets/flower.csv --epsilon 1e-2 --init kmeans++ --dist euclidean --k 8
After implementing your code, test it on these data sets.
datasets/100.csv: Use , numexperiments
datasets/1000.csv: Use , numexperiments
datasets/10000.csv: Use , numexperiments
Use epsilon 1e-2 and the Euclidean distance metric for every experiment.
Answer the following question in the file solutions.txt.
Evaluation: Correct implementation of the kmeans++ function will fetch you marks. The text question is worth marks.
Notice how:
Supervised machine learning techniques require a data set with a large number of training examples and labels. However, getting labels can be extremely expensive in practice. Unsupervised learning algorithms handle this issue by not requiring any labels at all. Of course, they will usually not be as effective as supervised learning procedures, when labels are indeed available.
Since k-means clustering is an unsupervised algorithm, we will now use it to classify the MNIST data set.
Procedure:
This procedure enables us to achieve a decent classification accuracy by hand-labeling only a few cluster means (as compared to thousands of images)!
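Once each cluster mean has been hand-labeled with a digit, classification reduces to a nearest-mean lookup. A minimal sketch, assuming labelled means are kept as (label, mean) pairs (this representation is an assumption, not part of the starter code):

```python
def classify(point, labelled_means, distance):
    """Return the label of the nearest hand-labelled cluster mean.

    labelled_means: list of (label, mean) pairs, one per cluster.
    """
    # Pick the pair whose mean is closest to the point; keep its label.
    label, _ = min(labelled_means, key=lambda lm: distance(point, lm[1]))
    return label
```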
Template File: kmeans.py
Data set: datasets/mnist.csv
Run your algorithm on the MNIST data set as follows.
python kmeans.py --input datasets/mnist.csv --iter 100 --epsilon 1e-2 --init kmeans++ --dist euclidean --k 10 --output mnist.txt
Plot the resulting cluster centres by executing this command.
python mnistplot.py mnist.txt
Look at the plots and pick a good mean for each of the 10 digits. Compile these means into the file mnistmeans.txt. You will have to run the clustering algorithm several times to get satisfactory means; vary the random seed to get different means. Naturally, you should not use the labels of the points in any way while clustering them: that is, your clustering algorithm should run on the entire (unlabeled) data set.
File format for mnistmeans.txt:
The i-th line must contain the cluster mean for the (i-1)-st digit.
Each mean is represented by a comma-separated list of floats.
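The format above can be produced with a few lines of Python; the helper name here is illustrative, not part of the starter code.

```python
def write_means(means, filename="mnistmeans.txt"):
    """Write one cluster mean per line, as comma-separated floats.

    Line i (1-indexed) should hold the mean chosen for digit i-1.
    """
    with open(filename, "w") as f:
        for mean in means:
            f.write(",".join(str(x) for x in mean) + "\n")
```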
Evaluation: Your cluster means will be used to classify the MNIST data set.
accuracy | marks |
---|---|
accuracy | |
accuracy | |
otherwise |
Test with python autograder.py 4.
In practice, you would want to choose multiple cluster means per class instead of just one, for increased accuracy.
Your submission directory must be named la9-[rollnumber]. It must contain the following files.
kmeans.py
solutions.txt
mnistmeans.txt (do not submit mnist.txt by mistake)
Do not include any other files. Compress your directory into la9-[rollnumber].tar.gz, and upload it to Moodle under Lab Assignment 9.