In this lab we will look at the vanilla k-means clustering algorithm.

k-means Algorithm:
1. Define the initial clusters' centroids. This step can be done using different strategies. A very common one is to assign random values for the centroids of all groups. Another approach, known as Forgy initialisation, is to use the positions of $k$ different points in the data set as the centroids.
2. Assign each entity to the cluster that has the closest centroid. In order to find the cluster with the closest centroid, the algorithm must calculate the distance between all the points and each centroid.
3. Recalculate the centroids. Each component of the centroid is updated, and set to be the average of the corresponding components of the points that belong to the cluster.
4. Repeat steps 2 and 3 until points no longer change cluster.

Note:
In step 2, if more than one centroid is closest to a data point, you can break the ties arbitrarily.
In step 3, if any cluster has no points in it, just pass on the value of the centroid from the previous step.
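The loop described in steps 2-4 (including both tie-breaking notes) can be sketched as follows. This is only an illustrative NumPy sketch, not the lab's required implementation; the names `kmeans_sketch`, `data`, and `max_iter` are assumptions, and the lab's own functions may be structured differently.

```python
import numpy as np

def kmeans_sketch(data, centroids, max_iter=100):
    """Illustrative k-means loop: assign points, update centroids, repeat."""
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Step 2: distance from every point to every centroid; argmin
        # breaks ties arbitrarily by taking the first closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: each centroid becomes the mean of its assigned points;
        # an empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Step 4: stop once no centroid moves.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Note how each iteration only needs the data, the current centroids, and a distance function, which is exactly the decomposition the lab's empty functions follow.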

Cluster Initialisation: Forgy
Forgy initialisation is one of the simplest ways to initialise the clustering algorithm. The Forgy method randomly chooses $k$ distinct points from the data set and uses their positions as the initial centroids. Because the chosen points are actual data points, every initial centroid lies within the data; note, however, that this does not guarantee the centroids are evenly spread out, which is why different random seeds can produce very different clusterings.
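Forgy initialisation can be sketched in a few lines (a minimal NumPy sketch, assuming a seedable random generator; the lab's `initialize_centroids(data, k)` may differ in signature and structure):

```python
import numpy as np

def forgy_init(data, k, seed=None):
    """Pick k distinct data points and use their positions as centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=k, replace=False)  # k distinct row indices
    return data[idx].copy()  # copy so later centroid updates don't touch the data
```

`replace=False` matters: sampling without replacement guarantees the $k$ initial centroids are distinct points.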

Measuring Performance: Sum Squared Error (SSE)
k-means clustering locally minimises the Sum Squared Error, where the error associated with each data point is its Euclidean distance from the centre of the cluster it is assigned to.
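Concretely, the SSE is the sum over all points of the squared Euclidean distance to the assigned (i.e. nearest) centroid. A sketch, assuming NumPy (the lab's `performance_SSE(data, centroids, distance)` additionally takes a distance function as a parameter):

```python
import numpy as np

def sse_sketch(data, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())
```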

Task: 
kmeans_clustering.py implements the k-means algorithm, but a few of its functions are left empty.
Implement all 5 of the following functions:

distance_euclidean(p1, p2)
initialize_centroids(data, k)
kmeans_iteration_one(data, centroids)
hasconverged(old_centroids, new_centroids, epsilon)
performance_SSE(data, centroids, distance)

Note: Only add code between TODO and END TODO


Testing and datasets:
You can test your code by running the command
`python3 kmeans_clustering.py ./datasets/<dataset-name>`
on each of the datasets present in ./datasets

This produces colour-coded cluster plots and performance plots.
Note: cluster plots are generated only for 2-dimensional datasets.

You can also pass several other parameters as command-line arguments to tune the learning:
`python3 kmeans_clustering.py ./datasets/<dataset-name> -k <number-of-clusters> -m <max-iterations> -e <epsilon-for-convergence-check> -s <seed-for-random-generator>`
Set k based on the data distribution and rerun.
For example, if the data visibly forms 4 clusters, then k=4 is the most suitable choice.

Note:
The results produced by this algorithm depend heavily on the initialisation of the centroids.
Run the following
`python3 kmeans_clustering.py ./datasets/4clusters.csv -s 0`
`python3 kmeans_clustering.py ./datasets/4clusters.csv -s 10`
You can see that the clustering is done properly in the former but not in the latter.
For the same reason, clustering is not done properly on 3lines.csv with the default seed,
whereas circles.csv cannot be clustered by vanilla k-means at all.
