Square error clustering methods

Next: Clustering by mixture decomposition Up: Partitional Clustering Previous: Partitional Clustering

Square error clustering methods

The most commonly used clustering strategy is based on the square-root error criterion.
Objective: To obtain a partition which, for a fixed number of clusters, minimizes the square-error where square-error is the sum of the Euclidean distances between each pattern and its cluster center.

Algorithm

Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilizes.
Generate a new partition by assigning each pattern to its closest cluster center.
Compute new cluster centers as the centroids of the clusters.
Repeat steps 2 and 3 until an optimum value of the criterion is found.
Adjust the number of clusters by merging and splitting existing clusters or by removing small or outlier clusters.

The algorithm converges when the criterion function cannot be improved.

Initial partition

Select k seed points at random or by taking the centroid as the first seed point and the rest at a certain miminimum distance from this seed point.
Cluster the remaining points to the closest seed point.

Updating a partition
K-means:

In each pass(cycle) make an assignment of all patterns to the closest cluster center.
Recompute the cluster center after every new assignment is made.

Adjusting the number of clusters

Clustering algorithms can create new clusters or merge existing ones if certain conditions specified by the user are met.
Split a cluster if it has too many patterns and an unusually large variance along the feature with large spread.
Merge if they are sufficiently close.
Remove outliers from future consideration. (outliers are pattern/patterns that is sufficiently far removed from the rest of the data and hence suspected as a mistake in data entry.)

Performance of square-error clustering methods

Seeks compact hyper-ellipsoidal clusters and this can produce misleading results when the data do not occur in compact, hyper-ellipsoidal boundaries.
Exhibit inadequacies when the Euclidean measure is used to measure distance but the features are not on comparable scales.

Next: Clustering by mixture decomposition Up: Partitional Clustering Previous: Partitional Clustering

Miranda Maria Irene
Thu Apr 1 15:43:18 IST 1999