The most commonly used clustering strategy is based on the square-root
error criterion.

*Objective:* To obtain a partition which, for a fixed number of clusters, minimizes the square-error where square-error is the sum of the Euclidean distances between each pattern and its cluster center.

Algorithm

- Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilizes.
- Generate a new partition by assigning each pattern to its closest cluster center.
- Compute new cluster centers as the centroids of the clusters.
- Repeat steps 2 and 3 until an optimum value of the criterion is found.
- Adjust the number of clusters by merging and splitting existing clusters or by removing small or outlier clusters.

Initial partition

- Select k seed points at random or by taking the centroid as the first seed point and the rest at a certain miminimum distance from this seed point.
- Cluster the remaining points to the closest seed point.

Updating a partition

K-means:

- In each pass(cycle) make an assignment of all patterns to the closest cluster center.
- Recompute the cluster center after every new assignment is made.

Adjusting the number of clusters

- Clustering algorithms can create new clusters or merge existing ones if certain conditions specified by the user are met.
- Split a cluster if it has too many patterns and an unusually large variance along the feature with large spread.
- Merge if they are sufficiently close.
- Remove outliers from future consideration. (outliers are pattern/patterns that is sufficiently far removed from the rest of the data and hence suspected as a mistake in data entry.)

Performance of square-error clustering methods

- Seeks compact hyper-ellipsoidal clusters and this can produce misleading results when the data do not occur in compact, hyper-ellipsoidal boundaries.
- Exhibit inadequacies when the Euclidean measure is used to measure distance but the features are not on comparable scales.

Thu Apr 1 15:43:18 IST 1999