Discussion
Three key features of k-means that make it efficient are often regarded as its biggest drawbacks:
- Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. For this reason, it is important to run diagnostic checks to determine the number of clusters in the data set when using k-means.
- Convergence to a local minimum may produce counterintuitive ("wrong") results (see example in Fig.).
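The local-minimum behaviour in the last point can be reproduced on a toy data set. The following is a minimal sketch, not a reference implementation: the helper names `sq_dist` and `lloyd`, the four sample points, and both sets of starting centers are assumptions chosen purely for illustration. The points form a wide rectangle, and the standard assign-then-update iteration is run from two different initializations.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centers, max_iter=100):
    """Plain assign/update iteration; returns (centers, labels, sse)."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centers.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # no center moved, so the iteration has converged
        centers = new_centers
    # Within-cluster sum of squared distances (the k-means objective).
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Illustrative data: corners of a 4-by-1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start with centers stacked vertically: converges to a top/bottom split.
bad_centers, bad_labels, bad_sse = lloyd(points, [(2, 0), (2, 1)])

# Start with centers on the left and right: converges to a left/right split.
good_centers, good_labels, good_sse = lloyd(points, [(0, 0), (4, 0)])

print(bad_labels, bad_sse)
print(good_labels, good_sse)
```

Both runs stop at a fixed point of the iteration, but the top/bottom split has a much larger sum of squared distances than the left/right split. Restarting from several initializations and keeping the result with the lowest objective is one common way to reduce this risk.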
A key limitation of k-means is its cluster model. The model assumes spherical clusters that are separable, so that the mean value converges towards the cluster center. The clusters are also expected to be of similar size, so that assignment to the nearest cluster center is the correct assignment. For example, when k-means is applied with a value of k = 3 to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result depends on the data set satisfying the assumptions made by the algorithm. It works well on some data sets, while failing on others.
The result of k-means can also be seen as the Voronoi cells of the cluster means. Because the data is split halfway between cluster means, this can lead to suboptimal splits, as can be seen in the "mouse" example. The Gaussian models used by the expectation-maximization (EM) algorithm (which can be seen as a generalization of k-means) are more flexible here because they have both variances and covariances. The EM result is thus able to accommodate clusters of variable size much better than k-means, as well as correlated clusters (not in this example).
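The difference between splitting halfway and using per-cluster variances can be illustrated in one dimension. This is a sketch under stated assumptions: the function names `nearest_mean` and `most_likely_gaussian`, the means, the standard deviations, and the query point are all made up for illustration, and the two Gaussian components are assumed to have equal mixing weights. It only compares the two assignment rules for fixed parameters; it does not fit anything with EM.

```python
import math


def nearest_mean(x, means):
    """k-means style assignment: index of the closest mean."""
    return min(range(len(means)), key=lambda j: (x - means[j]) ** 2)


def most_likely_gaussian(x, means, stds):
    """Gaussian assignment: index of the component with the highest density.

    Assumes equal mixing weights; the constant -0.5 * log(2 * pi) is the
    same for every component and is therefore omitted.
    """
    def log_density(j):
        return -math.log(stds[j]) - (x - means[j]) ** 2 / (2 * stds[j] ** 2)

    return max(range(len(means)), key=log_density)


# Illustrative parameters: a wide cluster at 0 and a narrow cluster at 5.
means = [0.0, 5.0]
stds = [3.0, 0.5]

# The nearest-mean boundary sits halfway between the means, at 2.5.
x = 3.0
print(nearest_mean(x, means))                # index chosen by distance alone
print(most_likely_gaussian(x, means, stds))  # index chosen using the variances
```

At x = 3.0 the nearest mean is the narrow cluster's, yet the point lies four standard deviations away from it and only one standard deviation from the wide cluster's mean, so the Gaussian rule assigns it to the wide cluster. This is the same effect that makes the fixed halfway split unsuitable for clusters of very different size.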