The clearest explanation I have found is Chapter 9 of Pattern Recognition and Machine Learning, "Mixture Models and EM", which is also one of the more approachable chapters in the book.
The figure caption reads: "Illustration of the K-means algorithm using the re-scaled Old Faithful data set. (a) The green points denote the data set in a two-dimensional Euclidean space; the initial cluster centres μ1 and μ2 are shown as the red and blue crosses respectively. (b) In the initial E step, each data point is assigned to the red or the blue cluster according to which cluster centre is nearer. This is equivalent to classifying the points according to which side of the perpendicular bisector of the two centres, shown by the magenta line, they lie on. (c) In the subsequent M step, the mean of each cluster is re-computed and used as its new centre. (d)-(i) show successive E and M steps through to the final convergence of the algorithm."

The Weka code corresponding to (a) is:

// shuffle the instances, picking distinct random points as initial centroids
for (int j = initInstances.numInstances() - 1; j >= 0; j--) {
  instIndex = RandomO.nextInt(j + 1);
  hk = new DecisionTableHashKey(initInstances.instance(instIndex),
      initInstances.numAttributes(), true);
  // accept the point only if an identical one is not already a centroid
  if (!initC.containsKey(hk)) {
    m_ClusterCentroids.add(initInstances.instance(instIndex));
    initC.put(hk, null);
  }
  initInstances.swap(j, instIndex);
  if (m_ClusterCentroids.numInstances() == m_NumClusters) {
    break;
  }
}

The code corresponding to (b) is:

for (i = 0; i < instances.numInstances(); i++) {
  Instance toCluster = instances.instance(i);
  // assign the point to the cluster with the nearest centroid
  int newC = clusterProcessedInstance(toCluster, true);
  if (newC != clusterAssignments[i]) {
    converged = false;
  }
  clusterAssignments[i] = newC;
}

The code corresponding to (c) is:

m_ClusterCentroids = new Instances(instances, m_NumClusters);
for (i = 0; i < m_NumClusters; i++) {
  tempI[i] = new Instances(instances, 0);
}
// group the instances by the cluster they were assigned to
for (i = 0; i < instances.numInstances(); i++) {
  tempI[clusterAssignments[i]].add(instances.instance(i));
}
for (i = 0; i < m_NumClusters; i++) {
  if (tempI[i].numInstances() == 0) {
    // empty cluster
    emptyClusterCount++;
  } else {
    // recompute the centroid as the mean of the cluster's members
    moveCentroid(i, tempI[i], true);
  }
}

The two phases of re-assigning data points to clusters and re-computing the cluster means are repeated until there is no further change in the assignments (or until a maximum number of iterations is reached). Since each phase reduces the value of the objective function J, convergence of the algorithm is assured; however, it may converge to a local minimum rather than the global minimum. The function J is

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2

It is called the distortion function, where N is the number of samples, K is the number of clusters, r_{nk} indicates whether the n-th sample belongs to the k-th cluster, and μk is the k-th cluster centre; note that x_n and μk are vectors. In other words, J is the sum of the squared Euclidean distances between every sample and the centre of the cluster it is assigned to.
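As a sanity check on the definition above, here is a minimal, self-contained Java sketch of the distortion function (my own illustration, not code from Weka or the book; the array layout is an assumption):

public class Distortion {
  /**
   * Computes J = sum_n sum_k r_nk * ||x_n - mu_k||^2.
   * points[n]      -- the n-th sample x_n
   * assignments[n] -- the index k of the cluster with r_nk = 1
   * centroids[k]   -- the cluster centre mu_k
   */
  public static double distortion(double[][] points, int[] assignments,
                                  double[][] centroids) {
    double j = 0.0;
    for (int n = 0; n < points.length; n++) {
      double[] mu = centroids[assignments[n]];
      for (int d = 0; d < points[n].length; d++) {
        double diff = points[n][d] - mu[d];
        j += diff * diff; // accumulates ||x_n - mu_k||^2
      }
    }
    return j;
  }
}

With the assignments and centroids produced by the E and M steps above, each K-means iteration can only decrease the value this method returns.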
The caption of this figure: "Plot of the cost function J given by (9.1) after each E step (blue points) and M step (red points) of the K-means algorithm for the example shown in Figure 9.1. The algorithm has converged after the third M step, and the final EM cycle produces no changes in either the assignments or the prototype vectors." Note that the authors deliberately chose poor initial values for the cluster centres, so the algorithm takes several steps before convergence. In practice, a better initialization procedure is to choose the cluster centres μk to be equal to a random subset of K data points. It is also worth noting that the K-means algorithm itself is often used to initialize the parameters of a Gaussian mixture model before applying the EM algorithm.
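Since the snippets above are taken from inside Weka's SimpleKMeans, it may also help to see the clusterer driven through its public API. A minimal sketch, assuming a standard Weka 3.x classpath (the ARFF filename is purely illustrative):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
  public static void main(String[] args) throws Exception {
    // load the data set; the filename here is an assumption
    Instances data = new DataSource("old-faithful.arff").getDataSet();

    SimpleKMeans km = new SimpleKMeans();
    km.setNumClusters(2);     // K = 2, as in the Old Faithful figure
    km.buildClusterer(data);  // runs the initialization and E/M loop shown above

    System.out.println(km.getClusterCentroids()); // the final centres mu_k
  }
}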
The caption of this figure: "Two examples of the application of the K-means clustering algorithm to image segmentation showing the initial images together with their K-means segmentations obtained using various values of K. This also illustrates the use of vector quantization for data compression."
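To make the vector-quantization idea concrete, here is a rough, self-contained Java sketch (my own illustration, not the book's code; K = 3, the fixed 20 iterations, and the file handling are arbitrary assumptions). Each pixel's {R, G, B} values are treated as a point in a three-dimensional space, K-means clusters the pixels, and every pixel is then repainted with its cluster centre, so the output image uses only K colours:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.Random;
import javax.imageio.ImageIO;

public class KMeansSegmentation {
  public static void main(String[] args) throws Exception {
    BufferedImage img = ImageIO.read(new File(args[0])); // input image path
    int k = 3, w = img.getWidth(), h = img.getHeight(), n = w * h;

    // each pixel becomes a point (r, g, b)
    double[][] px = new double[n][3];
    for (int i = 0; i < n; i++) {
      int rgb = img.getRGB(i % w, i / w);
      px[i][0] = (rgb >> 16) & 0xFF;
      px[i][1] = (rgb >> 8) & 0xFF;
      px[i][2] = rgb & 0xFF;
    }

    // initialize the centres as a random subset of K data points
    Random rnd = new Random(42);
    double[][] mu = new double[k][];
    for (int j = 0; j < k; j++) mu[j] = px[rnd.nextInt(n)].clone();

    int[] assign = new int[n];
    for (int iter = 0; iter < 20; iter++) {
      // E step: assign each pixel to the nearest centre
      for (int i = 0; i < n; i++) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int j = 0; j < k; j++) {
          double d = 0;
          for (int c = 0; c < 3; c++) {
            double diff = px[i][c] - mu[j][c];
            d += diff * diff;
          }
          if (d < bestD) { bestD = d; best = j; }
        }
        assign[i] = best;
      }
      // M step: recompute each centre as the mean of its pixels
      double[][] sum = new double[k][3];
      int[] count = new int[k];
      for (int i = 0; i < n; i++) {
        count[assign[i]]++;
        for (int c = 0; c < 3; c++) sum[assign[i]][c] += px[i][c];
      }
      for (int j = 0; j < k; j++)
        if (count[j] > 0)
          for (int c = 0; c < 3; c++) mu[j][c] = sum[j][c] / count[j];
    }

    // vector quantization: repaint every pixel with its cluster centre
    for (int i = 0; i < n; i++) {
      double[] m = mu[assign[i]];
      int rgb = 0xFF000000 | ((int) m[0] << 16) | ((int) m[1] << 8) | (int) m[2];
      img.setRGB(i % w, i / w, rgb);
    }
    ImageIO.write(img, "png", new File("segmented.png"));
  }
}

Smaller values of K give higher compression at the expense of poorer image quality, which is the trade-off the figure illustrates.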