The clearest explanation I have found is Chapter 9 of Pattern Recognition and Machine Learning, "Mixture Models and EM", which is also one of the more approachable chapters in the book.
The figure caption reads: "Illustration of the K-means algorithm using the re-scaled Old Faithful data set. (a) The green points denote the data set in a two-dimensional Euclidean space; the initial cluster centres μ1 and μ2 are shown as the red and blue crosses respectively. (b) In the initial E step, each data point is assigned to the red or the blue cluster according to which cluster centre is nearer. This is equivalent to classifying the points according to which side of the perpendicular bisector of the two centres, shown by the magenta line, they lie on. (c) In the subsequent M step, the mean of each cluster is re-computed and used as its new centre. (d)-(i) show successive E and M steps through to the final convergence of the algorithm."

The Weka code corresponding to (a) is:

// shuffle the instances, picking distinct random points as initial centroids
for (int j = initInstances.numInstances() - 1; j >= 0; j--) {
  instIndex = RandomO.nextInt(j + 1);
  hk = new DecisionTableHashKey(initInstances.instance(instIndex),
      initInstances.numAttributes(), true);
  // accept the point only if an identical one is not already a centroid
  if (!initC.containsKey(hk)) {
    m_ClusterCentroids.add(initInstances.instance(instIndex));
    initC.put(hk, null);
  }
  initInstances.swap(j, instIndex);
  if (m_ClusterCentroids.numInstances() == m_NumClusters) {
    break;
  }
}

The code corresponding to (b) is:

for (i = 0; i < instances.numInstances(); i++) {
  Instance toCluster = instances.instance(i);
  // assign the point to the cluster with the nearest centroid
  int newC = clusterProcessedInstance(toCluster, true);
  if (newC != clusterAssignments[i]) {
    converged = false;
  }
  clusterAssignments[i] = newC;
}

The code corresponding to (c) is:

m_ClusterCentroids = new Instances(instances, m_NumClusters);
for (i = 0; i < m_NumClusters; i++) {
  tempI[i] = new Instances(instances, 0);
}
// group the instances by the cluster they were assigned to
for (i = 0; i < instances.numInstances(); i++) {
  tempI[clusterAssignments[i]].add(instances.instance(i));
}
for (i = 0; i < m_NumClusters; i++) {
  if (tempI[i].numInstances() == 0) {
    // empty cluster
    emptyClusterCount++;
  } else {
    // recompute the centroid as the mean of the cluster's members
    moveCentroid(i, tempI[i], true);
  }
}

The two phases of re-assigning data points to clusters and re-computing the cluster means are repeated until there is no further change in the assignments (or until a maximum number of iterations is reached). Since each phase reduces the value of the objective function J, convergence of the algorithm is assured; however, it may converge to a local minimum rather than the global minimum. The function J is

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2

It is called the distortion function, where N is the number of samples, K is the number of clusters, r_{nk} indicates whether the n-th sample belongs to the k-th cluster, and μk is the k-th cluster centre; note that x_n and μk are vectors. In other words, J is the sum of the squared Euclidean distances between every sample and the centre of the cluster it is assigned to.
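As a sanity check on the definition above, here is a minimal, self-contained Java sketch of the distortion function (my own illustration, not code from Weka or the book; the array layout is an assumption):

public class Distortion {
  /**
   * Computes J = sum_n sum_k r_nk * ||x_n - mu_k||^2.
   * points[n]      -- the n-th sample x_n
   * assignments[n] -- the index k of the cluster with r_nk = 1
   * centroids[k]   -- the cluster centre mu_k
   */
  public static double distortion(double[][] points, int[] assignments,
                                  double[][] centroids) {
    double j = 0.0;
    for (int n = 0; n < points.length; n++) {
      double[] mu = centroids[assignments[n]];
      for (int d = 0; d < points[n].length; d++) {
        double diff = points[n][d] - mu[d];
        j += diff * diff; // accumulates ||x_n - mu_k||^2
      }
    }
    return j;
  }
}

With the assignments and centroids produced by the E and M steps above, each K-means iteration can only decrease the value this method returns.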
The caption of this figure: "Plot of the cost function J given by (9.1) after each E step (blue points) and M step (red points) of the K-means algorithm for the example shown in Figure 9.1. The algorithm has converged after the third M step, and the final EM cycle produces no changes in either the assignments or the prototype vectors." Note that the authors deliberately chose poor initial values for the cluster centres, so the algorithm takes several steps before convergence. In practice, a better initialization procedure is to choose the cluster centres μk to be equal to a random subset of K data points. It is also worth noting that the K-means algorithm itself is often used to initialize the parameters of a Gaussian mixture model before applying the EM algorithm.
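Since the snippets above are taken from inside Weka's SimpleKMeans, it may also help to see the clusterer driven through its public API. A minimal sketch, assuming a standard Weka 3.x classpath (the ARFF filename is purely illustrative):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
  public static void main(String[] args) throws Exception {
    // load the data set; the filename here is an assumption
    Instances data = new DataSource("old-faithful.arff").getDataSet();

    SimpleKMeans km = new SimpleKMeans();
    km.setNumClusters(2);     // K = 2, as in the Old Faithful figure
    km.buildClusterer(data);  // runs the initialization and E/M loop shown above

    System.out.println(km.getClusterCentroids()); // the final centres mu_k
  }
}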
The caption of this figure: "Two examples of the application of the K-means clustering algorithm to image segmentation showing the initial images together with their K-means segmentations obtained using various values of K. This also illustrates the use of vector quantization for data compression."
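To make the vector-quantization idea concrete, here is a rough, self-contained Java sketch (my own illustration, not the book's code; K = 3, the fixed 20 iterations, and the file handling are arbitrary assumptions). Each pixel's {R, G, B} values are treated as a point in a three-dimensional space, K-means clusters the pixels, and every pixel is then repainted with its cluster centre, so the output image uses only K colours:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.Random;
import javax.imageio.ImageIO;

public class KMeansSegmentation {
  public static void main(String[] args) throws Exception {
    BufferedImage img = ImageIO.read(new File(args[0])); // input image path
    int k = 3, w = img.getWidth(), h = img.getHeight(), n = w * h;

    // each pixel becomes a point (r, g, b)
    double[][] px = new double[n][3];
    for (int i = 0; i < n; i++) {
      int rgb = img.getRGB(i % w, i / w);
      px[i][0] = (rgb >> 16) & 0xFF;
      px[i][1] = (rgb >> 8) & 0xFF;
      px[i][2] = rgb & 0xFF;
    }

    // initialize the centres as a random subset of K data points
    Random rnd = new Random(42);
    double[][] mu = new double[k][];
    for (int j = 0; j < k; j++) mu[j] = px[rnd.nextInt(n)].clone();

    int[] assign = new int[n];
    for (int iter = 0; iter < 20; iter++) {
      // E step: assign each pixel to the nearest centre
      for (int i = 0; i < n; i++) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int j = 0; j < k; j++) {
          double d = 0;
          for (int c = 0; c < 3; c++) {
            double diff = px[i][c] - mu[j][c];
            d += diff * diff;
          }
          if (d < bestD) { bestD = d; best = j; }
        }
        assign[i] = best;
      }
      // M step: recompute each centre as the mean of its pixels
      double[][] sum = new double[k][3];
      int[] count = new int[k];
      for (int i = 0; i < n; i++) {
        count[assign[i]]++;
        for (int c = 0; c < 3; c++) sum[assign[i]][c] += px[i][c];
      }
      for (int j = 0; j < k; j++)
        if (count[j] > 0)
          for (int c = 0; c < 3; c++) mu[j][c] = sum[j][c] / count[j];
    }

    // vector quantization: repaint every pixel with its cluster centre
    for (int i = 0; i < n; i++) {
      double[] m = mu[assign[i]];
      int rgb = 0xFF000000 | ((int) m[0] << 16) | ((int) m[1] << 8) | (int) m[2];
      img.setRGB(i % w, i / w, rgb);
    }
    ImageIO.write(img, "png", new File("segmented.png"));
  }
}

Smaller values of K give higher compression at the expense of poorer image quality, which is the trade-off the figure illustrates.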