MakeDensityBasedClusterer is not a clustering algorithm in its own right; it is a wrapper around another clusterer that fits a density model on top of it, so that logDensityPerClusterForInstance can return the distribution of a sample over each cluster. Start from the buildClusterer method:

```java
m_wrappedClusterer.buildClusterer(data);
m_model = new DiscreteEstimator[m_wrappedClusterer
    .numberOfClusters()][data.numAttributes()];
m_modelNormal = new double[m_wrappedClusterer.numberOfClusters()][data
    .numAttributes()][2];
double[][] weights = new double[m_wrappedClusterer
    .numberOfClusters()][data.numAttributes()];
m_priors = new double[m_wrappedClusterer.numberOfClusters()];
for (int i = 0; i < m_wrappedClusterer.numberOfClusters(); i++) {
  m_priors[i] = 1.0; // laplace correction
  for (int j = 0; j < data.numAttributes(); j++) {
    if (data.attribute(j).isNominal()) {
      m_model[i][j] = new DiscreteEstimator(data.attribute(j)
          .numValues(), true);
    }
  }
}
```

Here m_wrappedClusterer is the wrapped object, an instance of some clustering algorithm class. m_model and m_modelNormal are the arrays holding the distributions of the nominal and numeric attributes, respectively. weights records the total sample weight of each attribute in each cluster, and m_priors holds the prior probability of each cluster. By default the wrapped clusterer is K-Means, so m_wrappedClusterer.buildClusterer clusters the data with it. The double loop at the end applies the Laplace correction to m_priors (each cluster starts with a pseudo-count of 1) and initializes m_model.
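The effect of the Laplace correction can be seen in a small self-contained sketch. This is a simplified stand-in for Weka's DiscreteEstimator, written for illustration only: initializing every count to 1 means a value never observed in a cluster still gets a non-zero probability.

```java
// Simplified stand-in for weka.estimators.DiscreteEstimator (illustration only).
class LaplaceEstimatorSketch {
    private final double[] counts;
    private double sumOfCounts;

    // With laplace == true every symbol starts with count 1, mirroring the
    // DiscreteEstimator(numSymbols, true) call seen above.
    LaplaceEstimatorSketch(int numSymbols, boolean laplace) {
        counts = new double[numSymbols];
        if (laplace) {
            java.util.Arrays.fill(counts, 1.0);
            sumOfCounts = numSymbols;
        }
    }

    void addValue(double data, double weight) {
        counts[(int) data] += weight; // note: accumulation, not assignment
        sumOfCounts += weight;
    }

    double getProbability(double data) {
        if (sumOfCounts == 0) return 0;
        return counts[(int) data] / sumOfCounts;
    }

    public static void main(String[] args) {
        LaplaceEstimatorSketch e = new LaplaceEstimatorSketch(3, true);
        e.addValue(0, 2.0); // symbol 0 observed with total weight 2
        // Counts are now {3, 1, 1}, sum 5: the unseen symbols 1 and 2
        // still receive probability 1/5 instead of 0.
        System.out.println(e.getProbability(0)); // 0.6
        System.out.println(e.getProbability(1)); // 0.2
    }
}
```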
```java
// Compute mean, etc.
int[] clusterIndex = new int[data.numInstances()];
for (int i = 0; i < data.numInstances(); i++) {
  inst = data.instance(i);
  int cluster = m_wrappedClusterer.clusterInstance(inst);
  m_priors[cluster] += inst.weight();
  for (int j = 0; j < data.numAttributes(); j++) {
    if (!inst.isMissing(j)) {
      if (data.attribute(j).isNominal()) {
        m_model[cluster][j].addValue(inst.value(j), inst.weight());
      } else {
        m_modelNormal[cluster][j][0] += inst.weight() * inst.value(j);
        weights[cluster][j] += inst.weight();
      }
    }
  }
  clusterIndex[i] = cluster;
}
```

Here the cluster variable is the cluster ID, and m_priors accumulates the total sample weight of that cluster. If the attribute is nominal, the statistics are collected with DiscreteEstimator's addValue:

```java
public void addValue(double data, double weight) {
  m_Counts[(int) data] += weight;
  m_SumOfCounts += weight;
}
```

If the attribute is numeric, only the weighted sum is accumulated for now. clusterIndex records which cluster each sample belongs to.

```java
for (int j = 0; j < data.numAttributes(); j++) {
  if (data.attribute(j).isNumeric()) {
    for (int i = 0; i < m_wrappedClusterer.numberOfClusters(); i++) {
      if (weights[i][j] > 0) {
        m_modelNormal[i][j][0] /= weights[i][j];
      }
    }
  }
}
```

This is the second half of the mean computation for numeric attributes: dividing by the total sample weight. After this, m_modelNormal[i][j][0] holds the mean of each attribute in each cluster.

```java
// Compute standard deviations
for (int i = 0; i < data.numInstances(); i++) {
  inst = data.instance(i);
  for (int j = 0; j < data.numAttributes(); j++) {
    if (!inst.isMissing(j)) {
      if (data.attribute(j).isNumeric()) {
        double diff = m_modelNormal[clusterIndex[i]][j][0]
            - inst.value(j);
        m_modelNormal[clusterIndex[i]][j][1] += inst.weight()
            * diff * diff;
      }
    }
  }
}
```

This is the first half of the standard-deviation computation: accumulating the weighted squared differences from the mean.

```java
for (int j = 0; j < data.numAttributes(); j++) {
  if (data.attribute(j).isNumeric()) {
    for (int i = 0; i < m_wrappedClusterer.numberOfClusters(); i++) {
      if (weights[i][j] > 0) {
        m_modelNormal[i][j][1] = Math
            .sqrt(m_modelNormal[i][j][1] / weights[i][j]);
      } else if (weights[i][j] <= 0) {
        m_modelNormal[i][j][1] = Double.MAX_VALUE;
      }
      if (m_modelNormal[i][j][1] <= m_minStdDev) {
        m_modelNormal[i][j][1] = data.attributeStats(j)
            .numericStats.stdDev;
        if (m_modelNormal[i][j][1] <= m_minStdDev) {
          m_modelNormal[i][j][1] = m_minStdDev;
        }
      }
    }
  }
}
```
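The two-step weighted mean used above (accumulate weight times value, then divide by the accumulated weight) can be isolated into a minimal sketch; the data here is made up for illustration:

```java
// Minimal sketch of the two-step weighted mean computed above.
class WeightedMeanSketch {
    static double weightedMean(double[] values, double[] weights) {
        double sum = 0, weightSum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += weights[i] * values[i]; // step 1: accumulate weight * value
            weightSum += weights[i];       //         and the total weight
        }
        // step 2: divide by the total weight (guard against empty clusters)
        return weightSum > 0 ? sum / weightSum : 0;
    }

    public static void main(String[] args) {
        // Illustrative data: (1*1.0 + 3*3.0) / (1 + 3) = 10/4 = 2.5
        System.out.println(weightedMean(
            new double[] {1.0, 3.0}, new double[] {1.0, 3.0})); // 2.5
    }
}
```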
This is the second half of that computation: divide by the total weight and take the square root. The result is then checked against the specified minimum standard deviation m_minStdDev, since there is no point in modelling a spread smaller than that: if the value is too small, the attribute's global standard deviation is substituted, and if even that is too small, it is clamped to m_minStdDev. A cluster with no weight for the attribute gets Double.MAX_VALUE. Note that m_modelNormal[i][j][1] therefore ends up holding the standard deviation (not the variance) of each attribute in each cluster.

```java
Utils.normalize(m_priors);
```

m_priors previously held the total sample weight of each cluster; it is now normalized into prior probabilities.

```java
public double[] logDensityPerClusterForInstance(Instance inst)
    throws Exception {

  int i, j;
  double logprob;
  double[] wghts = new double[m_wrappedClusterer.numberOfClusters()];

  for (i = 0; i < m_wrappedClusterer.numberOfClusters(); i++) {
    logprob = 0;
    for (j = 0; j < inst.numAttributes(); j++) {
      if (!inst.isMissing(j)) {
        if (inst.attribute(j).isNominal()) {
          logprob += Math.log(m_model[i][j]
              .getProbability(inst.value(j)));
        } else { // numeric attribute
          logprob += logNormalDens(inst.value(j),
              m_modelNormal[i][j][0], m_modelNormal[i][j][1]);
        }
      }
    }
    wghts[i] = logprob;
  }
  return wghts;
}
```

This sums, for each cluster, the log-probabilities of the sample's attribute values under that cluster's model. For a nominal value it is simple:

```java
public double getProbability(double data) {
  if (m_SumOfCounts == 0) {
    return 0;
  }
  return (double) m_Counts[(int) data] / m_SumOfCounts;
}
```

For a numeric value a normal distribution is assumed:

```java
private double logNormalDens(double x, double mean, double stdDev) {
  double diff = x - mean;
  return -(diff * diff / (2 * stdDev * stdDev)) - m_normConst
      - Math.log(stdDev);
}
```

wghts[i] is the sum of log-probabilities for the i-th cluster, or, more precisely, the log-density of the sample conditional on that cluster.
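The three terms of logNormalDens are exactly the log of the normal density, log N(x; mean, stdDev) = -(x - mean)^2 / (2 * stdDev^2) - log(sqrt(2 * pi)) - log(stdDev), where m_normConst is the precomputed constant log(sqrt(2 * pi)). A self-contained sketch (the class name and test values are my own, only the formula follows the code above):

```java
// Sketch of the log of the normal (Gaussian) density used above.
class LogNormalDensSketch {
    // Corresponds to m_normConst: log(sqrt(2 * PI)), precomputed once.
    static final double NORM_CONST = Math.log(Math.sqrt(2 * Math.PI));

    static double logNormalDens(double x, double mean, double stdDev) {
        double diff = x - mean;
        return -(diff * diff / (2 * stdDev * stdDev)) - NORM_CONST
            - Math.log(stdDev);
    }

    public static void main(String[] args) {
        // At the mean of a standard normal the density is 1/sqrt(2*PI),
        // so its log is -log(sqrt(2*PI)), about -0.9189.
        System.out.println(logNormalDens(0, 0, 1));
        // Exponentiating recovers the density itself, about 0.3989.
        System.out.println(Math.exp(logNormalDens(0, 0, 1)));
    }
}
```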
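Downstream, these per-cluster log densities are typically combined with the priors in m_priors to obtain cluster membership probabilities. The following is a simplified sketch of that idea, not Weka's actual code; subtracting the maximum before exponentiating (the log-sum-exp trick) keeps the computation stable when the log densities are very negative:

```java
// Sketch: turn per-cluster log densities plus priors into posterior
// probabilities P(cluster | x). Simplified illustration, not Weka's code.
class PosteriorSketch {
    static double[] posterior(double[] logDensityPerCluster, double[] priors) {
        int n = logDensityPerCluster.length;
        double[] logJoint = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            // log( P(cluster i) * p(x | cluster i) )
            logJoint[i] = Math.log(priors[i]) + logDensityPerCluster[i];
            if (logJoint[i] > max) max = logJoint[i];
        }
        // Subtract the max before exponentiating to avoid underflow,
        // then normalize so the posteriors sum to 1.
        double sum = 0;
        double[] post = new double[n];
        for (int i = 0; i < n; i++) {
            post[i] = Math.exp(logJoint[i] - max);
            sum += post[i];
        }
        for (int i = 0; i < n; i++) post[i] /= sum;
        return post;
    }

    public static void main(String[] args) {
        // Two clusters with equal priors; cluster 0 fits the sample much better.
        double[] post = posterior(new double[] {-1.0, -5.0},
                                  new double[] {0.5, 0.5});
        System.out.println(post[0] + " " + post[1]); // post[0] is about 0.982
    }
}
```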