DBScan是由论文A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise提出的,不过我想看看wiki里的解释就差不多了,甚至可以直接看代码,这算法概念很多,实现起来倒不难。 DBScan的算法如下,不过它与weka里的实现有点区别。 DBSCAN(D, eps, MinPts) C = 0 for each unvisited point P in dataset D mark P as visited N = getNeighbors (P, eps) if sizeof(N) < MinPts mark P as NOISE else C = next cluster expandCluster(P, N, C, eps, MinPts)
expandCluster(P, N, C, eps, MinPts) add P to cluster C for each point P' in N if P' is not visited mark P' as visited N' = getNeighbors(P', eps) if sizeof(N') >= MinPts N = N joined with N' if P' is not yet member of any cluster add P' to cluster C DBScan算法的优缺点,翻译自wiki: 优点: 1. DBScan不需要像k-means算法一样,要指定簇的个数。 2. DBScan可以聚出任意形状的簇,它甚至可以找到被另一个簇包围(但不连起来的)的簇,因为有MinPts参数,single-link现象(不同的簇被一个点连起的细线连起来了)被降低了。 3. DBScan有噪音的概念。 4. DBScan只需有两个参数,并且对数据库中点的排列顺序不敏感。 缺点: 1. 只有当getNeighbors(P,epsilon)中的距离公式合理时,DBScan才能产生好的簇,最常用的距离公式是欧几里德距离公式。但在高维数据时,这个距离公式简直是没用。 2. DBScan对变密度数据集效果不佳。 源码从buildClusterer开始看: database = databaseForName(getDatabase_Type(), filteredInstances); for (int i = 0; i < database.getInstances().numInstances(); i ) { DataObject dataObject = dataObjectForName( getDatabase_distanceType(), database.getInstances() .instance(i), Integer.toString(i), database); database.insert(dataObject); } database.setMinMaxValues(); database_Type=”weka.clusterers.forOPTICSAndDBScan.Databases.SequentialDatabase”为默认值,在databaseForName中: Object o = null;
Constructor co = null; try { co = (Class.forName(database_Type)) .getConstructor(new Class[] { Instances.class }); o = co.newInstance(new Object[] { instances }); } 中调用的是: public SequentialDatabase(Instances instances) { this.instances = instances; treeMap = new TreeMap(); } 将下来在buildClusterer中的dataObjectForName与databaseForName相似,调用的是: public EuclidianDataObject(Instance originalInstance, String key, Database database) { this.database = database; this.key = key; instance = originalInstance; clusterID = DataObject.UNCLASSIFIED; processed = false; c_dist = DataObject.UNDEFINED; r_dist = DataObject.UNDEFINED; } SequentialDatabase的insert函数为: public void insert(DataObject dataObject) { treeMap.put(dataObject.getKey(), dataObject); } dataObject.getKey返回的k,就是原数据集中样本的下标。 buildClusterer中的setMinMaxValues是得到每个属性的最小值(attributeMinValues[j])最大值(attributeMaxValues[j])。 Iterator iterator = database.dataObjectIterator(); while (iterator.hasNext()) { DataObject dataObject = (DataObject) iterator.next(); if (dataObject.getClusterLabel() == DataObject.UNCLASSIFIED) { if (expandCluster(dataObject)) { clusterID ; numberOfGeneratedClusters ; } } } 直接看expandCluster函数,expandCluster的调用的第一个函数是epsilonRangeQuery: public List epsilonRangeQuery(double epsilon, DataObject queryDataObject) { ArrayList epsilonRange_List = new ArrayList(); Iterator iterator = dataObjectIterator(); while (iterator.hasNext()) { DataObject dataObject = (DataObject) iterator.next(); double distance = queryDataObject.distance(dataObject); if (distance < epsilon) { epsilonRange_List.add(dataObject); } }
return epsilonRange_List; } 这里将与dataObject距离小于epsilon的放到epsilonRage_List中。 /** dataObject is NO coreObject */ if (seedList.size() < getMinPoints()) { dataObject.setClusterLabel(DataObject.NOISE); return false; } 这里就是前面伪代码中的Size(N)<MinPts,如果小于MinPts,也就是周围没有足够的点,就将它设为噪音。 /** dataObject is coreObject */ for (int i = 0; i < seedList.size(); i ) { DataObject seedListDataObject = (DataObject) seedList.get(i); /** label this seedListDataObject with the current clusterID, because it is in epsilon-range */ seedListDataObject.setClusterLabel(clusterID); if (seedListDataObject.equals(dataObject)) { seedList.remove(i); i--; } } 如果大于MinPts,就将seedListDataObject中的样本都标记为当前簇的ID(clusterID)。 /** Iterate the seedList of the startDataObject */ for (int j = 0; j < seedList.size(); j ) { DataObject seedListDataObject = (DataObject) seedList.get(j); List seedListDataObject_Neighbourhood = database.epsilonRangeQuery( getEpsilon(), seedListDataObject);
/** seedListDataObject is coreObject */ if (seedListDataObject_Neighbourhood.size() >= getMinPoints()) { for(int i = 0; i < seedListDataObject_Neighbourhood.size(); i ) { DataObject p = (DataObject) seedListDataObject_Neighbourhood .get(i); if (p.getClusterLabel() == DataObject.UNCLASSIFIED || p.getClusterLabel() == DataObject.NOISE) { if (p.getClusterLabel() == DataObject.UNCLASSIFIED) { seedList.add(p); } p.setClusterLabel(clusterID); } } } seedList.remove(j); j--; } 将刚才得到的邻居循环,将每个邻居(P’, seedListDataObject)再找邻居,如果邻居的邻居数大于MinPts,那么对这些邻居循环,如果这个点还是没有处理过的,就将它放到seedList中,也就是扩展。然后把未处理过的点和噪音点设为当前的簇ID,这里没有将噪音点放到seedList中的原因是噪音点以前就判断过它的邻居不会超过MinPts,所以没必要再处理一次。最后移除这个点,进行下一次循环(这种情况应该用whie循环比较好吧?)。 public int clusterInstance(Instance instance) throws Exception { if (processed_InstanceID >= database.size()) processed_InstanceID = 0; int cnum = (database.getDataObject(Integer .toString(processed_InstanceID ))).getClusterLabel(); if (cnum == DataObject.NOISE) throw new Exception(); else return cnum; } 这个是返回第processed_InstanceId个样本的簇ID,与instance没关系。也就是说,每次调用都返回下一个样本的簇ID。
|
|