这篇文章解决了以下问题:
高维数据包括具有几十到几千个特征(或维度)的输入。这是一个典型的上下文问题,例如在生物信息学(各种排序数据)或NLP中,如果词汇量非常大,就会遇到这种情况。高维数据是具有挑战性的,因为:
什么是子空间聚类?子空间聚类是一种在不同子空间(一个或多个维度的选择)中发现聚类的技术。基本的假设是,我们可以找到只由维度子集定义的有效聚类(不需要具有所有N个特征的一致性)。子空间聚类是传统N维聚类分析的扩展,它允许通过创建行和列聚类同时对特征和观测进行分组。 在特征和观察的空间中,产生的聚类可能重叠。一个例子如下图所示。我们可以注意到,两个聚类中的点可能非常接近,这可能会混淆许多分析整个特征空间的传统聚类算法。 此外,我们可以看到子空间聚类成功地找到了一个子空间(维a和维c),其中预期的集群很容易识别。 子空间聚类的类型基于搜索策略,我们可以区分两种类型的子空间聚类,如下图所示:自下而上的方法首先在低维(1 D)空间中找到聚类并迭代合并它们以处理更高维空间(直到ND )。自上而下的方法在整个维度集中查找聚类并评估每个聚类的子空间。下图取自同一篇论文,概述了最常见的子空间聚类算法。 在搜索策略上,我们可以区分两种子空间聚类,如下图所示:自下而上的方法首先在低维(1d)空间中寻找聚类,然后迭代合并它们以处理高维空间(直到N D)。下图取自篇论文(https://www./exploration_files/parsons.pdf),概述了最常见的子空间聚类算法。 Clique算法简而言之,该算法的功能如下:对于每个维度(特征),我们用nBins(输入参数)分割空间;对于每个bin,我们计算直方图(计数数量)。我们只考虑dense单元,即计数高于作为第二个输入参数给定的阈值的bins。dense单元的特点如下:
导入Python库 import osimport sysimport numpy as npimport scipy.sparse.csgraphfrom sklearn import preprocessingfrom sklearn import metricsimport matplotlib.pyplot as pltimport pandas as pdfrom functools import reduceimport seaborn as snsfrom collections import Counterimport itertools 在2D中生成4个聚类 from sklearn.datasets.samples_generator import make_blobsfrom sklearn.preprocessing import StandardScalern_components = 4data, truth = make_blobs(n_samples=100, centers=n_components, random_state=42, n_features=2)data = preprocessing.MinMaxScaler().fit_transform(data)plt.scatter(data[:, 0], data[:, 1], s=50, c = truth)plt.title(f'Example of a mixture of {n_components} distributions')plt.xlabel('Feature 1')plt.ylabel('Feature 2'); class DenseUnit1D: ''' This class ''' def __init__(self, dimension, bin, minBin, maxBin, points): self.dimension = dimension # dimension index self.bin = bin # bin number self.minBin = minBin # inferior bin limit self.maxBin = maxBin # superior bin limit self.points = points # observation indexes in input data def distance(self, du): # Not in the same dimension, can't be neighbors if self.dimension != du.dimension: return -1 return abs(self.bin -du.bin) def __eq__(self, other): '''Overrides the default implementation''' if isinstance(other, DenseUnit): return (Counter(self.dimension) == Counter(other.dimension) and Counter(self.points) == Counter(other.points) ) return False def __hash__(self): return hash(str(self)) def __str__(self): return (f'Dimension {self.dimension}, bin {self.bin}, points {len(self.points)},' + f'[{round(self.minBin, 2)}, {round(self.maxBin, 2)}]')def neighbour(denseUnits1, denseUnits2): ''' Determines if 2 dense units are neighbouring ''' # We allow only 1 bin deviation in one subspace distance = 0 for subspace in range(len(denseUnits1)): subspaceDistance = denseUnits1[subspace].distance(denseUnits2[subspace]) if subspaceDistance == -1: return False distance += subspaceDistance if distance > 1: return 0 return True 设置参数 thresholdPoints = 2densenbBins = 8 创建一dimensionals dense单元 def createDenseUnitsAndGrid(data, thresholdPoints = thresholdPoints, nbBins = nbBins): ''' This method will return an array of lists, each list containing 1D dense units In 1 D subspace, each list will contain only one element ''' denseUnits1D = [] grid=[] # this is used for rendering purposes for curDim in range(data.shape[1]): minDim = min(data[:,curDim]) maxDim = max(data[:,curDim]) binSize = (maxDim - minDim)/nbBins points = data[:, curDim] g = [] # grid lines for current dimension g.append(minDim) for i in range(nbBins): endBin = minDim + binSize g.append(endBin) # Retrieve bin points per dimension if i == nbBins -1 : # last bin, make sure all points are included binPoints = np.where((points>=minDim) & (points<=maxDim))[0] endBin = maxDim else: binPoints = np.where((points>=minDim) & (points<endBin))[0] # Store only dense bins if len(binPoints)>thresholdPoints: denseUnits1D.append([DenseUnit1D(curDim, i,minDim,endBin,binPoints)]) minDim = endBin grid.append(g) return denseUnits1D, griddenseUnits1D, grid = createDenseUnitsAndGrid(data) 在网格上绘制原始数据集 plt.scatter(data[:, 0], data[:, 1])for g in grid[0]: plt.axvline(x=g, c = 'red', linestyle ='--') plt.xlabel('Feature 0') for g in grid[1]: plt.axhline(y=g, c = 'red', linestyle ='--') plt.ylabel('Feature 1') clique算法背后的直觉是存在于k维空间中的聚类也可以在k-1中找到。我们从1D开始,每个维度我们都试图找到dense bins。如果2个或更多dense bins相邻,我们将它们合并成一个更大的bin。通过将所有现有dense bins转换为图可以容易地实现该操作,其中如果2个dense单元属于相同维度并且它们的bin索引之间的差异不大于1(例如,特征3和bin 4对应的dense单元与特征3和bin 5对应的dense单元相邻)则绘制边缘。可以通过计算上述图上的连通分量来识别要合并的dense单元。 def denseBinsToClusters(candidates, plot = False, debug = False): ''' This method takes as input a collection of subspace candidates. A subspace candidate is a list of 1D dense units. This method will merge neighbouring units by projecting them onto a graph, where we can easily compute connected components ''' graph = np.identity(len(candidates)) for i in range(len(candidates)): for j in range(len(candidates)): graph[i, j] = int(neighbour(candidates[i], candidates[j])) # Find connected components in order to merge neighbouring bins nbConnectedComponents, components = scipy.sparse.csgraph.connected_components( graph, directed=False) if debug: print(graph) print(nbConnectedComponents, components) candidates = np.array(candidates) clusterAssignment = -1 * np.ones(data.shape[0]) # For every cluster for i in range(nbConnectedComponents): # Get dense units of the cluster cluster_dense_units = candidates[np.where(components == i)[0]] if debug: for v in cluster_dense_units: for z in v: print(z) clusterDimensions = {} for j in range(len(cluster_dense_units)): for k in range(len(cluster_dense_units[j])): if cluster_dense_units[j][k].dimension not in clusterDimensions: clusterDimensions[cluster_dense_units[j][k].dimension] = [] clusterDimensions[cluster_dense_units[j][k].dimension].extend(cluster_dense_units[j][k].points) points =reduce(np.intersect1d, list(clusterDimensions.values())) clusterAssignment[points] = i if plot: pred = -1 * np.ones(data.shape[0]) pred[points] = i plt.figure() plt.title(f'In yellow, clusters in {list(clusterDimensions.keys())} dimensions ') plt.scatter(data[:, 0], data[:, 1], c = pred) for g in grid[0]: plt.axvline(x=g, c = 'red', linestyle ='--') for g in grid[1]: plt.axhline(y=g, c = 'red', linestyle ='--') plt.show() if debug: print(clusterDimensions.keys(), points) return clusterAssignmentdenseBinsToClusters(denseUnits1D, plot = True, debug = False); 接下来,我们要计算每个子空间中从2到输入维数的所有有效聚类。该操作归结为计算k维度中的dense单元的组合,并且只保留大于初始最小密度阈值的dense continuous bins的重叠结果。一旦我们计算了k-1维度的dense单元,我们就可以通过计算最后k-1个候选者的所有组合来扩展到k维度。 def getSubspaceCandidates(previousUnits, subspaceDimension = 2): import itertools candidates = [] for ix1, ix2 in itertools.combinations(range(len(previousUnits)), 2): dims =[] candidate = [] for i in range(len(previousUnits[ix1])): dims.append(previousUnits[ix1][i].dimension) candidate.append(previousUnits[ix1][i]) points1= previousUnits[ix1][i].points for i in range(len(previousUnits[ix2])): dims.append(previousUnits[ix2][i].dimension) candidate.append(previousUnits[ix2][i]) points2= previousUnits[ix2][i].points points = np.intersect1d(points1, points2) # check points in common if np.unique(dims).shape[0] == subspaceDimension and points.shape[0]>thresholdPoints:# print(f'\n\nadding candidate: {len(points)}')# for v in candidate:# print(v) candidates.append(candidate) return candidatesfor subspaceDim in range(2, data.shape[1] +1): subspaceCandidates = getSubspaceCandidates(denseUnits1D, subspaceDimension = subspaceDim) pred = denseBinsToClusters(subspaceCandidates, plot = True, debug = False); 在2d中,我们能够检索如下图所示的聚类。紫色点被排除在聚类之外,因为它们位于稀疏的网格bins中。 plt.scatter(data[:, 0], data[:, 1], c = pred)for g in grid[0]: plt.axvline(x=g, c = 'red', linestyle ='--') plt.xlabel('Feature 0') for g in grid[1]: plt.axhline(y=g, c = 'red', linestyle ='--') plt.ylabel('Feature 1') 验证结果 from sklearn.metrics.cluster import adjusted_rand_scoreadjusted_rand_score(truth, pred) 0.9090183974614624 Clique聚类因其对输入参数(bins数和最小密度)的高灵敏度而受到批评,这可能导致非常不同的结果。但是,它是自下而上子空间聚类族中的一种基本算法。有多种方法可以优化clique算法。 |
|