在之前的文章中分享了一致性聚类的原理,本文介绍下如何用R语言进行分析。ConsensusClusterPlus这个R包,就是专门用于一致性聚类分析的,为了简化调用,甚至将所有的步骤都封装到了一个函数里面,所以其使用方法非常的简单,一共三步 1. 加载R包 2. 把表达量数据读进去 3. 运行一致性聚类的函数 是不是和把大象装进冰箱一样简单,但是我们必须注意,这样简单的背后,实际是一个黑盒子,如果不了解原理,你只能得到结果,但是结果说明了什么信息,你一无所知。 下面是具体步骤 1. 准备输入数据 行为基因,列为样本的表达量数据,为了获得最佳的聚类效果,可以对基因进行筛选, 对矩阵进行归一化操作,代码如下 > library(ALL) > data(ALL) > d=exprs(ALL) # 表达量数据 > d[1:5,1:5] 01005 01010 03002 04006 04007 1000_at 7.597323 7.479445 7.567593 7.384684 7.905312 1001_at 5.046194 4.932537 4.799294 4.922627 4.844565 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997 1004_at 5.925260 5.912780 5.893209 6.170245 5.615210 > mad(d[1, ]) [1] 0.2701619 > mads=apply(d,1,mad) > d=d[rev(order(mads))[1:5000],] > dim(d) [1] 5000 128 # 归一化操作 > d = sweep(d,1, apply(d,1,median,na.rm=T)) > dim(d) [1] 5000 128 > d[1:5,1:5] 01005 01010 03002 04006 04007 36638_at 1.5561207 0.9521271 -0.05018082 4.780378 3.93006775 39318_at 1.1913532 2.5013225 -2.38793537 -1.199521 1.93626914 38514_at 1.0207162 3.2785671 1.55949145 -3.345919 -0.01548269 266_s_at 1.8292604 0.3624327 1.54913247 -1.286294 1.75669694 38585_at -0.9240204 0.1895020 3.44968363 -2.216822 5.18702726
2. 运行ConsensusClusterPlus ConsensusClusterPlus就是核心函数了,包括了以下几个参数 1. pItem, 选择80%的样本进行重复抽样 2. pfeature, 选择80%的基因进行重复抽样 3. maxK, 最大的K值,形成一系列梯度 4. reps, 重复抽样的数目 5. clusterAlg, 层次聚类的算法 6. distanc, 距离矩阵的算法 7. title, 输出结果的文件夹名字,包含了输出的图片 8. seed, 随机种子,用于重复结果 注意,在实际运行中,推荐reps设置的更大,比如1000, maxK设置的更大,比如20,具体代码如下 > library(ConsensusClusterPlus) > title=tempdir() > results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1, title=title,clusterAlg="hc",distance="pearson",seed=1262118388.71279,plot="png", writeTable = TRUE) end fraction clustered clustered clustered clustered clustered
函数的返回值是一个列表,每个列表子项对应给具体的K, K最小值为2 > str(results[[2]]) List of 5 $ consensusMatrix: num [1:128, 1:128] 1 1 0.895 1 1 ... $ consensusTree :List of 7 ..$ merge : int [1:127, 1:2] -1 -4 -5 -6 -7 -9 -11 -12 -14 -15 ... ..$ height : num [1:127] 0 0 0 0 0 0 0 0 0 0 ... ..$ order : int [1:128] 101 128 127 126 125 124 123 122 121 120 ... ..$ labels : NULL ..$ method : chr "average" ..$ call : language hclust(d = as.dist(1 - fm), method = finalLinkage) ..$ dist.method: NULL ..- attr(*, "class")= chr "hclust" $ consensusClass : Named int [1:128] 1 1 1 1 1 1 1 1 1 1 ... ..- attr(*, "names")= chr [1:128] "01005" "01010" "03002" "04006" ... $ ml : num [1:128, 1:128] 1 1 0.895 1 1 ... $ clrs :List of 3 ..$ : chr [1:128] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" ... ..$ : num 2 ..$ : chr [1:2] "#A6CEE3" "#1F78B4"
# 一致性矩阵,样本的邻接矩阵 > dim(d) [1] 5000 128
> dim(results[[2]][["consensusMatrix"]]) [1] 128 128
> results[[2]][["consensusMatrix"]][1:5,1:5] [,1] [,2] [,3] [,4] [,5] [1,] 1.0000000 1.0000000 0.8947368 1.0000000 1.000000 [2,] 1.0000000 1.0000000 0.9142857 1.0000000 1.000000 [3,] 0.8947368 0.9142857 1.0000000 0.8857143 0.969697 [4,] 1.0000000 1.0000000 0.8857143 1.0000000 1.000000 [5,] 1.0000000 1.0000000 0.9696970 1.0000000 1.000000
> results[[2]][["consensusTree"]]
Call: hclust(d = as.dist(1 - fm), method = finalLinkage)
Cluster method : average Number of objects: 128
# 样本的聚类树 > results[[2]][["consensusTree"]]
Call: hclust(d = as.dist(1 - fm), method = finalLinkage)
Cluster method : average Number of objects: 128
# consensusClass, 样本的聚类结果 > length(results[[2]][["consensusClass"]]) [1] 128 > results[[2]][["consensusClass"]][1:5] 01005 01010 03002 04006 04007 1 1 1 1 1
# ml, 就是consensusMatrix > results[[2]][["ml"]][1:5,1:5] [,1] [,2] [,3] [,4] [,5] [1,] 1.0000000 1.0000000 0.8947368 1.0000000 1.000000 [2,] 1.0000000 1.0000000 0.9142857 1.0000000 1.000000 [3,] 0.8947368 0.9142857 1.0000000 0.8857143 0.969697 [4,] 1.0000000 1.0000000 0.8857143 1.0000000 1.000000 [5,] 1.0000000 1.0000000 0.9696970 1.0000000 1.000000 > results[[2]][["consensusMatrix"]][1:5,1:5] [,1] [,2] [,3] [,4] [,5] [1,] 1.0000000 1.0000000 0.8947368 1.0000000 1.000000 [2,] 1.0000000 1.0000000 0.9142857 1.0000000 1.000000 [3,] 0.8947368 0.9142857 1.0000000 0.8857143 0.969697 [4,] 1.0000000 1.0000000 0.8857143 1.0000000 1.000000 [5,] 1.0000000 1.0000000 0.9696970 1.0000000 1.000000
# clrs, 颜色 > results[[2]][["clrs"]] [[1]] [1] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [13] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [25] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [37] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [49] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [61] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [73] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" [85] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4" [97] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" [109] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" [121] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4"
[[2]] [1] 2
[[3]] [1] "#A6CEE3" "#1F78B4"
3. 收集cluster-consensus和item-consensus 矩阵 代码如下
> icl = calcICL(results,title=title,plot="png") > icl[["clusterConsensus"]] k cluster clusterConsensus [1,] 2 1 0.7681668 [2,] 2 2 0.9788274 [3,] 3 1 0.6176820 [4,] 3 2 0.9190744 [5,] 3 3 1.0000000 [6,] 4 1 0.8446083 [7,] 4 2 0.9067267 [8,] 4 3 0.6612850 [9,] 4 4 1.0000000 [10,] 5 1 0.8175802 [11,] 5 2 0.9066489 [12,] 5 3 0.6062040 [13,] 5 4 0.8154580 [14,] 5 5 1.0000000 [15,] 6 1 0.7511726 [16,] 6 2 0.8802040 [17,] 6 3 0.7410730 [18,] 6 4 0.8154580 [19,] 6 5 0.7390864 [20,] 6 6 1.0000000
> dim(icl[["itemConsensus"]]) [1] 2560 4 > 128 * (2 + 3 + 4 + 5 + 6) [1] 2560
> icl[["itemConsensus"]][1:5,] k cluster item itemConsensus 1 2 1 28031 0.6173782 2 2 1 28023 0.5797202 3 2 1 43012 0.5961974 4 2 1 28042 0.5644619 5 2 1 28047 0.6259350
4. 结果解读 在输出文件夹中,包含了多种输出可视化结果,每种结果的含义如下 1)consensus matrix 热图 consensus matrix 为样本方阵,数值代表两个同属一个cluster的可能性,取值范围从0到1, 颜色从白色到深蓝色 2)consensus 累计分布图 CDF 对于每个K对应的consensus matrix, 采用100个bin的柱状图来计算累计分布, CDF图可以用来帮助决定最佳的K值 3)delta area plot 对于每个K, 计算K和K-1相比,CDF 曲线下面积的相对变化,对于K=2, 因为没有K=1, 所以是totla CDF curve area,选取增加不明显的点作为最佳的K值 4)tracling plot 行为样本,列为每个K, 用热图展示样本在每个K下的cluster, 用于定性评估不稳定的聚类和不稳定的样本
|