【原】ConsensusClusterPlus，一步到位的一致性聚类！

生信修炼手册 2022-04-27

展开全文

在之前的文章中分享了一致性聚类的原理，本文介绍下如何用R语言进行分析。ConsensusClusterPlus这个R包，就是专门用于一致性聚类分析的，为了简化调用，甚至将所有的步骤都封装到了一个函数里面，所以其使用方法非常的简单，一共三步

1. 加载R包

2. 把表达量数据读进去

3. 运行一致性聚类的函数

是不是和把大象装进冰箱一样简单，但是我们必须注意，这样简单的背后，实际是一个黑盒子，如果不了解原理，你只能得到结果，但是结果说明了什么信息，你一无所知。

下面是具体步骤

1. 准备输入数据

行为基因，列为样本的表达量数据，为了获得最佳的聚类效果，可以对基因进行筛选，对矩阵进行归一化操作，代码如下

> library(ALL)> data(ALL)> d=exprs(ALL)# 表达量数据> d[1:5,1:5]             01005    01010    03002    04006    040071000_at   7.597323 7.479445 7.567593 7.384684 7.9053121001_at   5.046194 4.932537 4.799294 4.922627 4.8445651002_f_at 3.900466 4.208155 3.886169 4.206798 3.4169231003_s_at 5.903856 6.169024 5.860459 6.116890 5.6879971004_at   5.925260 5.912780 5.893209 6.170245 5.615210> mad(d[1, ])[1] 0.2701619> mads=apply(d,1,mad)> d=d[rev(order(mads))[1:5000],]> dim(d)[1] 5000  128# 归一化操作> d = sweep(d,1, apply(d,1,median,na.rm=T))> dim(d)[1] 5000  128> d[1:5,1:5]              01005     01010       03002     04006       0400736638_at  1.5561207 0.9521271 -0.05018082  4.780378  3.9300677539318_at  1.1913532 2.5013225 -2.38793537 -1.199521  1.9362691438514_at  1.0207162 3.2785671  1.55949145 -3.345919 -0.01548269266_s_at  1.8292604 0.3624327  1.54913247 -1.286294  1.7566969438585_at -0.9240204 0.1895020  3.44968363 -2.216822  5.18702726

2. 运行ConsensusClusterPlus

ConsensusClusterPlus就是核心函数了，包括了以下几个参数

1. pItem, 选择80%的样本进行重复抽样

2. pfeature, 选择80%的基因进行重复抽样

3. maxK, 最大的K值，形成一系列梯度

4. reps, 重复抽样的数目

5. clusterAlg, 层次聚类的算法

6. distanc, 距离矩阵的算法

7. title, 输出结果的文件夹名字，包含了输出的图片

8. seed, 随机种子，用于重复结果

注意，在实际运行中，推荐reps设置的更大，比如1000， maxK设置的更大，比如20，具体代码如下

> library(ConsensusClusterPlus)> title=tempdir()> results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1, title=title,clusterAlg="hc",distance="pearson",seed=1262118388.71279,plot="png", writeTable = TRUE)end fractionclusteredclusteredclusteredclusteredclustered

函数的返回值是一个列表，每个列表子项对应给具体的K， K最小值为2

> str(results[[2]])List of 5$ consensusMatrix: num [1:128, 1:128] 1 1 0.895 1 1 ...$ consensusTree  :List of 7  ..$ merge      : int [1:127, 1:2] -1 -4 -5 -6 -7 -9 -11 -12 -14 -15 ...  ..$ height     : num [1:127] 0 0 0 0 0 0 0 0 0 0 ...  ..$ order      : int [1:128] 101 128 127 126 125 124 123 122 121 120 ...  ..$ labels     : NULL  ..$ method     : chr "average"  ..$ call       : language hclust(d = as.dist(1 - fm), method = finalLinkage)  ..$ dist.method: NULL  ..- attr(*, "class")= chr "hclust"$ consensusClass : Named int [1:128] 1 1 1 1 1 1 1 1 1 1 ...  ..- attr(*, "names")= chr [1:128] "01005" "01010" "03002" "04006" ...$ ml             : num [1:128, 1:128] 1 1 0.895 1 1 ...$ clrs           :List of 3  ..$ : chr [1:128] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" ...  ..$ : num 2  ..$ : chr [1:2] "#A6CEE3" "#1F78B4"
# 一致性矩阵，样本的邻接矩阵> dim(d)[1] 5000  128
> dim(results[[2]][["consensusMatrix"]])[1] 128 128
> results[[2]][["consensusMatrix"]][1:5,1:5]          [,1]      [,2]      [,3]      [,4]     [,5][1,] 1.0000000 1.0000000 0.8947368 1.0000000 1.000000[2,] 1.0000000 1.0000000 0.9142857 1.0000000 1.000000[3,] 0.8947368 0.9142857 1.0000000 0.8857143 0.969697[4,] 1.0000000 1.0000000 0.8857143 1.0000000 1.000000[5,] 1.0000000 1.0000000 0.9696970 1.0000000 1.000000
> results[[2]][["consensusTree"]]

Call:hclust(d = as.dist(1 - fm), method = finalLinkage)

Cluster method   : averageNumber of objects: 128
# 样本的聚类树> results[[2]][["consensusTree"]]

Call:hclust(d = as.dist(1 - fm), method = finalLinkage)

Cluster method   : averageNumber of objects: 128
# consensusClass， 样本的聚类结果> length(results[[2]][["consensusClass"]])[1] 128> results[[2]][["consensusClass"]][1:5]01005 01010 03002 04006 04007    1     1     1     1     1

# ml, 就是consensusMatrix> results[[2]][["ml"]][1:5,1:5]          [,1]      [,2]      [,3]      [,4]     [,5][1,] 1.0000000 1.0000000 0.8947368 1.0000000 1.000000[2,] 1.0000000 1.0000000 0.9142857 1.0000000 1.000000[3,] 0.8947368 0.9142857 1.0000000 0.8857143 0.969697[4,] 1.0000000 1.0000000 0.8857143 1.0000000 1.000000[5,] 1.0000000 1.0000000 0.9696970 1.0000000 1.000000> results[[2]][["consensusMatrix"]][1:5,1:5]          [,1]      [,2]      [,3]      [,4]     [,5][1,] 1.0000000 1.0000000 0.8947368 1.0000000 1.000000[2,] 1.0000000 1.0000000 0.9142857 1.0000000 1.000000[3,] 0.8947368 0.9142857 1.0000000 0.8857143 0.969697[4,] 1.0000000 1.0000000 0.8857143 1.0000000 1.000000[5,] 1.0000000 1.0000000 0.9696970 1.0000000 1.000000
# clrs, 颜色> results[[2]][["clrs"]][[1]]  [1] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[13] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[25] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[37] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[49] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[61] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[73] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3"[85] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4"[97] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4"[109] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4"[121] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4"

[[2]][1] 2

[[3]][1] "#A6CEE3" "#1F78B4"

3. 收集cluster-consensus和item-consensus 矩阵

代码如下

> icl = calcICL(results,title=title,plot="png")> icl[["clusterConsensus"]]      k cluster clusterConsensus[1,] 2       1        0.7681668[2,] 2       2        0.9788274[3,] 3       1        0.6176820[4,] 3       2        0.9190744[5,] 3       3        1.0000000[6,] 4       1        0.8446083[7,] 4       2        0.9067267[8,] 4       3        0.6612850[9,] 4       4        1.0000000[10,] 5       1        0.8175802[11,] 5       2        0.9066489[12,] 5       3        0.6062040[13,] 5       4        0.8154580[14,] 5       5        1.0000000[15,] 6       1        0.7511726[16,] 6       2        0.8802040[17,] 6       3        0.7410730[18,] 6       4        0.8154580[19,] 6       5        0.7390864[20,] 6       6        1.0000000
> dim(icl[["itemConsensus"]])[1] 2560    4> 128 * (2 + 3 + 4 + 5 + 6)[1] 2560
> icl[["itemConsensus"]][1:5,]  k cluster  item itemConsensus1 2       1 28031     0.61737822 2       1 28023     0.57972023 2       1 43012     0.59619744 2       1 28042     0.56446195 2       1 28047     0.6259350