【原】一个完美的单细胞亚群随机森林分离器是如何炼成的

健明 2022-11-05 发布于广东

展开全文

常规的单细胞转录组降维聚类分群和生物学命名大家应该是都看厌了，而且确实很简单，所以我们前些日子想另辟蹊径，见：各个单细胞亚群的特异性基因集合的打分能准确划分其亚群吗？。首先把单细胞分成有区分度的生物学亚群，然后找各个亚群的特异性基因，然后对这些基因列表在单细胞转录组表达量矩阵里面进行打分，发现也是可以蛮好的区分之前的单细胞亚群。

虽然两个CD4的T细胞其实大量共享高表达量基因，两个单核细胞也是如此，而CD8和NK也是如此，所以它们的AddModuleScore打分也是有些微混杂，不过最重要的问题是我们的可视化并没有体现出来AddModuleScore打分是否是足够好的分类器。

实际上，机器学习这个时候可以派上用场，我们首先演示随机森林的用法，并且简单肉眼看看它的效果。还是前面的PBMC3K数据集：

首先单细胞各个亚群找特异性高表达量基因：

# devtools::install_github('satijalab/seurat-data')
library(SeuratData) #加载seurat数据集  
getOption('timeout')
options(timeout=10000)
# InstallData("pbmc3k")  
data("pbmc3k")  
library(Seurat) 
sce <- pbmc3k.final  
# 需要合并一些单细胞亚群：
levels(Idents(sce))
## Assigning cell type identity to clusters
new.cluster.ids <- c("CD4 T","CD4 T", "Mono",  "B", "CD8 T",
                     "Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(sce)
sce <- RenameIdents(sce, new.cluster.ids)
DimPlot(sce, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

future::plan("multiprocess", workers = 4) 
sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, 
                              min.pct = 0.25, 
                              thresh.use = 0.25)
save(sce.markers,file = 'sce.markers.FindAllMarkers.pbmc.Rdata')

load(file = 'sce.markers.FindAllMarkers.pbmc.Rdata')
library(dplyr) 
top10 <- sce.markers %>% group_by(cluster) %>% top_n(10, avg_log2FC)

然后把单细胞表达量矩阵划分为训练集和测试集

sce <- NormalizeData(sce, normalization.method =  "LogNormalize",  
                         scale.factor = 1e4)
GetAssay(sce,assay = "RNA")
sce <- FindVariableFeatures(sce, 
                                selection.method = "vst", nfeatures = 2000)  
sce <- ScaleData(sce) 
sce <- ScaleData(sce,features = unique(top10$gene)) 

t_expr <- t(as.matrix( sce@assays$RNA@scale.data ))
dim(t_expr) 
t_expr[1:4,1:4] 

# 1. 划分训练集和验证集
library(randomForest)
library(caret)
library(pROC)
library(caret)

inTrain<-createDataPartition(y= Idents(sce) ,p=0.25,list=F)
test_expr <-t_expr[inTrain,]
train_expr <-t_expr[-inTrain,]

test_y <- Idents(sce)[inTrain]
train_y  <- Idents(sce)[-inTrain] 
save(test_y,train_y,
     test_expr,train_expr,file = 'input.Rdata')

千万要注意，这个时候我使用的是sce@assays$RNA@scale.data 里面的表达量矩阵，而且里面的基因就是我们前面的各个亚群找特异性高表达量基因列表哦。

接着使用 randomForest 函数在训练集构建模型

library(randomForest)
library(caret)
library(pROC)
library(caret)
load(file = 'input.Rdata')

train_expr[1:4,1:4]
table(train_y)
 
predictor_data = train_expr 
target = train_y 

# 直接使用 randomForest 函数即可：
rf_output=randomForest(x=predictor_data, y=target,
                       importance = TRUE, ntree = 10001, proximity=TRUE )
rf_output
save(rf_output,file='rf_output.Rdata')

在测试集上面看模型效果


# 构建好的随机森林模型，首先自我预测，在前面的75%的训练集，这里略
load(file='rf_output.Rdata')
load(file = 'input.Rdata')

# 然后预测我们预留下来的另外的25%的测试集
test_outputs <- predict(rf_output,newdata = test_expr,type="prob")
test_expr[1:4,1:4]
head(test_outputs)
pred_y = colnames(test_outputs)[apply(test_outputs, 1, which.max)]
pred_y = factor(pred_y,levels = levels(test_y))

pdf('RF-performance.pdf',width = 10)
gplots::balloonplot(table(pred_y,test_y))
dev.off()

简单的肉眼就可以看到这个单细胞随机森林分离器非常完美，基本上没有什么误差：

单细胞随机森林分离器非常完美

当然了，如果是系统性学习过机器学习算法，理论上我们的这样的分类器应该是有评价指标，而不是简单的肉眼看。StatQuest生物统计学视频是一个很优秀的生物统计学教程，教程作者是Josh Starmer （个人博客https:///），生信菜鸟图很早之前就推过相关的学习资源。而且还组建过学习小分队，给视频写配套笔记：

另外推荐生信菜鸟团的《周日-鲍志炜专栏》

如果是是python呢，我们生信菜鸟团的《周日-鲍志炜专栏》也有一个机器学习系列教程，目录如下：

写在文末

我在《生信技能树》，《生信菜鸟团》，《单细胞天地》的大量推文教程里面共享的代码都是复制粘贴即可使用的，有任何疑问欢迎留言讨论，也可以发邮件给我，详细描述你遇到的困难的前因后果给我，我的邮箱地址是 jmzeng1314@163.com

如果你确实觉得我的教程对你的科研课题有帮助，让你茅塞顿开，或者说你的课题大量使用我的技能，烦请日后在发表自己的成果的时候，加上一个简短的致谢，如下所示：

We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

十年后我环游世界各地的高校以及科研院所（当然包括中国大陆）的时候，如果有这样的情谊，我会优先见你。