背景介绍如果是bulk RNA-seq,那么现在最流行的就是DESeq2 和 edgeR啦,而且有很多经过了RT-qPCR 验证过的真实测序数据可以来评价不同的差异基因算法的表现。 对单细胞测序数据来说,通常需要先聚类之后把细胞群体进行分组,然后来比较不同的组的差异表达情况。当然,也有不少单细胞测序实验设计本身就有时间点,不同个体来源,不同培养条件这样的分组! 同时还有不少方法是不需要预先分类的,因为分类本身就会引入偏差。 跟bulk RNA-seq不一样的地方是,scRNA-seq通常涉及到的样本数量更多。这时候可以使用非参检验算法,比如Kolmogorov-Smirnov test (KS-test) 等等。 下面用一个测试数据来评价一下不同的算法的表现。处理同样的表达矩阵得到差异结果跟已知的差异结果进行比较看看overlap怎么样。评价指标主要是: 召回率,True positive rate (TPR), TP/(TP + FN) 准确率,False positive rate (FPR), FP/(FP+TP) receiver-operating-characteristic curve (ROC) area under this curve (AUC)
所以需要安装并且加载一些包,安装代码如下; install.packages('ROCR') ## try http:// if https:// URLs are not supported source("https:///biocLite.R") biocLite("MAST") biocLite("scde") install.packages("devtools") library("devtools") install_github("BPSC","nghiavtr") library("BPSC")
加载代码如下: library(ROCR) library(edgeR) library(DESeq2)
library(scde) library(BPSC) library(MAST) library(monocle)
加载测试数据这里选取的是芝加哥大学Yoav Gilad lab实验的Tung et al 2017的单细胞测序文章的数据 ## 读取tung文章的数据,生成测试数据,这个代码不需要运行。 if(F){
DE <- read.table("tung/TPs.txt") notDE <- read.table("tung/TNs.txt") GroundTruth <- list(DE=as.character(unlist(DE)), notDE=as.character(unlist(notDE)))
molecules <- read.table("tung/molecules.txt", sep = "\t") anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE) keep <- anno[,1] == "NA19101" | anno[,1] == "NA19239" data <- molecules[,keep] group <- anno[keep,1] batch <- anno[keep,4] # remove genes that aren't expressed in at least 6 cells gkeep <- rowSums(data > 0) > 5; counts <- data[gkeep,] # Library size normalization lib_size = colSums(counts) norm <- t(t(counts)/lib_size*median(lib_size)) # Variant of CPM for datasets with library sizes of fewer than 1 mil molecules rm(molecules) rm(data) save.image(file = 'scRNAseq_DEG_input.Rdata') } load(file = 'scRNAseq_DEG_input.Rdata') # 我已经把测试数据保存为rdata数据格式了,直接加载。 dim(counts);
## [1] 16026 576
dim(norm);
## [1] 16026 576
dim(DE);
## [1] 1083 1
dim(notDE);
## [1] 10897 1
table(group)
## group ## NA19098 NA19101 NA19239 ## 0 288 288
可以看到这里需要选择的测试数据来源于2个人,每个人都有288个细胞的表达数据。 就是要对它们进行差异比较,而已知的1083个基因是确定显著差异的,另外10897个基因是确定不显著的。(首先,我们要假定这个是金标准!!!) 但是总共却是16026个基因,所以有一些基因是不确定显著与否的。 差异分析方法大全Kolmogorov-Smirnov testKS检验有两个弊端,首先是它假设基因表达量是连续的,如果有很多细胞表达量一致,比如都是0,表现就很差。其次它对大样本量太敏感了,可能其实差异并不大,但是样本数量很多,也会被认为是显著差异。 pVals <- apply(norm, 1, function(x) { ks.test(x[group =="NA19101"], x[group=="NA19239"])$p.value }) # multiple testing correction pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] length(sigDE)
## [1] 5095
# Number of KS-DE genes sum(GroundTruth$DE %in% sigDE)
## [1] 792
# Number of KS-DE genes that are true DE genes sum(GroundTruth$notDE %in% sigDE)
## [1] 3190
tp <- sum(GroundTruth$DE %in% sigDE) fp <- sum(GroundTruth$notDE %in% sigDE) tn <- sum(GroundTruth$notDE %in% names(pVals)[pVals >= 0.05]) fn <- sum(GroundTruth$DE %in% names(pVals)[pVals >= 0.05]) tpr <- tp/(tp + fn) fpr <- fp/(fp + tn) cat(c(tpr, fpr))
## 0.7346939 0.2944706
ks_pVals=pVals
可以看到KS检验判断的显著差异基因实在是太多了,高达5095个。所以它能找回来792个真正的差异基因。但是却找到了3190个假阳性。所以计算得到召回率73.46%,但是准确率只有29.44%,这个表现不佳。 再看看ROC和RUC # Only consider genes for which we know the ground truth pVals <- pVals[names(pVals) %in% GroundTruth$DE | names(pVals) %in% GroundTruth$notDE] truth <- rep(1, times = length(pVals)); truth[names(pVals) %in% GroundTruth$DE] = 0; pred <- ROCR::prediction(pVals, truth) perf <- ROCR::performance(pred, "tpr", "fpr") ROCR::plot(perf)
aucObj <- ROCR::performance(pred, "auc") aucObj@y.values[[1]] # AUC
## [1] 0.7954796
把这两个评价分析包装成函数,后面可以直接使用! DE_Quality_AUC <- function(pVals) { pVals <- pVals[names(pVals) %in% GroundTruth$DE | names(pVals) %in% GroundTruth$notDE] truth <- rep(1, times = length(pVals)); truth[names(pVals) %in% GroundTruth$DE] = 0; pred <- ROCR::prediction(pVals, truth) perf <- ROCR::performance(pred, "tpr", "fpr") ROCR::plot(perf) aucObj <- ROCR::performance(pred, "auc") return(aucObj@y.values[[1]]) } DE_Quality_rate <- function(sigDE) { (length(sigDE) ) # Number of KS-DE genes ( sum(GroundTruth$DE %in% sigDE) ) # Number of KS-DE genes that are true DE genes (sum(GroundTruth$notDE %in% sigDE)) tp <- sum(GroundTruth$DE %in% sigDE) fp <- sum(GroundTruth$notDE %in% sigDE) tn <- sum(GroundTruth$notDE %in% names(pVals)[pVals >= 0.05]) fn <- sum(GroundTruth$DE %in% names(pVals)[pVals >= 0.05]) tpr <- tp/(tp + fn) fpr <- fp/(fp + tn) cat(c(tpr, fpr)) }
Wilcox/Mann-Whitney-U Test也是一种非参检验,通常比较两个组数据的median的差异。 pVals <- apply(norm, 1, function(x) { wilcox.test(x[group =="NA19101"], x[group=="NA19239"])$p.value }) # multiple testing correction pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] Wilcox_pVals=pVals DE_Quality_rate(sigDE)
## 0.8376623 0.3729346
DE_Quality_AUC(pVals)
## [1] 0.8320326
召回率是81.9%,准确率是31.9%,这个表现不佳。 edgeRedgeR包在bulk RNA-seq测序领域应用很广泛,基于负二项分布模型,应用了 generalized linear model (GLM) 算法 library(edgeR) dge <- DGEList(counts=counts, norm.factors = rep(1, length(counts[1,])), group=group) group_edgeR <- factor(group) design <- model.matrix(~group_edgeR) dge <- estimateDisp(dge, design = design, trend.method="none") fit <- glmFit(dge, design) res <- glmLRT(fit) pVals <- res$table[,4] names(pVals) <- rownames(res$table) pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] edgeR_pVals=pVals DE_Quality_rate(sigDE)
## 0.8692022 0.3948121
DE_Quality_AUC(pVals)
## [1] 0.8477189
召回率是86.9%,准确率是39.4%,表现越来越好了。 Monoclemonocle不仅仅针对于基于read counts的表达矩阵,还可以是已经被各种normalization的表达矩阵,比如基于RPKM/TPM等等,它会把被normalization的表达矩阵用normal/gaussian model (gaussianff()) 算法处理一下。差异分析的时候同样也是基于负二项分布模型,应用了 generalized linear model (GLM) 算法 library(monocle) pd <- data.frame(group=group, batch=batch) rownames(pd) <- colnames(counts) pd <- new("AnnotatedDataFrame", data = pd) ## 针对于基于read counts的表达矩阵 Obj <- newCellDataSet(as.matrix(counts), phenoData=pd, expressionFamily=negbinomial.size()) Obj <- estimateSizeFactors(Obj) Obj <- estimateDispersions(Obj) res <- differentialGeneTest(Obj,fullModelFormulaStr="~group") pVals <- res[,3] names(pVals) <- rownames(res) pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] monocle_pVals=pVals DE_Quality_rate(sigDE) DE_Quality_AUC(pVals)
monocle做差异分析的耗时非常夸张,召回率是84.2%,准确率是38.1% MASTMAST基于 zero-inflated negative binomial 分布模型 library(MAST) log_counts <- log(counts+1)/log(2) fData = data.frame(names=rownames(log_counts)) rownames(fData) = rownames(log_counts); cData = data.frame(cond=group) rownames(cData) = colnames(log_counts) obj <- FromMatrix(as.matrix(log_counts), cData, fData) colData(obj)$cngeneson <- scale(colSums(assay(obj)>0)) cond <- factor(colData(obj)$cond) # Model expression as function of condition & number of detected genes zlmCond <- zlm.SingleCellAssay(~cond + cngeneson, obj) summaryCond <- summary(zlmCond, doLRT="condNA19101") summaryDt <- summaryCond$datatable summaryDt <- as.data.frame(summaryDt) pVals <- unlist(summaryDt[summaryDt$component == "H",4]) # H = hurdle model names(pVals) <- unlist(summaryDt[summaryDt$component == "H",1]) pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] MAST_pVals=pVals DE_Quality_rate(sigDE) DE_Quality_AUC(pVals)
召回率是82.8%,准确率是34.9.% BPSC这个用的是 Poisson-Beta 分布模型 library(BPSC) bpsc_data <- norm[,batch=="NA19101.r1" | batch=="NA19239.r1"] bpsc_group = group[batch=="NA19101.r1" | batch=="NA19239.r1"] control_cells <- which(bpsc_group == "NA19101") design <- model.matrix(~bpsc_group) coef=2 # group label res=BPglm(data=bpsc_data, controlIds=control_cells, design=design, coef=coef, estIntPar=FALSE, useParallel = FALSE) pVals = res$PVAL pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] BPSC_pVals=pVals DE_Quality_rate(sigDE) DE_Quality_AUC(pVals)
召回率是64.8%,准确率是30.7.% SCDESCDE是第一个特意针对单细胞转录组测序数据的差异分析而设计的,用贝叶斯统计方法把表达矩阵拟合到 zero-inflated negative binomial 分布模型里面。 library(scde) cnts <- apply( counts, 2, function(x) { storage.mode(x) <- 'integer' return(x) } ) names(group) <- 1:length(group) colnames(cnts) <- 1:length(group) o.ifm <- scde::scde.error.models( counts = cnts, groups = group, n.cores = 1, threshold.segmentation = TRUE, save.crossfit.plots = FALSE, save.model.plots = FALSE, verbose = 0, min.size.entries = 2 ) priors <- scde::scde.expression.prior( models = o.ifm, counts = cnts, length.out = 400, show.plot = FALSE ) resSCDE <- scde::scde.expression.difference( o.ifm, cnts, priors, groups = group, n.randomizations = 100, n.cores = 1, verbose = 0 ) # Convert Z-scores into 2-tailed p-values pVals <- pnorm(abs(resSCDE$cZ), lower.tail = FALSE) * 2 pVals <- p.adjust(pVals, method = "fdr") sigDE <- names(pVals)[pVals < 0.05] SCDE_pVals=pVals DE_Quality_rate(sigDE) DE_Quality_AUC(pVals)
save(SCDE_pVals,BPSC_pVals,MAST_pVals,monocle_pVals,edgeR_pVals,Wilcox_pVals,ks_pVals,file = 'DEG_results.Rdata')
统计学基础负二项分布 negative binomial modelset.seed(1) hist(rnbinom(1000, mu=10, size=100), col="grey50", xlab="Read Counts", main="Negative Binomial")
这个是被应用的最广泛的转录组表达数据分布模型。但是对单细胞转录组测序数据来说,因为有很高的dropout情况,导致模型失准,所以就提出来了zero-inflated negative binomial models zero-inflated negative binomial modelsd = 0.5; counts <- rnbinom(1000, mu=10, size=100); counts[runif(1000) < d] = 0; hist(counts, col="grey50", xlab="Read Counts", main="Zero-inflated NB")
就是在原始的负二项分布数据里面随机挑选一些低表达量基因,给它们人为赋值为0表达量值。 Poisson-Beta distributiona = 0.1 b = 0.1 g = 100 lambdas = rbeta(1000, a, b) counts = sapply(g*lambdas, function(l) {rpois(1, lambda=l)}) hist(counts, col="grey50", xlab="Read Counts", main="Poisson-Beta")
文中提到的测试数据: http://www./jmzeng/scRNA/DESeq_table.rds http://www./jmzeng/scRNA/pollen.rds http://www./jmzeng/scRNA/tung_umi.rds http://www./jmzeng/scRNA/usoskin1.rds
|