文章复现:Integration of Single-Cell RNA Sequencing and Bulk RNA Sequencing Data to Establish and Validate a Prognostic Model for Patients With Lung Adenocarcinoma
全文数据分析,没有湿实验。在GEO数据库下载了一份scRNA的数据,两份基因芯片的数据,还下载了TCGA的数据。研究肺腺癌(LUAD),整合scRNA-seq和传统的RNA-seq数据来构建LUAD患者的预后模型,并采用两个外部验证队列来验证其风险分层能力。
文章整体思路如下:
单细胞降维分群注释,找出tumor和normal间关键细胞类型。单细胞功能富集,拟时序分析,细胞通讯分析 TCGA数据进行WGCNA分析,找到hub gene后,取hub gene和TCGA DEGs的交集,进行后续分析 单因素Cox回归鉴定潜在的预后DEGs,非负矩阵分解进行sample clustering(亚型),不同cluster的生存分析,免疫浸润分析 单因素Cox回归筛选出的gene用LASSO模型进一步筛选,筛选完后进行多因素Cox回归,计算每个样本的risk score,根据risk score分成两类,进行生存分析,将构建的模型应用于两个GEO芯片数据,进行同样分析 高危低危人群与临床信息相关性分析,分析risk score是否可作为预后指标,基因突变分析 本次复现只是为了复现图表,数据不一定具有生物学意义(大家可以理解为凑图或者灌水)
Step 1.单细胞数据处理 先下载好数据,一共4个样本就新建4个文件夹,名字命名成样本名,文件夹内带'GSM'开头的数据是下载的原始数据,重新复制一份,把文件名改成标准的read10x函数读入格式,一定注意名字不要写错。
一定要保证这3个文件同时存在,而且在同一个文件夹下面。
示例代码是:
rm(list=ls()) options(stringsAsFactors = F )library (Seurat) sce1 <- CreateSeuratObject(Read10X('../10x-results/WT/' ), "wt" )
重点就是 Read10X 函数读取 文件夹路径,比如:../10x-results/WT/ ,保证文件夹下面有3个文件。
1.1 QC library(Seurat) library(harmony) library(dplyr) library(stringr) rm(list = ls())# 多个数据读取与合并 rawdata_path <- './rawdata' filename <- list.files(rawdata_path) rawdata_path <- paste(rawdata_path,filename,sep = '/' ) sceList <- lapply(rawdata_path, function (x){ obj <- CreateSeuratObject(counts = Read10X(x), project = str_split(x,'/' )[[1]][3]) }) names(sceList) <- filename
这里sceList是一个列表,列表里共4个元素,每个元素都是一个seurat对象,每个元素的名字就是上面的样本名。rawdata_path就是上面下载的.gz文件 所在文件夹的路径,Read10X函数只需要传入文件夹路径就可以自动读入数。 project参数传入的是一个字符串,一般用样本名即可,这里用str_split函数分割路径字符串得到样本名
sce <- merge(sceList[[1]],sceList[-1],add.cell.ids = names(sceList),project = 'luad' ) sce@meta.data$group <- str_split(sce@meta.data$orig .ident,'_' ,simplify = T)[,2]
把4个seurat对象用merge函数合并后,看一下每个细胞的基本信息,这里我根据作者给出的信息,把疾病和正常分成两组,又增加了一列,ade1是疾病,nor是正常
# 查看线粒体(MT开头)、核糖体(RPS/RPL开头)、血红细胞所占比例 grep('^MT' ,x=rownames(sce@assays$RNA @data),value = T) grep('^RP[SL]' ,x=rownames(sce@assays$RNA @data),value = T) grep('^HB[^(P)]' ,x=rownames(sce@assays$RNA @data),value = T) sce <- PercentageFeatureSet(sce,'^MT' ,col.name = 'percent_MT' ) sce <- PercentageFeatureSet(sce,'^RP[SL]' ,col.name = 'percent_RP' ) sce <- PercentageFeatureSet(sce,'^HB[^(P)]' ,col.name = 'percent_HB' ) VlnPlot(sce,features = "nCount_RNA" ,pt.size = 0,y.max = 10000) VlnPlot(sce,features = "nFeature_RNA" ,pt.size = 0,y.max = 2500) VlnPlot(sce,features = "percent_MT" ,pt.size = 0) VlnPlot(sce,features = "percent_RP" ,pt.size = 0) VlnPlot(sce,features = "percent_HB" ,pt.size = 0,y.max = 0.1) VlnPlot(sce,features = c("nCount_RNA" ,"nFeature_RNA" ,"percent_MT" ),pt.size = 0,group.by = 'orig.ident' )
对于每个细胞,我又添加了三列信息:线粒体基因占比,核糖体基因占比,红细胞基因占比。这几列用来过滤数据,这里重新给sce赋值是因为,PercentageFeatureSet函数的返回值还是一个seurat对象,和原来的sce相比, 就只是在meta data里多了一列而已,所以直接重新赋值,把原来的seurat对象sce给更新掉 过滤前先看下数据,32538个gene,13060个细胞
# 过滤细胞 sce <- subset(sce,subset = nCount_RNA>1000 & nFeature_RNA>300 & percent_MT<25)# 过滤基因 sce <- sce[rowSums(sce@assays$RNA @counts>0)>3,]# 过滤线粒体、核糖体、血红细胞进占比高的细胞 sce <- subset(sce,subset = percent_MT<25 & percent_RP<30 & percent_HB<0.1)
过滤完后22249个gene,7892个细胞
# 细胞周期评分 # S.Score较高的为S期,G2M.Score较高的为G2M期,都比较低的为G1期 s_feature <- cc.genes.updated.2019$s .genes g2m_feature <- cc.genes.updated.2019$g2m .genes sce <- CellCycleScoring(sce, s.features = s_feature, g2m.features = g2m_feature, set.ident = T) VlnPlot(sce,features = c('S.Score' ,'G2M.Score' ),group.by = 'orig.ident' ,pt.size = 0) saveRDS(sce,'sce_qc.rds' )
再看一下细胞周期,把细胞周期两个评分加到meta data里,更新sce
1.2 用harmony去除批次效应 library(harmony) library(dplyr) library(Seurat) library(clustree) rm(list = ls()) sce <- readRDS('sce_qc.rds' ) sce <- NormalizeData(sce, normalization.method = 'LogNormalize' , scale.factor = 10000) sce <- FindVariableFeatures(sce, selection.method = "vst" , nfeatures = 2000)# 默认用variableFeature做scale sce <- ScaleData(sce) sce <- RunPCA(sce,features = VariableFeatures(sce)) DimPlot(sce,reduction = 'pca' ,group.by = 'group' )
从PCA降维的结果看,两个组的批次效应不大,均匀的混合在一起,但是这里还是用harmony试试
sce <- RunHarmony(sce,group.by.vars = 'orig.ident' ) ElbowPlot(sce,reduction = 'harmony' ) sce <- RunUMAP(sce,dims = 1:10,reduction = 'harmony' ) sce <- RunTSNE(sce,dims = 1:10,reduction = 'harmony' ) DimPlot(sce,reduction = 'umap' ,label = T,group.by = 'group' ) DimPlot(sce,reduction = 'tsne' ,label = T,group.by = 'group' )
选择orig.ident这一列,去除4个样本之间的批次效应。根据肘部图确定降维后的维度。
sce <- FindNeighbors(sce,reduction = 'harmony' ,dims = 1:10) sce_res <- FindClusters(sce_res,resolution = 0.8)# 设置不同的分辨率查看分组情况 sce_res <- scefor (i in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)){ sce_res <- FindClusters(sce_res,resolution = i) } clustree(sce_res,prefix = 'RNA_snn_res.' )
FindNeighbors函数计算细胞间的距离,FindClusters函数确定分群结果。上面的UMAP看着大概分了7坨,但实际上并不是7个分群,实际分群数量还是要看FindClusters的结果。FindClusters函数分辨率不同,分群数量会不同,具体分多少需要摸索。这里分辨率选了0.5,对应第六行蓝色,13个分群,如果选0.8的话,感觉分群数量会突然增加好多,不太合适。
sce <- FindClusters(sce,resolution = 0.5) saveRDS(sce,file = 'step2_harmony.rds' ) DimPlot(sce,reduction = 'umap' ,group.by = 'orig.ident' ) DimPlot(sce,reduction = 'umap' ,group.by = 'seurat_clusters' )
确定好分辨率后,分群赋值给sce,在meta data里会多出来两列,这两列是一样的,表示每个细胞分在哪个cluster里,最后保存一下更新后的sce对象,再来看下分群结果
1.3 细胞注释 library(SingleR) library(celldex) library(Seurat) library(dplyr) library(stringr) library(pheatmap) library(ReactomeGSA) library(monocle) library(limma) library(ggplot2) library(msigdbr) library(singleseqgset) rm(list = ls()) gc() sce <- readRDS('step2_harmony.rds' )# singleR注释 hpca.se <- HumanPrimaryCellAtlasData() bpe.se <- BlueprintEncodeData() anno <- SingleR(sce@assays$RNA @data, ref = list(BP=bpe.se,HPCA=hpca.se), labels = list(bpe.se$label .main,hpca.se$label .main), clusters = sce@meta.data$seurat_clusters )
用SingleR自动注释,HumanPrimaryCellAtlasData和BlueprintEncodeData是SingleR自带的两个数据库,用两个数据库注释效果好像比一个会好点。得到的anno也是一个对象,每个细胞群的注释信息就藏在里面,13个分群被注释成7种细胞,其中有6个分群都是上皮细胞,也许上皮细胞还能继续细分。想继续细分的话,对sce取子集,把标签为上皮细胞的细胞取出来,降维分群,丢到SingleR()函数里再注释
plotScoreHeatmap(anno,clusters = anno@rownames,show_colnames = T)
再看下注释结果的打分,不满意的话也可以手动注释
sce@meta.data$singleR_label <- unlist(lapply(sce@meta.data$seurat_clusters , function (x){anno$labels [x]})) DimPlot(sce,reduction = 'umap' ,group.by = 'singleR_label' ,label = T) DimPlot(sce,reduction = 'tsne' ,group.by = 'singleR_label' ,label = T)
anno里面有每个分群对应的注释信息,把这个信息添加到sce的meta data里,然后就可以在分群的图上添加上注释信息了
1.4 鉴定关键细胞类型 cell_list <- split(colnames(sce), sce$singleR_label ) deg <- c()for (i in names(cell_list)){ sce_temp <- sce[,colnames(sce) %in % cell_list[[i]]] sce_temp <- CreateSeuratObject(counts = sce_temp@assays$RNA @counts, meta.data = sce_temp@meta.data, min.cells = 3, min.features = 200) sce_temp <- NormalizeData(sce_temp) sce_temp <- FindVariableFeatures(sce_temp) sce_temp <- ScaleData(sce_temp,vars.to.regress = c('nFeature_RNA' ,'"percent_MT"' )) Idents(sce_temp) <- sce_temp$group sce_temp_markers <- FindAllMarkers(object=sce_temp, only.pos = T, logfc.threshold = 2, min.pct = 0.1) sce_temp_markers <- sce_temp_markers[order(sce_temp_markers$cluster ,sce_temp_markers$avg_log2FC ),] deg_gene <- sce_temp_markers[sce_temp_markers$p_val_adj <0.05,'gene' ] deg <- c(deg,length(deg_gene)) } df <- data.frame(cell_type = names(cell_list),deg_gene = deg)
把每种类型的细胞分别拿出来,用FindAllMarkers找出ade1和nor组之间的差异基因
1.5 单细胞通路富集 temp <- sce Idents(temp) <- temp@meta.data$singleR_label reactome_obj <- analyse_sc_clusters(temp,create_reactome_visualization=T,use_interactors=F) path_ways <- pathways(reactome_obj) path_ways$diff <- rowMax(as.matrix(path_ways[,2:ncol(path_ways)])) - rowMin(as.matrix(path_ways[,2:ncol(path_ways)])) path_ways <- path_ways[order(path_ways$diff ,decreasing = T),] plot_gsva_heatmap(reactome_obj,rownames(path_ways)[1:10],margins = c(10,20)) saveRDS(sce,file = 'step3_sce.rds' )
用pathway()提取富集出来的结果,每行是一个通路,每列是一种细胞,计算每个通路在7种细胞中表达差异的diff,根据diff排序,画出前10个通路
1.6 拟时序分析 library(Seurat) library(monocle) sce <- readRDS('step3_sce.rds' )# 1. 将seurat对象转化为monocle的CDS对象 sparse_data <- as(as.matrix(sce@assays$RNA @counts),'sparseMatrix' ) mdata <- new('AnnotatedDataFrame' ,data=sce@meta.data) # 行为cell fData <- data.frame(gene_short_name=row.names(sparse_data),row.names = row.names(sparse_data)) fd <- new('AnnotatedDataFrame' ,data=fData) # 行为gene 列为gene description monocle_cds <- newCellDataSet(cellData = sparse_data, phenoData = mdata, featureData = fd, lowerDetectionLimit = 0.5, expressionFamily = negbinomial.size())# 2. 计算size factor和离散度 monocle_cds <- estimateSizeFactors(monocle_cds) monocle_cds <- estimateDispersions(monocle_cds)# 3.过滤低质量细胞 monocle_cds <- detectGenes(monocle_cds,min_expr = 0.1)# 4. 找高变基因、降维分群 # 用seurat找的高变基因 sce_var_gene <- VariableFeatures(sce) monocle_cds <- setOrderingFilter(monocle_cds,sce_var_gene) monocle_cds <- reduceDimension(monocle_cds,num_dim=10,norm_method = 'log' ,reduction_method = 'tSNE' ) monocle_cds <- clusterCells(monocle_cds,num_clusters = 10) plot_cell_clusters(monocle_cds,color_by = 'singleR_label' ) diff_test_gene <- differentialGeneTest(monocle_cds[sce_var_gene,],fullModelFormulaStr = '~singleR_label' ) diff_gene <- row.names(subset(diff_test_gene,qval<0.01)) monocle_cds <- setOrderingFilter(monocle_cds,diff_gene)# 5. 用reversed graph embedding降维 monocle_cds <- reduceDimension(monocle_cds,reduction_method = 'DDRTree' )# 6. 计算细胞拟时间 monocle_cds <- orderCells(monocle_cds)
这里运行到第六步,orderCells()函数大概率会报错,详细的解决办法可以参考知乎上的贴子https://zhuanlan.zhihu.com/p/589134519 ,这里说一下大概解决步骤
下载一个R文件,order_cells.R,链接在知乎贴子里,然后source调用 把orderCells()中用DDRTree计算的那部分代码单独拿出来,写成一个函数,后面直接调用,记得给出返回值。 或者不封装成函数的形式也行,那就不需要返回值了# 6. 计算细胞拟时间 source ('order_cells.R' ) library('igraph' ) my_ordercell <- function (cds){ root_state = NULL num_paths = NULL reverse = NULL root_cell <- select_root_cell(cds, root_state, reverse) cds@auxOrderingData <- new.env(hash = TRUE) if (cds@dim_reduce_type == "DDRTree" ) { if (is.null(num_paths) == FALSE) { message("Warning: num_paths only valid for method 'ICA' in reduceDimension()" ) } cc_ordering <- extract_ddrtree_ordering(cds, root_cell) pData(cds)$Pseudotime <- cc_ordering[row.names(pData(cds)), ]$pseudo_time K_old <- reducedDimK(cds) old_dp <- cellPairwiseDistances(cds) old_mst <- minSpanningTree(cds) old_A <- reducedDimA(cds) old_W <- reducedDimW(cds) cds <- project2MST(cds, project_point_to_line_segment) minSpanningTree(cds) <- cds@auxOrderingData[[cds@dim_reduce_type]]$pr_graph_cell_proj_tree root_cell_idx <- which (V(old_mst)$name == root_cell, arr.ind = T) cells_mapped_to_graph_root <- which (cds@auxOrderingData[["DDRTree" ]]$pr_graph_cell_proj_closest_vertex == root_cell_idx) if (length(cells_mapped_to_graph_root) == 0) { cells_mapped_to_graph_root <- root_cell_idx } cells_mapped_to_graph_root <- V(minSpanningTree(cds))[cells_mapped_to_graph_root]$name tip_leaves <- names(which (degree(minSpanningTree(cds)) == 1)) root_cell <- cells_mapped_to_graph_root[cells_mapped_to_graph_root %in % tip_leaves][1] if (is.na(root_cell)) { root_cell <- select_root_cell(cds, root_state, reverse) } cds@auxOrderingData[[cds@dim_reduce_type]]$root_cell <- root_cell cc_ordering_new_pseudotime <- extract_ddrtree_ordering(cds, root_cell) pData(cds)$Pseudotime <- cc_ordering_new_pseudotime[row.names(pData(cds)), ]$pseudo_time if (is.null(root_state) == TRUE) { closest_vertex <- cds@auxOrderingData[["DDRTree" ]]$pr_graph_cell_proj_closest_vertex pData(cds)$State <- cc_ordering[closest_vertex[, 1], ]$cell_state } } cds } monocle_cds <- my_ordercell(monocle_cds)
# 可以画图了 plot_cell_trajectory(monocle_cds,color_by = 'Pseudotime' ) plot_cell_trajectory(monocle_cds,color_by = 'State' ) plot_cell_trajectory(monocle_cds,color_by = 'singleR_label' ) plot_cell_trajectory(monocle_cds,color_by = 'singleR_label' )+facet_wrap(~singleR_label,nrow = 3) saveRDS(monocle_cds,file = 'monocle.rds' )
1.7 细胞通讯 library(CellChat) library(patchwork) library(Seurat) library(dplyr) library(ggalluvial) library(NMF) options(stringsAsFactors = FALSE) rm(list = ls()) gc() sce <- readRDS('step3_sce.rds' )# Step1.创建cellchat对象 data.input <- sce@assays$RNA @data meta.data <- sce@meta.data cellchat <- createCellChat(object=data.input, meta = meta.data, group.by='singleR_label' ) cellchat <- addMeta(cellchat,meta = meta.data)# Step2.加载CellChatDB数据库 cellchatDB <- CellChatDB.human cellchat@DB <- cellchatDB# Step3.处理表达数据 cellchat <- subsetData(cellchat) cellchat <- identifyOverExpressedGenes(cellchat) cellchat <- identifyOverExpressedInteractions(cellchat)# Step4.计算通讯概率,推断细胞通讯网络 这一步很耗时 cellchat <- computeCommunProb(cellchat,population.size = F) cellchat <- filterCommunication(cellchat,min.cells = 10)# Step5. 提取预测的细胞通讯网络为data frame df.net <- subsetCommunication(cellchat) df.pathway <- subsetCommunication(cellchat,slot.name = 'netP' )# Step6. 在信号通路水平推断细胞通讯 cellchat <- computeCommunProbPathway(cellchat)# Step7. 计算加和的cell-cell通讯网络 cellchat <- aggregateNet(cellchat) par(mfrow = c(1,2)) netVisual_circle(cellchat@net$count ,vertex.weight = groupSize, weight.scale = T,label.edge = F, title.name = 'number of Interaction' ) netVisual_circle(cellchat@net$weight ,vertex.weight = groupSize, weight.scale = T,label.edge = F, title.name = 'Interaction Weight' ) mat <- cellchat@net$weight par(mfrow = c(3,3),xpd=T)for (i in 1:nrow(mat)){ mat2 <- matrix(0,nrow = nrow(mat),ncol = ncol(mat), dimnames = dimnames(mat)) mat2[i,] <- mat[i,] netVisual_circle(mat2,vertex.weight = groupSize, weight.scale = T,edge.weight.max = max(mat), title.name = rownames(mat)[i]) }
# Step8. 使用层次图(Hierarchical plot),圆圈图(Circle plot)或和弦图(Chord diagram)可视化每个信号通路 cellchat@netP$pathways pathway.show <- c('SEMA4' ) vertex.receiver <- c(1,2,3,4) netVisual_aggregate(cellchat,signaling = pathway.show, vertex.receiver = vertex.receiver, layout = 'hierarchy' )
par(mfrow=c(1,2),xpd=T) netVisual_aggregate(cellchat,signaling = pathway.show, layout = 'circle' ) netVisual_aggregate(cellchat,signaling = pathway.show, layout = 'circle' ,label.edge=T)
pairLR.CXCL <- extractEnrichedLR(cellchat,signaling = pathway.show, geneLR.return = F) LR.show <- pairLR.CXCL[1,] netVisual_individual(cellchat,signaling = pathway.show, pairLR.use = 'SEMA4A_PLXNB2' , layout = 'circle' ) netVisual_aggregate(cellchat,signaling = pathway.show,layout = 'chord' , title.name = "Chord diagram 1" )
Step 2. TCGA差异分析 library(TCGAbiolinks) library(SummarizedExperiment) library(AnnoProbe) library(stringr) library(limma) library(tinyarray) library(data.table) library(ggplot2) library(dplyr) library(tibble) library(pheatmap) library(clusterProfiler) library(org.Hs.eg.db) rm(list = ls())# 下载LUAD表达矩阵 query <- GDCquery(project = 'TCGA-LUAD' , data.category = 'Transcriptome Profiling' , data.type = 'Gene Expression Quantification' , workflow.type = 'STAR - Counts' ) GDCdownload(query) dat <- GDCprepare(query) exp <- assay(dat) save(exp,file='TCGA_LUAD_exp.Rdata' )# 下载LUAD临床信息 query <- GDCquery(project = 'TCGA-LUAD' , data.category = 'Clinical' , data.type = 'Clinical Supplement' , file.type = 'xml' ) GDCdownload(query) dat <- GDCprepare_clinic(query,clinical.info = 'patient' ) k = apply(dat, 2, function (x){!all(is.na(x))});table(k) clinical <- dat[,k] save(clinical,file = 'TCGA_LUAD_Clinical.Rdata' )
注意一下TCGA的样本命名含义:
以这个为例:TCGA-A6-6650-01A-11R-1774-07
A6 :Tissue source site,组织来源编码
6650 : Participant, 参与者编号
01 : Sample, 编号01~09表示肿瘤,10~19表示正常对照
A : Vial, 在一系列患者组织中的顺序,绝大多数样本该位置编码都是A;
很少数的是B,表示福尔 马林固定石蜡包埋组织,已被证明用于测序分析的效果不佳,所以不建议使用-01B的样本数据
11 : Portion, 同属于一个患者组织的不同部分的顺序编号,同一组织会分割为100-120mg的部分,分别使用
R : RNA, 分析的分子类型,对应关系如下所示:
1774 : Plate, 在一系列96孔板中的顺序,值大表示制板越晚
07 : Center, 测序或鉴定中心编码
# count 转换成TPM gene_length <- fread('gene_len.txt' ) colnames(gene_length) <- c('id' ,'length' ) gene_length <- as.data.frame(gene_length) gene_length <- gene_length[!duplicated(gene_length$id ),] exp <- as.data.frame(exp) exp$id <- unlist(lapply(rownames(exp),function (x){str_split(x,'\\.' ,simplify = T)[,1]})) merge <- left_join(x=exp,y=gene_length,by='id' ) merge <- na.omit(merge) merge <- merge[!duplicated(merge$id ),] rownames(merge) <- merge$id merge <- merge[,-601]# 计算TPM kb <- merge$length / 1000 rpk <- merge[,-601] / kb tpm <- t(t(rpk) / colSums(rpk) * 1E6) exp <- tpm
采用limma做差异分析,先把counts转换成TPM。不转也可以用DEseq2做差异分析。gene_length是每个基因的长度,这个文件是把gtf文件中,基因的终止位置-起始位置得到的。
# 更改行名:ENSEMBL -> SYMBOL a <- annoGene(str_split(rownames(exp),'\\.' ,simplify = T)[,1],'ENSEMBL' ) a <- a[!duplicated(a$ENSEMBL ),] rownames(a) <- a$ENSEMBL rownames(exp) <- a[str_split(rownames(exp),'\\.' ,simplify = T)[,1],'SYMBOL' ]# 给样本分组 tumor/normal group <- ifelse(sapply(str_split(colnames(exp),'-' ,simplify = T)[,4],function (x){as.integer(substr(x,1,2))})>=10, 'normal' ,'tumor' ) sample_group <- data.frame(sample=colnames(exp),group=group)
a是annoGene函数返回的一个ID转换的结果,是一个dataframe,多个ENSEMBL ID可能对应同一个SYMBOL ID,对symbol去重后,再映射成exp的行名。 按照TCGA名字中第四部分,把样本分成tumor和normal
# 过滤 # 1. 去除FFPE样本 sample_group$sample_type <- ifelse(sapply(str_split(colnames(exp),'-' ,simplify = T)[,4],function (x){substr(x,3,3)})=='A' , 'tissue' ,'ffpe' ) exp <- exp[,sample_group$sample_type !='ffpe' ] sample_group <- sample_group[sample_group$sample_type =='tissue' ,]# 2. 保留在一半以上样本中有值的gene exp <- exp[rowSums(exp>0)>as.integer(ncol(exp)/2),] dim(exp) save(exp,file = 'TCGA_LUAD_TPM_exp.Rdata' )
FFPE样本无论做测序还是蛋白组,代谢组,效果都不怎么好
# 对表达矩阵取log2 for (i in 1:ncol(exp)){ exp[,i] <- log2(exp[,i]+0.0001) }
# 差异筛选 design <- model.matrix(~factor(sample_group$group ,levels = c('tumor' ,'normal' ))) fit <- lmFit(exp,design = design) fit <- eBayes(fit) deg <- topTable(fit, coef = 2,number = Inf) deg$change <- ifelse((deg$logFC >log2(2))&(deg$adj .P.Val<0.05),'up' , ifelse((deg$logFC <log2(0.5))&(deg$adj .P.Val<0.05),'down' ,'nochange' )) deg <- na.omit(deg) table(deg$change )
做完差异筛选,有10个gene含空值,删去有空值的行即可。更新下exp并保存好
# deg比exp少了几个gene exp <- exp[intersect(rownames(exp),deg$ID ),] save(exp,file = 'TCGA_exp.Rdata' ) save(deg,file = 'TCGA_DEGS.Rdata' )
# volcano plot ggplot(deg,aes(x=deg$logFC ,y=-log10(deg$adj .P.Val),color=change))+ geom_point()+ geom_hline(yintercept = -log10(0.05),linetype=2,linewidth=1)+ geom_vline(xintercept = log2(0.5),linetype=2,linewidth=1)+ geom_vline(xintercept = log2(2),linetype=2,linewidth=1)+ scale_color_manual(values=c('blue' ,'gray' ,'red' ))# heatmap deg_heatmap <- deg[order(deg$logFC ,decreasing = T),] deg_gene <- deg_heatmap[deg_heatmap$change !='nochange' ,'ID' ] heatmap_deg_gene <- as.data.frame(exp)[deg_gene,] heatmap_deg_gene <- na.omit(heatmap_deg_gene) anno_col <- data.frame(group=sample_group$group ) rownames(anno_col) <- sample_group$sample pheatmap(heatmap_deg_gene,show_rownames = F,cluster_rows = F,scale = 'row' , show_colnames = F,annotation_col=anno_col , breaks = seq(-2.5,2.5,length.out=100)) dim(heatmap_deg_gene)# pca plot exp <- as.data.frame(exp) pca <- prcomp(t(exp)) ggplot(data.frame(pca$x ),aes(x=PC1,y=PC2,color=sample_group$group ))+ geom_point(size = 3)+stat_ellipse(level = 0.95, show.legend = F)+ labs(title = 'PCA Analysis' )+ theme(plot.title = element_text(hjust = 0.5))+ guides(color = guide_legend(title = 'Group' ))
# GO enrich go_bp <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL' ,'BP' ) go_mf <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL' ,'MF' ) go_cc <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL' ,'CC' ) dotplot(go_bp) barplot(go_bp,showCategory = 4)# KEGG enrich entrez_id <- bitr(deg_gene,fromType = 'SYMBOL' ,toType = 'ENTREZID' ,OrgDb = org.Hs.eg.db) kegg <- enrichKEGG(entrez_id$ENTREZID ,organism = 'hsa' ,keyType = 'ncbi-geneid' ) dotplot(kegg) barplot(kegg,showCategory = 4)
Step 3. WGCNA分析 library(WGCNA) library(reshape2) library(stringr) library(dplyr) library(patchwork) library(ggplot2) rm(list = ls()) gc() load('TCGA_DEGS.Rdata' ) load('TCGA_exp.Rdata' ) load('TCGA_LUAD_Clinical.Rdata' ) options(stringsAsFactors = F) # 开启多线程 enableWGCNAThreads()
# 1. 数据处理 # >>> 选取表达量前10000的gene,用全部gene的话,计算后面的邻接矩阵和TOM矩阵太慢了 temp <- as.data.frame(exp) temp$avg_exp <- rowMeans(temp) temp <- temp[order(temp$avg_exp ,decreasing = T),] top10000_gene <- rownames(temp[1:10000,]) exp <- as.data.frame(exp[top10000_gene,])# >>> 查找并删除异常值 temp <- exp colnames(temp) <- 1:ncol(exp) plot(hclust(dist(t(temp)))) # 没有异常样本
# 2. 构建基因共表达网络 exp <- as.data.frame(t(exp))# >>> 2.1 选择合适的power power_vec <- c(seq(1, 10, by = 1), seq(12, 20, by = 2)) sft <- pickSoftThreshold(exp,powerVector = power_vec) a <- ggplot(sft$fitIndices ,aes(x=Power,y=-sign(slope)*SFT.R.sq))+ geom_text(label=sft$fitIndices $Power ,color='red' )+ geom_hline(yintercept = 0.9,color='red' )+ xlab("Soft Threshold (power)" )+ylab("Scale Free Topology Model Fit,signed R^2" )+ ggtitle("Scale independence" )+theme(plot.title = element_text(hjust = 0.5)) b <- ggplot(sft$fitIndices ,aes(x=Power,y=mean.k.))+ geom_text(label=sft$fitIndices $Power ,color='red' )+ xlab("Soft Threshold (power)" )+ylab("Mean Connectivity" )+ ggtitle('Mean connectivity' )+theme(plot.title = element_text(hjust = 0.5)) a+b
这里power=12,达到0.8即可
power <- sft$powerEstimate # >>> 2.2 检验所选power是否符合无尺度网络 k <- softConnectivity(exp,power = power) scaleFreePlot(k)
power=12确实也符合无尺度网络,R2较高
# >>> 2.3 计算gene在sample间的corr或者distance,构建临近矩阵。默认用dist函数计算(欧氏距离) adj <- adjacency(exp,power = power)# >>> 2.4 根据临近矩阵计算TOM(拓扑重叠矩阵) TOM <- TOMsimilarity(adj)# >>> 2.5 计算gene间相异性 dissTOM <- 1-TOM
# 3. 对gene进行聚类并可视化 geneTree <- hclust(as.dist(dissTOM),method = 'average' ) sizeGrWindow(12,9) plot(geneTree, xlab="" , sub="" , main = "Gene clustering on TOM-based dissimilarity" , labels = FALSE, hang = 0.04)# >>> 3.2 将gene聚类树进行修剪后,把gene归到不同的模块里 dynamic_modules <- cutreeDynamic(dendro = geneTree,distM = dissTOM,minClusterSize = 50, deepSplit = 2,pamRespectsDendro=F,minSplitHeight = NULL) table(dynamic_modules) module_color <- labels2colors(dynamic_modules) plotDendroAndColors(geneTree,module_color,dendroLabels = F,hang=0.03,addGuide = T, guideHang = 0.05,groupLabels='Dynamic color tree' , main='gene dendrogram and module colors' )
# >>> 3.3 计算每个模块的特征基因向量(对每个模块的gene表达量降维后,只保留PC1,即MEs) MEList <- moduleEigengenes(exp,colors = dynamic_modules) MEs <- MEList$eigengenes # >>> 3.4 计算特征基因的相异度,基于相异度对原有模块进行层次聚类 dissME <- 1-cor(MEs) MEtree <- hclust(as.dist(dissME),method = 'average' ) plotEigengeneNetworks(MEs, "Eigengene adjacency heatmap" , marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90) # >>> 3.5 如果模块太多,可以对模块进行合并.设置剪切高度,剪切高度以下的模块进行合并 cut_height <- 0.2 sizeGrWindow(8,6) plot(MEtree,main = "Clustering of module eigengenes" , xlab = "" , sub = "" ) abline(h=cut_height,col='red' )
把高度0.2以下的模块进行合并,再看看合并前后模块的变化
module_merge <- mergeCloseModules(exp,module_color,cutHeight = cut_height) merged_color <- module_merge$colors merged_MEs <- module_merge$newMEs sizeGrWindow(8,6) plotDendroAndColors(geneTree,cbind(module_color,merged_color),dendroLabels = F,hang=0.03, addGuide = T,groupLabels = c("Dynamic Tree Cut" , "Merged dynamic" ))
# 4. 计算每个模块与临床信息的相关性、pvalue。找出与临床信息相关性高的模块 nGenes <- ncol(exp) nSamples <- nrow(exp)################################准备临床信息#################################### clinical <- clinical[,c('bcr_patient_barcode' ,'gender' ,'residual_tumor' )] traits <- data.frame(row.names = rownames(exp),ID=rownames(exp)) traits$ID <- substr(traits$ID ,1,12) index <- match(traits$ID ,clinical$bcr_patient_barcode ) traits$gender <- as.numeric(as.factor(clinical[index,'gender' ])) traits$residual_tumor <- as.numeric(as.factor(clinical[index,'residual_tumor' ])) traits$type <- ifelse(sapply(str_split(rownames(traits),'-' ,simplify = T)[,4], function (x){as.integer(substr(x,1,2))})>=10,'normal' ,'tumor' ) traits$type <- as.numeric(as.factor(traits$type )) traits <- traits[,-1]################################准备临床信息####################################
clinical是前面下载的TCGA临床数据,这里随便取了两个比较关心的临床信息。clinical的每一行是一个病人的ID,而表达矩阵每一行是一个样本,一个病人身上可能有多个样本。在把临床信息和表达矩阵一一对应的时候,如果直接把exp的行名取前12位,就会产生重复值。在做的时候,先准备好样本和病人ID的对应关系,再把clinical里病人的信息给匹配过来 最终的临床信息是这样,要把字符串变成数值型变量
module_trait_cor <- cor(merged_MEs,traits,use = 'p' ) module_trait_pvalue <- corPvalueStudent(module_trait_cor,nSamples) trait_colors <- numbers2colors(traits,signed = T,centered = T) a <- paste(signif(module_trait_cor, 2), "\n(" , signif(module_trait_pvalue, 1), ")" , sep = "" ) sizeGrWindow(8,6) labeledHeatmap(module_trait_cor,xLabels = colnames(traits),yLabels = colnames(merged_MEs),colorLabels = FALSE, colors = blueWhiteRed(50),textMatrix = a, setStdMargins = FALSE,cex.text = 0.65,zlim = c(-1,1), main = paste("Module-trait relationships" ))
模块与tumor/normal之间的相关性并不是很高。。。。。就拿粉色模块当做感兴趣的模块吧
# 5. 找出感兴趣的模块内的枢纽gene # >>> 5.1 计算gene表达量与临床性状的相关性 gs <- apply(exp,2,function (x){cor(x,traits$type )})# >>> 5.2 基于网络计算gene与临床性状的相关性 ns <- networkScreening(y=traits$type ,datME = merged_MEs,datExpr = exp)# >>> 5.3 计算每个gene与特征gene的相关性 kme <- signedKME(datExpr = exp,datME = merged_MEs)# >>> 5.4 根据gs, ns, kme 找出hub gene hubgene_filter <- (gs>0.2 & ns$q .Weighted<0.01 & abs(kme$kMEmagenta )>0.4) hub_gene <- colnames(exp)[hubgene_filter] plotNetworkHeatmap(exp, plotGenes = hub_gene, networkType = "unsigned" , useTOM = TRUE, power=8, main="unsigned correlations" )
看下kme,其中粉色模块里,gene的相关性并不是很高,这里筛选了>0.4的,最后找到25个hub gene。如果阈值再设高一点,hub gene就全卡掉了。。。这跟没有用所有的3w多个gene去计算有关,也跟一些参数有关,我没有一个个去调整优化,只是展示一下如何做图和分析思路
hub_gene <- intersect(hub_gene,deg$ID ) save(hub_gene,file = 'TCGA_WGCNA_hub_gene.Rdata' )
最后找出hub gene和前面deg的交集部分,作为hub gene,存档
Step 4. 非负矩阵分解算法 4.1 NMF算法对样本分类 library(NMF) library(survival) library(stringr) library(data.table) library(survminer) library(dplyr) library(AnnoProbe) library(CIBERSORT) library(tidyr) library(tibble) library(RColorBrewer) library(ggalluvial) library(ggplot2) options(stringsAsFactors = F) rm(list = ls()) gc()
load('TCGA_LUAD_TPM_exp.Rdata' ) load('TCGA_DEGS.Rdata' ) load('TCGA_WGCNA_hub_gene.Rdata' ) dim(exp)# 对表达矩阵取log2 exp <- log2(exp+1) ranks <- seq(2,10) hubgene_exp <- exp[hub_gene,] estimate <- nmf(hubgene_exp,2:10,nrun=50,method = 'brunet' ) plot(estimate)
注意hubgene_exp里不能有负数,不然会报错,rank为2时,波动最大
rank <- 2 nmf.rank4 <- nmf(hubgene_exp,rank,nrun=50,method = 'brunet' ) index <- extractFeatures(nmf.rank4,'max' ) index <- unlist(index) nmf.exp <- hubgene_exp[index,] nmf.exp <- na.omit(nmf.exp) group <- predict(nmf.rank4) jco <- c("#2874C5" ,"#EABF00" ,"#C6524A" ,"#868686" ) consensusmap(nmf.rank4,labRow=NA,labCol=NA, annCol = data.frame("cluster" =group[colnames(nmf.exp)]), annColors = list(cluster=c("1" =jco[1],"2" =jco[2],"3" =jco[3],"4" =jco[4]))) cluster_df <- data.frame(row.names = colnames(exp),ID=substr(colnames(exp),1,12),cluster=group)
NMF算法把所有样本分成两类,顺便看下最后的clueter_df
4.2 生存分析 load('TCGA_LUAD_TPM_exp.Rdata' ) load('TCGA_LUAD_Clinical.Rdata' )# 1. 取log(TPM) for (i in 1:ncol(exp)){ exp[,i] <- log2(exp[,i]+0.0001) }# 2. 以病人为中心,去重 exp <- as.data.frame(exp) exp <- exp[,sort(colnames(exp))] index <- !duplicated(substr(colnames(exp),1,12)) exp <- exp[,index]
# 3.读取从xena下载的生存数据 sur <- fread('TCGA-LUAD.survival.tsv' ) sur <- as.data.frame(sur) patient_id <- substr(colnames(exp),1,12) clin <- data.frame(row.names = patient_id,ID=patient_id) index <- match(clin$ID ,sur[,'_PATIENT' ]) clin <- cbind(clin,sur[index,c('OS' ,'OS.time' )]) clin <- na.omit(clin) index2 <- match(clin$ID ,clinical$bcr_patient_barcode ) clin <- cbind(clin,clinical[index2,'gender' ]) colnames(clin) <- c(colnames(clin)[1:ncol(clin)-1],'gender' ) clin$gender <- as.numeric(as.factor(clin$gender ))
下载下来的生存数据长这样 之前从TCGA下载的临床数据长这样 合并后的clin是这样,所有变量类型都要转成数值型
# 合并NMF结果 cluster_type_group <- ifelse(sapply(str_split(rownames(cluster_df),'-' ,simplify = T)[,4], function (x){as.integer(substr(x,1,2))} )>=10, 'normal' ,'tumor' ) cluster_df <- cluster_df[which (cluster_type_group=='tumor' ),] clin <- left_join(clin,cluster_df,'ID' ,multiple='first' )
再把NMF分类的结果添加进来,就是cluster那一列
# 7. 过滤生存数据,不要<30天的,再把天数转换成月 clin <- clin[clin$OS .time>30,] clin$OS .time <- clin$OS .time / 30# 8. exp的列名和临床信息的行名一一对应。NMF将样本分成若干个亚型,画出亚型的生存曲线 colnames(exp) <- substr(colnames(exp),1,12) index <- match(clin$ID ,colnames(exp)) exp <- exp[,index] sfit <- survfit(Surv(OS.time,OS)~cluster,data=clin) ggsurvplot(sfit,pval = T) save(clin,file = 'clin.Rdata' )
4.3 免疫浸润 my_pallet <- colorRampPalette(brewer.pal(8,'Set1' )) lm22f = system.file("extdata" , "LM22.txt" , package = "CIBERSORT" ) cluster1 <- exp[,rownames(cluster_df[cluster_df$cluster ==1,])] cluster2 <- exp[,rownames(cluster_df[cluster_df$cluster ==2,])] TME.result1 <- cibersort(lm22f,as.matrix(cluster1),perm = 100,QN=T) # 耗时很久 TME.result2 <- cibersort(lm22f,as.matrix(cluster2),perm = 100,QN=T)
原文中免疫浸润用MCPcounter做的,这里我用CIBERSORT做,运行完的结果就是各种免疫细胞在每个样本中的含量
result1 <- TME.result1[,-c(23:25)] %>% as.data.frame() %>% rownames_to_column('Sample' ) %>% gather(key = cell_type,value = proportion,-Sample) result2 <- TME.result2[,-c(23:25)] %>% as.data.frame() %>% rownames_to_column('Sample' ) %>% gather(key = cell_type,value = proportion,-Sample) result1$cluster <- rep('cluster1' , times =nrow(result1)) result2$cluster <- rep('cluster2' , times =nrow(result2)) dat <- rbind(result1,result2) ggplot(dat,aes(axis1 = cluster, axis2 = cell_type))+ geom_alluvium(aes(fill = as.factor(cell_type)),width = 2/5, discern = FALSE)+ geom_stratum(width = 2/5, discern = FALSE)+ geom_text(stat = "stratum" , discern = FALSE,aes(label = after_stat(stratum)))+ theme(legend.position = "none" ,#去除刻度线和背景颜色 panel.background = element_blank(), axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_text(size =15,colour = "black" ),#坐标轴名 axis.title = element_blank()) + scale_x_discrete(position = "top" )
把上面的透视表转换成长表,添加上cluster的信息,再把cluster1和cluster2对应的长表合并起来,就可以画图了
cell <- 'Plasma cells' ggviolin(dat[dat$cell_type ==cell,],x='cluster' ,y='proportion' ,fill = 'cluster' , add = 'boxplot' ,add.params = list(fill = 'white' ,width=0.1,linetype=1),title = cell)+ stat_compare_means(method = 't.test' ,label = 'p.format' ,aes(label = 'p.format' ), label.x.npc = 'center' ,label.y = 0.5,size=5)
4.4 批量单因素Cox回归 load('TCGA_WGCNA_hub_gene.Rdata' ) load('TCGA_LUAD_TPM_exp.Rdata' ) load('clin.Rdata' ) hubgene_exp <- exp[hub_gene,] hubgene_exp <- as.data.frame(t(hubgene_exp)) hubgene_exp$ID <- substr(rownames(hubgene_exp),1,12) clin <- left_join(clin,hubgene_exp,by='ID' )
把WGCNA中找到的hub gene挑出来,拿到他们在所有样本中的表达值。把表达值和生存信息合并到一起,最后的clin长这样。在合并之前要注意,删掉normal样本,我这里没有做。
res.cox <- coxph(Surv(OS.time,OS)~SAPCD2,data=clin) sumres.cox <- summary(res.cox)
挑一个基因SAPCD2做单因素cox回归,最后需要的信息都藏在sumres.cox里
# 批量单因素cox回归 gene <- c() p_value <- c() HR <- c() lower95 <- c() upper95 <- c()for (i in 6:ncol(clin)){ res <- coxph(Surv(OS.time,OS)~clin[,i],data=clin) sum_res <- summary(res) p <- sum_res$coefficients [,'Pr(>|z|)' ] if (p<0.05){ gene <- c(gene,colnames(clin)[i]) p_value <- c(p_value,p) HR <- c(HR,sum_res$conf .int[,'exp(coef)' ]) lower95 <- c(lower95,sum_res$conf .int[,'lower .95' ]) upper95 <- c(upper95,sum_res$conf .int[,'upper .95' ]) } } cox_df <- data.frame(row.names = gene,pvalue=p_value,HR=HR,lower.95=lower95,upper.95=upper95) lasso_input_gene <- rownames(cox_df[cox_df$pvalue <0.01,]) save(lasso_input_gene,file = 'Lasso_input_gene.Rdata' )
批量单因素cox回归得到每个gene的pvalue,挑出pvalue<0.01的再进行后续筛选
Step 5. 预后模型构建和验证 load('Lasso_input_gene.Rdata' ) load('TCGA_LUAD_TPM_exp.Rdata' ) load('clin.Rdata' )# 对表达矩阵取log2 exp <- log2(exp+0.00001) exp <- as.data.frame(exp)# Lasso回归进一步筛选gene exp2 <- exp[lasso_input_gene,] exp2 <- as.data.frame(t(exp2)) exp2$ID <- substr(rownames(exp2),1,12) exp2 <- exp2[!duplicated(exp2$ID ),] clin <- clin[,1:3] exp3 <- left_join(clin,exp2,by='ID' ) rownames(exp3) <- exp3$ID exp3 <- exp3[,-1] exp3 <- na.omit(exp3)
exp3是生存数据和上一步批量单因素Cox回归筛选出来的基因的表达量合并起来的结果
x <- as.matrix(exp3[,3:(ncol(exp3)-1)]) y <- Surv(exp3$OS .time,exp3$OS ) fit1 <- glmnet(x,y,alpha = 1,family = 'cox' ) plot(fit1) cv_fit <- cv.glmnet(x,y,alpha = 1,nfolds = 10,family="cox" ) plot(cv_fit) best_lambda <- cv_fit$lambda .min fit2 <- glmnet(x,y,lambda = best_lambda,alpha = 1,family = 'cox' ) lasso_filter_gene <- names(fit2$beta [fit2$beta [,1]!=0,1])
由于之前筛选出来的gene效果不怎么好, 经过lasso过滤后,只剩2个gene了,就用这两个来做多因素cox回归
# 多因素cox回归构建预后模型 exp4 <- exp3 %>% dplyr::select(OS,OS.time,lasso_filter_gene) multicox <- coxph(Surv(OS.time,OS)~.,data = exp4) sum_multicox <- summary(multicox) riskScore <- predict(multicox,type = 'risk' ,newdata = exp4) riskScore <- as.data.frame(riskScore) riskScore$ID <- rownames(riskScore) riskScore$risk <- ifelse(riskScore$riskScore >median(riskScore$riskScore ),'high' ,'low' )
做完多因素回归后,计算每个样本的riskscore,用riskScore把样本分成两类
# KM plot km_data <- left_join(riskScore,clin,by='ID' ) fit <- survfit(Surv(OS.time,OS)~risk,data=km_data) ggsurvplot(fit,data = km_data,pval = T,risk.table = T,surv.median.line = 'hv' , legend.title = 'RiskScore' ,title = 'Overall survival' , ylab='Cummulative survival' ,xlab='Time(Days)' )
# 绘制ROC曲线 roc_data <- data.frame(x = roc$FP [,1], y = roc$TP [,1], time = roc$times [1])for (i in 2:length(roc$times )) { temp <- data.frame(x = roc$FP [,i], y = roc$TP [,i], time = roc$times [i]) roc_data <- rbind(roc_data, temp) } ggplot(roc_data, aes(x = x, y = y, color = as.factor(time))) + geom_line() + scale_color_manual(values = c("gray" , "blue" , "black" )) + geom_abline(slope = 1, intercept = 0, linetype = "dashed" ) + xlab("1 - Specificity" ) + ylab("Sensitivity" ) + ggtitle("ROC Curve" ) + theme_minimal()
# 绘制风险图 ggrisk(multicox)
# risk score与临床病理特征的相关性 load('TCGA_LUAD_Clinical.Rdata' ) clinical_df <- clinical[,c('bcr_patient_barcode' ,'gender' ,'stage_event_pathologic_stage' )] clinical_df <- merge(km_data,clinical_df,by.x = 'ID' ,by.y = 'bcr_patient_barcode' ) clinical_df <- clinical_df[!duplicated(clinical_df$ID ),] clinical_df$stage_event_pathologic_stage <- case_when(clinical_df$stage_event_pathologic_stage =='Stage IA' ~'Stage I' , clinical_df$stage_event_pathologic_stage =='Stage IB' ~'Stage I' , clinical_df$stage_event_pathologic_stage =='Stage IIA' ~'Stage II' , clinical_df$stage_event_pathologic_stage =='Stage IIB' ~'Stage II' , clinical_df$stage_event_pathologic_stage =='Stage IIIA' ~'Stage III' , clinical_df$stage_event_pathologic_stage =='Stage IIIB' ~'Stage III' , ) ggviolin(clinical_df,x='gender' ,y='riskScore' ,fill = 'gender' , add = 'boxplot' ,add.params = list(fill = 'white' ,width=0.1,linetype=1),)+ stat_compare_means(method = 'wilcox.test' ,label = 'p.format' ,aes(label = 'p.format' ), label.x.npc = 'center' ,label.y = 20,size=5) ggviolin(clinical_df,x='stage_event_pathologic_stage' ,y='riskScore' ,fill = 'stage_event_pathologic_stage' , add = 'boxplot' ,add.params = list(fill = 'white' ,width=0.1,linetype=1),)+ stat_compare_means(method = 'wilcox.test' ,label = 'p.signif' ,aes(label = 'p.format' ), label.x = 1,label.y = 20,size=5)
Step 6. riskScore和临床信息的相关性 # 森林图 # 批量单因素回归 df_forest <- clinical_df df_forest$gender <- as.numeric(as.factor(df_forest$gender )) df_forest$stage_event_pathologic_stage <- as.numeric(as.factor(df_forest$stage_event_pathologic_stage )) gene <- c() p_value <- c() HR <- c() lower95 <- c() upper95 <- c() means <- c()for (i in c('riskScore' ,'gender' ,'stage_event_pathologic_stage' )){ res <- coxph(Surv(OS.time,OS)~df_forest[,i],data=df_forest) sum_res <- summary(res) p <- sum_res$coefficients [,'Pr(>|z|)' ] gene <- c(gene,i) p_value <- c(p_value,p) HR <- c(HR,sum_res$conf .int[,'exp(coef)' ]) lower95 <- c(lower95,sum_res$conf .int[,'lower .95' ]) upper95 <- c(upper95,sum_res$conf .int[,'upper .95' ]) means <- c(means,res$means ) } cox_df <- data.frame(row.names = gene,pvalue=p_value,HR=HR,lower.95=lower95,upper.95=upper95,means=means) cox_df$pvalue <- round(cox_df$pvalue ,3) cox_df$HR <- round(cox_df$HR ,3) cox_df <- rownames_to_column(cox_df,var = 'Variable' ) cox_df <- rbind(colnames(cox_df),cox_df) forestplot(labeltext=as.matrix(cox_df)[,1:3],lower=as.numeric(cox_df[,4]), upper=as.numeric(cox_df[,5]),mean=as.numeric(cox_df[,6]), is.summary=c(T,F,F,F), zero=0,lineheight=unit(0.5,'cm' ),graphwidth=unit(35,'mm' ), colgap=unit(2,'mm' ),lwd.zero=2,lwd.ci=3, col=fpColors(box = '#7AC5CD' ,lines = 'black' ,zero = 'purple' ), lwd.xaxis=2,lty.ci='solid' ,ci.vertices.height=0.05,graph.pos=2, xlim=c(0,2.5),box.size=0.5, hrzl_lines=list('2' =gpar(lwd=3,columns=1:4,col='dark green' ), '5' =gpar(lwd=3,columns=1:4,col='dark green' )) )# 多因素回归 res <- coxph(Surv(OS.time,OS)~riskScore+gender+stage_event_pathologic_stage,data=df_forest) sum_res <- summary(res) df <- data.frame(row.names = rownames(sum_res$conf .int),pvalue=sum_res$coefficients [,'Pr(>|z|)' ], HR=sum_res$conf .int[,'exp(coef)' ],lower=sum_res$conf .int[,'lower .95' ], upper=sum_res$conf .int[,'upper .95' ],means=res$means ) df$pvalue <- round(df$pvalue ,3) df$HR <- round(df$HR ,3) df <- rownames_to_column(df,var='Variable' ) df <- rbind(colnames(df),df) forestplot(labeltext=as.matrix(df)[,1:3],lower=as.numeric(df[,4]), upper=as.numeric(df[,5]),mean=as.numeric(df[,6]), is.summary=c(T,F,F,F), zero=0,lineheight=unit(0.5,'cm' ),graphwidth=unit(35,'mm' ), colgap=unit(2,'mm' ),lwd.zero=2,lwd.ci=3, col=fpColors(box = '#7AC5CD' ,lines = 'black' ,zero = 'purple' ), lwd.xaxis=2,lty.ci='solid' ,ci.vertices.height=0.05,graph.pos=2, xlim=c(0,2.5),box.size=0.5, hrzl_lines=list('2' =gpar(lwd=3,columns=1:4,col='dark green' ), '5' =gpar(lwd=3,columns=1:4,col='dark green' )) )
# GSEA df_exp <- as.data.frame(t(exp)) df_exp <- rownames_to_column(df_exp,var = 'sample_ID' ) df_exp$sample_ID <- substr(df_exp$sample_ID ,1,12) df_exp <- df_exp[!duplicated(df_exp$sample_ID ),] rownames(df_exp) <- df_exp$sample_ID df_exp <- as.data.frame(t(df_exp[,-1])) df_exp <- df_exp[,km_data$ID ] group <- as.factor(km_data$risk ) design <- model.matrix(~km_data$risk ) fit <- lmFit(df_exp,design) fit <- eBayes(fit) df_deg <- topTable(fit,coef = 2,number = Inf) df_deg$change <- ifelse(df_deg$logFC >log2(2) & df_deg$adj .P.Val<0.05,'up' , ifelse(df_deg$logFC <log2(0.5) & df_deg$adj .P.Val<0.05,'down' ,'nochange' )) df_deg <- df_deg[order(df_deg$logFC ,decreasing = T),] symbol_2_entrez <- bitr(rownames(df_deg),fromType = 'SYMBOL' ,toType = 'ENTREZID' ,OrgDb = org.Hs.eg.db) symbol_2_entrez <- symbol_2_entrez[!duplicated(symbol_2_entrez$SYMBOL ),] rownames(symbol_2_entrez) <- symbol_2_entrez$SYMBOL df_deg$entrezid <- symbol_2_entrez[rownames(df_deg),'ENTREZID' ] df_deg <- df_deg[!duplicated(df_deg$entrezid ),] df_deg <- na.omit(df_deg) a <- df_deg[(df_deg$change =='up' ),'logFC' ] names(a) <- df_deg[(df_deg$change =='up' ),'entrezid' ] res_a <- gseKEGG(a,'hsa' ) b <- df_deg[(df_deg$change =='down' ),'logFC' ] names(b) <- df_deg[(df_deg$change =='down' ),'entrezid' ] res_b <- gseKEGG(b,'hsa' ) gseaplot2(res_a,geneSetID = 1:5) gseaplot2(res_b,geneSetID = 1:5)
先做差异分析,找到high risk vs low risk的差异基因,上调和下调的gene分别拿去做GSEA再画图
# waterfall plot query <- GDCquery(project = 'TCGA-LUAD' , data.category = "Simple Nucleotide Variation" , data.type = "Masked Somatic Mutation" , access = 'open' ) GDCdownload(query) GDCprepare(query,save = T,save.filename = 'TCGA-LUAD-SNP.Rdata' ) load('TCGA-LUAD-SNP.Rdata' ) luad_snp <- read.maf(maf=data) plotmafSummary(luad_snp,rmOutlier = TRUE, addStat = 'median' , dashboard = TRUE, titvRaw = FALSE) oncoplot(luad_snp,top = 15)
Step 7. 高低风险组间的免疫浸润 library(GSVA) library(GSEABase) library(ggplot2) library(tidyr) rm(list = ls())#1. 获取geneSets # 下载28种免疫细胞的参考基因集 <http://cis./TISIDB/data/download/CellReports.txt> geneSet <- read.csv("cellReport.txt" ,header = F,sep = "\t" ,) geneSet[1:5,1:5] geneSet <- t(tibble::column_to_rownames(geneSet,var = 'V1' )) gene_list <- list()for (i in colnames(geneSet)){ gene_list[[i]] <- geneSet[,i] gene_list[[i]] <-gene_list[[i]][gene_list[[i]]!='' ] }
直接读取的geneSet长这样,有空值,需要转变成gene_list,list长度为28,其中每个元素是某个细胞所有基因的向量,名字是对应的细胞
# 载入表达矩阵,把表达矩阵分成high risk 和low risk组 load('TCGA_LUAD_TPM_exp.Rdata' ) load('risk.Rdata' )# 对表达矩阵取log2 exp <- log2(exp+0.00001) exp <- as.data.frame(exp) colnames(exp) <- substr(colnames(exp),1,12) exp <- exp[,!duplicated(colnames(exp))] high_risk_sample <- km_data[km_data$risk =='high' ,'ID' ] low_risk_sample <- km_data[km_data$risk =='low' ,'ID' ] high_risk_exp <- exp[,high_risk_sample] low_risk_exp <- exp[,low_risk_sample]# ssGSEA high_risk_gsea <- gsva(high_risk_exp,gene_list,method='ssgsea' ,kcdf='Gaussian' ,abs.ranking=T) low_risk_gsea <- gsva(low_risk_exp,gene_list,method='ssgsea' ,kcdf='Gaussian' ,abs.ranking=T)
要注意gsva函数的输入得是matrix,不能是data.frame,不然会报错
hr_gsea_norm <- high_risk_gseafor (i in colnames(hr_gsea_norm)) { diff <- max(hr_gsea_norm[,i] )-min(hr_gsea_norm[,i] ) hr_gsea_norm[,i] <- (hr_gsea_norm[,i] -min(hr_gsea_norm[,i]))/diff } hr_gsea_norm <- as.data.frame(t(hr_gsea_norm)) hr_gsea_norm$type <- rep('high' ,times =nrow(hr_gsea_norm)) lr_gsea_norm <- low_risk_gseafor (i in colnames(lr_gsea_norm)) { diff <- max(lr_gsea_norm[,i] )-min(lr_gsea_norm[,i]) lr_gsea_norm[,i] <- (lr_gsea_norm[,i] -min(lr_gsea_norm[,i]))/diff } lr_gsea_norm <- as.data.frame(t(lr_gsea_norm)) lr_gsea_norm$type <- rep('low' ,times =nrow(lr_gsea_norm)) gsea_norm <- rbind(hr_gsea_norm,lr_gsea_norm) longer <- pivot_longer(gsea_norm,cols = !type ,names_to = 'cell_type' ,values_to = 'Score' ) ggplot(longer,aes(x=cell_type,y=Score,color=type ))+geom_boxplot()+ theme(axis.text.x = element_text(angle = 45,hjust = 1))+ stat_compare_means(label = "p.signif" ,size=3,method = 'wilcox.test' )
先把high risk gsea和low risk gsea的数据分别进行min max scale,新增一列标记上high和low,然后合并起来,行是样本,列是细胞。再把透视表转成长表,就可以画图了
# 免疫检查点&TBM risk_exp <- as.data.frame(t(exp)) risk_exp <- risk_exp[km_data$ID ,] risk_exp$type <- km_data$risk risk_exp <- risk_exp[,c('CD274' ,'PDCD1' ,'CTLA4' ,'LAG3' ,'TIGIT' ,'type' )] risk_long <- pivot_longer(risk_exp,cols = !type ,names_to = 'ICIs' ,values_to = 'Expression' ) ggplot(risk_long,aes(x=ICIs,y=Expression,fill=type ))+geom_boxplot()+ stat_compare_means(label = "p.format" ,size=3,method = 'wilcox.test' )