上一篇讲了单个样本的单细胞标准分析流程,但是通常我们做实验不会只测一个样本的,这时候就需要把多个样本整合起来。
下面是3种常见的整合方法,不过都是从Seurat
对象开始的,如果你是从GEO等数据库下载的数据,需要先自己创建Seurat
对象,然后数据质控,选择合适的细胞,再进行整合。
目录:
多个单细胞数据集的整合之CCA library (Seurat)## Attaching SeuratObject ## Attaching sp library (SeuratData)## ── Installed datasets ─────────────────────────── SeuratData v0.2.2 ── ## ✔ ifnb 3.1.0 ## ───────────────────────────────── Key ──────────────────────────────── ## ✔ Dataset loaded successfully ## ❯ Dataset built with a newer version of Seurat than installed ## ❓ Unknown version of Seurat installed library (patchwork)
这里我们使用SeuratData
包自带的数据。这个数据有两个样本,一个是对照组,另一个是使用了干扰素刺激的。
# 安装数据,如果不成功可以下载后本地安装,报错信息里有下载地址 InstallData("ifnb" )
使用示例数据集,如果是从表达矩阵开始,就是之前介绍过的,需要自己建立seurat
对象。
# 载入数据 ifnb <- LoadData("ifnb" )# 拆分数据,变为两个seurat对象 ifnb.list <- SplitObject(ifnb, split.by = "stim" ) ifnb.list## $CTRL ## An object of class Seurat ## 14053 features across 6548 samples within 1 assay ## Active assay: RNA (14053 features, 0 variable features) ## ## $STIM ## An object of class Seurat ## 14053 features across 7451 samples within 1 assay ## Active assay: RNA (14053 features, 0 variable features)
现在我们有了两个seurat
对象,需要进行合并。
下面使用seurat
的合并方法:CCA。大家可以去百度了解下技术原理,这里就不多说了。
# 首先对每个对象单独进行标准化及找出高变基因 ifnb.list <- lapply(X = ifnb.list, FUN = function (x) { x <- NormalizeData(x) x <- FindVariableFeatures(x, selection.method = "vst" , nfeatures = 2000 ) })# 选择相同的高变基因 features <- SelectIntegrationFeatures(object.list = ifnb.list)
然后进行整合,这一步比较费时间~
# 找锚点 immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)## Scaling features for provided objects ## Finding all pairwise anchors ## Running CCA ## Merging objects ## Finding neighborhoods ## Finding anchors ## Found 16393 anchors ## Filtering anchors ## Retained 6756 anchors # 整合 immune.combined <- IntegrateData(anchorset = immune.anchors)## Merging dataset 1 into 2 ## Extracting anchors for merged samples ## Finding integration vectors ## Finding integration vector weights ## Integrating data
整合后的数据是这样的:
immune.combined## An object of class Seurat ## 16053 features across 13999 samples within 2 assays ## Active assay: integrated (2000 features, 2000 variable features) ## 1 other assay present: RNA
整合之后的分析 就是之前介绍的标准分析流程。
# 选择整合好之后的数据进行下游分析,原始数据也在这个对象中,可以试试看两种数据的差别 DefaultAssay(immune.combined) <- "integrated" #DefaultAssay(immune.combined) <- "RNA" # 原始数据 # 就是之前介绍的标准流程!中间很多可视化的过程省略了 immune.combined <- ScaleData(immune.combined, verbose = FALSE ) immune.combined <- RunPCA(immune.combined, npcs = 30 , verbose = FALSE ) immune.combined <- RunUMAP(immune.combined, reduction = "pca" , dims = 1 :30 )## 17:08:36 UMAP embedding parameters a = 0.9922 b = 1.112 ## 17:08:36 Read 13999 rows and found 30 numeric columns ## 17:08:36 Using Annoy for neighbor search, n_neighbors = 30 ## 17:08:36 Building Annoy index with metric = cosine, n_trees = 50 ## 0% 10 20 30 40 50 60 70 80 90 100% ## [----|----|----|----|----|----|----|----|----|----| ## **************************************************| ## 17:08:37 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a41045c37 ## 17:08:37 Searching Annoy index using 1 thread, search_k = 3000 ## 17:08:39 Annoy recall = 100% ## 17:08:40 Commencing smooth kNN distance calibration using 1 thread ## 17:08:40 Initializing from normalized Laplacian + noise ## 17:08:41 Commencing optimization for 200 epochs, with 617860 positive edges ## 17:08:53 Optimization finished immune.combined <- FindNeighbors(immune.combined, reduction = "pca" , dims = 1 :30 )## Computing nearest neighbor graph ## Computing SNN immune.combined <- FindClusters(immune.combined, resolution = 0.5 )## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck ## ## Number of nodes: 13999 ## Number of edges: 569052 ## ## Running Louvain algorithm... ## Maximum modularity in 10 random starts: 0.9057 ## Number of communities: 15 ## Elapsed time: 1 seconds
保存下结果,然后就可以进行各种后续的分析了,比如差异分析、细胞亚群注释、细胞通讯、拟时序分析、转录因子等等。
saveRDS(immune.combined, file = "../000files/immune_combined.rds" )
看看有没有批次效应和降维结果:
# 降维结果 p1 <- DimPlot(immune.combined, reduction = "umap" , group.by = "stim" ) p2 <- DimPlot(immune.combined, reduction = "umap" , label = TRUE , repel = TRUE ) p1 + p2
plot of chunk unnamed-chunk-9 然后是Identify conserved cell type markers
“ FindMarkers()寻找的是指定两组细胞之间差异表达的基因;而FindConservedMarkers()分析的是某一个ident群两组细胞之间都保守表达的基因。
# 使用原数据 DefaultAssay(immune.combined) <- "RNA" nk.markers <- FindConservedMarkers(immune.combined, ident.1 = 6 , grouping.var = "stim" , verbose = FALSE ) head(nk.markers)## CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj ## GNLY 0 6.006173 0.944 0.045 0 ## FGFBP2 0 3.243588 0.505 0.020 0 ## CLIC3 0 3.461957 0.597 0.024 0 ## PRF1 0 2.650548 0.422 0.017 0 ## CTSW 0 2.987507 0.531 0.029 0 ## KLRD1 0 2.777231 0.495 0.019 0 ## STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj ## GNLY 0.000000e+00 5.858634 0.954 0.059 0.000000e+00 ## FGFBP2 3.408448e-165 2.191113 0.261 0.015 4.789892e-161 ## CLIC3 0.000000e+00 3.536367 0.623 0.030 0.000000e+00 ## PRF1 0.000000e+00 4.094579 0.862 0.057 0.000000e+00 ## CTSW 0.000000e+00 3.128054 0.592 0.035 0.000000e+00 ## KLRD1 0.000000e+00 2.863797 0.552 0.027 0.000000e+00 ## max_pval minimump_p_val ## GNLY 0.000000e+00 0 ## FGFBP2 3.408448e-165 0 ## CLIC3 0.000000e+00 0 ## PRF1 0.000000e+00 0 ## CTSW 0.000000e+00 0 ## KLRD1 0.000000e+00 0
画图展示:
FeaturePlot(immune.combined, features = c("CD3D" , "SELL" , "CREM" , "CD8A" , "GNLY" , "CD79A" , "FCGR3A" ,"CCL2" , "PPBP" ), min.cutoff = "q9" ) # 分位数
plot of chunk unnamed-chunk-12 重命名类群ID,重新画图:
# 现在还都是数字,没有注释亚群 table(Idents(immune.combined))
immune.combined <- RenameIdents(immune.combined, `0` = "CD14 Mono" , `1` = "CD4 Naive T" , `2` = "CD4 Memory T" , `3` = "CD16 Mono" , `4` = "B" , `5` = "CD8 T" , `6` = "NK" , `7` = "T activated" , `8` = "DC" , `9` = "B Activated" , `10` = "Mk" , `11` = "pDC" , `12` = "Eryth" , `13` = "Mono/Mk Doublets" , `14` = "HSPC" ) DimPlot(immune.combined, label = TRUE )
plot of chunk unnamed-chunk-13 气泡图展示:
Idents(immune.combined) <- factor(Idents(immune.combined), levels = c("HSPC" , "Mono/Mk Doublets" ,"pDC" , "Eryth" , "Mk" , "DC" , "CD14 Mono" , "CD16 Mono" , "B Activated" , "B" , "CD8 T" , "NK" , "T activated" ,"CD4 Naive T" , "CD4 Memory T" ) )# 选择要展示的基因 markers.to.plot <- c("CD3D" , "CREM" , "HSPH1" , "SELL" , "GIMAP5" , "CACYBP" , "GNLY" , "NKG7" , "CCL5" ,"CD8A" , "MS4A1" , "CD79A" , "MIR155HG" , "NME1" , "FCGR3A" , "VMO1" , "CCL2" , "S100A9" , "HLA-DQA1" ,"GPR183" , "PPBP" , "GNG11" , "HBA2" , "HBB" , "TSPAN13" , "IL3RA" , "IGJ" , "PRSS57" ) DotPlot(immune.combined, features = markers.to.plot, cols = c("blue" , "red" ), dot.scale = 8 , split.by = "stim" ) + RotatedAxis()
plot of chunk unnamed-chunk-14 上面是找conserved gene,下面才是常见的找差异基因。
# 先展示下某些基因在两组中的表达量 library (ggplot2)library (cowplot) theme_set(theme_cowplot())# 选择CD4 Naive T这个亚群继续探索 t.cells <- subset(immune.combined, idents = "CD4 Naive T" ) Idents(t.cells) <- "stim" # 亚群名字改一下 # 表达矩阵预处理一下 avg.t.cells <- as.data.frame(log1p(AverageExpression(t.cells, verbose = FALSE )$RNA)) # 行名变成一列 avg.t.cells$gene <- rownames(avg.t.cells)# 再选择CD14 Mono这个亚群 cd14.mono <- subset(immune.combined, idents = "CD14 Mono" ) Idents(cd14.mono) <- "stim" avg.cd14.mono <- as.data.frame(log1p(AverageExpression(cd14.mono, verbose = FALSE )$RNA)) avg.cd14.mono$gene <- rownames(avg.cd14.mono)# 挑选要展示的基因,挑选的这几个都是对干扰素刺激有明显变化的基因(这需要平时自己积累) genes.to.label = c("ISG15" , "LY6E" , "IFI6" , "ISG20" , "MX1" , "IFIT2" , "IFIT1" , "CXCL10" , "CCL8" )# 画图 p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) + geom_point() + ggtitle("CD4 Naive T Cells" ) p1 <- LabelPoints(plot = p1, points = genes.to.label, repel = TRUE ) p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) + geom_point() + ggtitle("CD14 Monocytes" ) p2 <- LabelPoints(plot = p2, points = genes.to.label, repel = TRUE ) p1 + p2
plot of chunk unnamed-chunk-15 上面这幅图可以看出,我们挑选的这几个基因在stim组是高表达的,不管是CD4 Naive T亚群还是CD14 Mono都是如此。
假设现在我们已经确定在两组间的细胞类型都是一样的,现在我们想要找出在B细胞(B cell)中两组间(stim和ctrl)的差异基因。
# 首先增加一栏同时包含样本类型(stim和ctrl)和细胞类型的信息 immune.combined$celltype.stim <- paste(Idents(immune.combined), immune.combined$stim, sep = "_" )# 再增加一栏只包含细胞类型的信息 immune.combined$celltype <- Idents(immune.combined)# 把idents换成celltype.stim这一栏信息 Idents(immune.combined) <- "celltype.stim" # 这样就可以寻找任一细胞类型在两种组别(stim和ctrl)之间的差异基因了 b.interferon.response <- FindMarkers(immune.combined, ident.1 = "B_STIM" , ident.2 = "B_CTRL" , verbose = FALSE ) head(b.interferon.response, n = 15 )## p_val avg_log2FC pct.1 pct.2 p_val_adj ## IFIT1 3.210958e-189 4.803316 1.000 0.093 4.512360e-185 ## ISG15 3.942443e-178 5.231845 1.000 0.486 5.540315e-174 ## IFIT3 4.460609e-177 3.903763 0.993 0.312 6.268494e-173 ## ISG20 3.273682e-176 3.836903 1.000 0.446 4.600505e-172 ## IFITM3 1.865615e-173 3.018934 1.000 0.647 2.621748e-169 ## IFIT2 3.125506e-170 3.738296 0.973 0.167 4.392273e-166 ## IFI6 7.326993e-170 2.992315 0.998 0.374 1.029662e-165 ## LY6E 2.335359e-169 3.007485 1.000 0.417 3.281880e-165 ## RSAD2 2.442390e-169 3.648253 0.955 0.081 3.432291e-165 ## TNFSF10 6.793514e-168 3.187577 1.000 0.483 9.546925e-164 ## MX1 6.173675e-166 3.232834 0.973 0.165 8.675865e-162 ## APOBEC3A 2.987551e-164 3.483479 0.996 0.403 4.198406e-160 ## CXCL10 4.695421e-160 4.444529 0.986 0.275 6.598475e-156 ## OASL 1.975803e-158 3.065838 0.957 0.182 2.776596e-154 ## IRF7 4.890417e-149 2.576510 0.980 0.463 6.872503e-145
这些基因和我们在上一步中挑选的基因是不是基本一样?
画图展示,可以看出CD3D和GNLY在两组间表达差异不大,而IF16差异很大。
FeaturePlot(immune.combined, features = c("CD3D" , "GNLY" , "IFI6" ), # CD3D和GNLY对干扰素没反应哦~ split.by = "stim" , max.cutoff = 3 , cols = c("grey" , "red" ))
plot of chunk unnamed-chunk-17 小提琴图也可以展示类似的效果:
plots <- VlnPlot(immune.combined, features = c("LYZ" , "ISG15" , "CXCL10" ), split.by = "stim" , group.by = "celltype" , pt.size = 0 , combine = FALSE )## The default behaviour of split.by has changed. ## Separate violin plots are now plotted side-by-side. ## To restore the old behaviour of a single split violin, ## set split.plot = TRUE. ## ## This message will be shown once per session. wrap_plots(plots = plots, ncol = 1 )
plot of chunk unnamed-chunk-18 多个单细胞数据集的整合之CCA+SCTransform 和上面的步骤基本一样,就是把NormalizeData/FindVariableFeatures/ScaleData
合成一步:SCTransform
。
rm(list = ls()) ifnb <- LoadData("ifnb" ) ifnb.list <- SplitObject(ifnb, split.by = "stim" )# 这一步代替了NormalizeData/FindVariableFeatures/ScaleData,非常费时间 ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)## Calculating cell attributes from input UMI matrix: log_umi ## Variance stabilizing transformation of count matrix of size 12747 by 6548 ## Model formula is y ~ log_umi ## Get Negative Binomial regression parameters per gene ## Using 2000 genes, 5000 cells ## |======================================================================| 100 %## Found 77 outliers - those will be ignored in fitting/regularization step ## Second step: Get residuals using fitted parameters for 12747 genes ## |======================================================================| 100 %## Computing corrected count matrix for 12747 genes ## |======================================================================| 100 %## Calculating gene attributes ## Wall clock passed: Time difference of 1.038463 mins ## Determine variable features ## Place corrected count matrix in counts slot ## Centering data matrix ## Set default assay to SCT ## Calculating cell attributes from input UMI matrix: log_umi ## Variance stabilizing transformation of count matrix of size 12658 by 7451 ## Model formula is y ~ log_umi ## Get Negative Binomial regression parameters per gene ## Using 2000 genes, 5000 cells ## |======================================================================| 100 %## Found 73 outliers - those will be ignored in fitting/regularization step ## Second step: Get residuals using fitted parameters for 12658 genes ## |======================================================================| 100 %## Computing corrected count matrix for 12658 genes ## |======================================================================| 100 %## Calculating gene attributes ## Wall clock passed: Time difference of 1.075562 mins ## Determine variable features ## Place corrected count matrix in counts slot ## Centering data matrix ## Set default assay to SCT features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000 ) ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
然后也是找锚点,整合数据:
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT" , anchor.features = features)## Finding all pairwise anchors ## Running CCA ## Merging objects ## Finding neighborhoods ## Finding anchors ## Found 14831 anchors ## Filtering anchors ## Retained 11318 anchors immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT" )## Merging dataset 1 into 2 ## Extracting anchors for merged samples ## Finding integration vectors ## Finding integration vector weights ## Integrating data
再往下就是一模一样的步骤了:
# SCTransform包含scale,这里就不用再次scale了 immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE ) |> FindNeighbors(dims = 1 :15 ) |> FindClusters(resolution = 0.5 ) |> RunUMAP(dims=1 :15 ) |> RunTSNE(dims=1 :15 )## 17:13:19 UMAP embedding parameters a = 0.9922 b = 1.112 ## 17:13:19 Read 13999 rows and found 30 numeric columns ## 17:13:19 Using Annoy for neighbor search, n_neighbors = 30 ## 17:13:19 Building Annoy index with metric = cosine, n_trees = 50 ## 0% 10 20 30 40 50 60 70 80 90 100% ## [----|----|----|----|----|----|----|----|----|----| ## **************************************************| ## 17:13:21 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a46a83352 ## 17:13:21 Searching Annoy index using 1 thread, search_k = 3000 ## 17:13:23 Annoy recall = 100% ## 17:13:23 Commencing smooth kNN distance calibration using 1 thread ## 17:13:24 Initializing from normalized Laplacian + noise ## 17:13:24 Commencing optimization for 200 epochs, with 601594 positive edges ## 17:13:37 Optimization finished
画图展示效果:
p1 <- DimPlot(immune.combined.sct, reduction = "umap" , group.by = "stim" ) p2 <- DimPlot(immune.combined.sct, reduction = "umap" , group.by = "seurat_annotations" , label = TRUE ,repel = TRUE ) p1 + p2
plot of chunk unnamed-chunk-22 多个单细胞数据集的整合之harmony rm(list = ls()) ifnb <- LoadData("ifnb" ) ifnb.list <- SplitObject(ifnb, split.by = "stim" )
先使用merge()
合并。
ifnb.merge <- merge(x=ifnb.list[[1 ]], y=ifnb.list[-1 ], add.cell.ids= names(ifnb.list), project="ifnb.harmoy" ) head(colnames(ifnb.merge))## [1] "CTRL_AAACATACATTTCC.1" "CTRL_AAACATACCAGAAA.1" "CTRL_AAACATACCTCGCT.1" ## [4] "CTRL_AAACATACCTGGTA.1" "CTRL_AAACATACGATGAA.1" "CTRL_AAACATACGGCATT.1"
接下来也是NormalizeData/FindVariableFeatures/ScaleData,这几步用SCTransform
也是可以的:
# 这一步用SCT代替也可以 ifnb.merge <- ifnb.merge |> Seurat::NormalizeData() |> FindVariableFeatures(selection.method = "vst" , nfeatures = 2000 ) |> ScaleData() |> RunPCA()## Centering and scaling data matrix
然后再进行harmony整合样本:
library (harmony)## Loading required package: Rcpp ifnb.merge <- RunHarmony(ifnb.merge, group.by.vars = "stim" )## Harmony 1/10 ## Harmony 2/10 ## Harmony 3/10 ## Harmony 4/10 ## Harmony 5/10 ## Harmony 6/10 ## Harmony 7/10 ## Harmony 8/10 ## Harmony converged after 8 iterations names(ifnb.merge@reductions)## [1] "pca" "harmony"
后面又是标准步骤,聚类,分群。
ifnb.merge <- RunUMAP(ifnb.merge, dims = 1 :15 , reduction = "harmony" ) |> FindNeighbors(reduction = "harmony" , dims = 1 :15 ) |> FindClusters(resolution = 0.5 ) |> RunTSNE(dims=1 :15 ,reduction = "harmony" )## 17:14:09 UMAP embedding parameters a = 0.9922 b = 1.112 ## 17:14:09 Read 13999 rows and found 15 numeric columns ## 17:14:09 Using Annoy for neighbor search, n_neighbors = 30 ## 17:14:09 Building Annoy index with metric = cosine, n_trees = 50 ## 0% 10 20 30 40 50 60 70 80 90 100% ## [----|----|----|----|----|----|----|----|----|----| ## **************************************************| ## 17:14:10 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a4682a6add ## 17:14:10 Searching Annoy index using 1 thread, search_k = 3000 ## 17:14:12 Annoy recall = 100% ## 17:14:13 Commencing smooth kNN distance calibration using 1 thread ## 17:14:13 Initializing from normalized Laplacian + noise ## 17:14:14 Commencing optimization for 200 epochs, with 597860 positive edges ## 17:14:26 Optimization finished
画图展示效果:
p1 <- DimPlot(ifnb.harmony, reduction = "tsne" , group.by = "stim" ) p2 <- DimPlot(ifnb.harmony, reduction = "tsne" , group.by = "seurat_annotations" , label = TRUE ,repel = TRUE ) p1 + p2
image-20220815131322282 参考资料