【原】单细胞入门之多样本整合

阿越就是我 2023-10-12 发布于上海

展开全文

💡专注R语言在🩺生物医学中的使用

上一篇讲了单个样本的单细胞标准分析流程，但是通常我们做实验不会只测一个样本的，这时候就需要把多个样本整合起来。

下面是3种常见的整合方法，不过都是从Seurat对象开始的，如果你是从GEO等数据库下载的数据，需要先自己创建Seurat对象，然后数据质控，选择合适的细胞，再进行整合。

多个单细胞数据集的整合之CCA
整合之后的分析
多个单细胞数据集的整合之CCA+SCTransform
多个单细胞数据集的整合之harmony
参考资料

多个单细胞数据集的整合之CCA

library(Seurat)
## Attaching SeuratObject
## Attaching sp
library(SeuratData)
## ── Installed datasets ─────────────────────────── SeuratData v0.2.2 ──
## ✔ ifnb 3.1.0
## ───────────────────────────────── Key ────────────────────────────────
## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed
library(patchwork)

这里我们使用SeuratData包自带的数据。这个数据有两个样本，一个是对照组，另一个是使用了干扰素刺激的。

# 安装数据，如果不成功可以下载后本地安装，报错信息里有下载地址
InstallData("ifnb")

使用示例数据集，如果是从表达矩阵开始，就是之前介绍过的，需要自己建立seurat对象。

# 载入数据
ifnb <- LoadData("ifnb")

# 拆分数据，变为两个seurat对象
ifnb.list <- SplitObject(ifnb, split.by = "stim")

ifnb.list
## $CTRL
## An object of class Seurat 
## 14053 features across 6548 samples within 1 assay 
## Active assay: RNA (14053 features, 0 variable features)
## 
## $STIM
## An object of class Seurat 
## 14053 features across 7451 samples within 1 assay 
## Active assay: RNA (14053 features, 0 variable features)

现在我们有了两个seurat对象，需要进行合并。

下面使用seurat的合并方法：CCA。大家可以去百度了解下技术原理，这里就不多说了。

# 首先对每个对象单独进行标准化及找出高变基因
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# 选择相同的高变基因
features <- SelectIntegrationFeatures(object.list = ifnb.list)

然后进行整合，这一步比较费时间~

# 找锚点
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 16393 anchors
## Filtering anchors
##  Retained 6756 anchors
# 整合
immune.combined <- IntegrateData(anchorset = immune.anchors)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data

整合后的数据是这样的：

immune.combined
## An object of class Seurat 
## 16053 features across 13999 samples within 2 assays 
## Active assay: integrated (2000 features, 2000 variable features)
##  1 other assay present: RNA

整合之后的分析

就是之前介绍的标准分析流程。

# 选择整合好之后的数据进行下游分析，原始数据也在这个对象中，可以试试看两种数据的差别
DefaultAssay(immune.combined) <- "integrated"
#DefaultAssay(immune.combined) <- "RNA" # 原始数据

# 就是之前介绍的标准流程！中间很多可视化的过程省略了
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
## 17:08:36 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:08:36 Read 13999 rows and found 30 numeric columns
## 17:08:36 Using Annoy for neighbor search, n_neighbors = 30
## 17:08:36 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:08:37 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a41045c37
## 17:08:37 Searching Annoy index using 1 thread, search_k = 3000
## 17:08:39 Annoy recall = 100%
## 17:08:40 Commencing smooth kNN distance calibration using 1 thread
## 17:08:40 Initializing from normalized Laplacian + noise
## 17:08:41 Commencing optimization for 200 epochs, with 617860 positive edges
## 17:08:53 Optimization finished
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 13999
## Number of edges: 569052
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9057
## Number of communities: 15
## Elapsed time: 1 seconds

保存下结果，然后就可以进行各种后续的分析了，比如差异分析、细胞亚群注释、细胞通讯、拟时序分析、转录因子等等。

saveRDS(immune.combined, file = "../000files/immune_combined.rds")

看看有没有批次效应和降维结果：

# 降维结果
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
p1 + p2

然后是Identify conserved cell type markers

“
FindMarkers()寻找的是指定两组细胞之间差异表达的基因；而FindConservedMarkers()分析的是某一个ident群两组细胞之间都保守表达的基因。

# 使用原数据
DefaultAssay(immune.combined) <- "RNA"
nk.markers <- FindConservedMarkers(immune.combined, 
                                   ident.1 = 6, 
                                   grouping.var = "stim", 
                                   verbose = FALSE)
head(nk.markers)
##        CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
## GNLY            0        6.006173      0.944      0.045              0
## FGFBP2          0        3.243588      0.505      0.020              0
## CLIC3           0        3.461957      0.597      0.024              0
## PRF1            0        2.650548      0.422      0.017              0
## CTSW            0        2.987507      0.531      0.029              0
## KLRD1           0        2.777231      0.495      0.019              0
##           STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
## GNLY    0.000000e+00        5.858634      0.954      0.059   0.000000e+00
## FGFBP2 3.408448e-165        2.191113      0.261      0.015  4.789892e-161
## CLIC3   0.000000e+00        3.536367      0.623      0.030   0.000000e+00
## PRF1    0.000000e+00        4.094579      0.862      0.057   0.000000e+00
## CTSW    0.000000e+00        3.128054      0.592      0.035   0.000000e+00
## KLRD1   0.000000e+00        2.863797      0.552      0.027   0.000000e+00
##             max_pval minimump_p_val
## GNLY    0.000000e+00              0
## FGFBP2 3.408448e-165              0
## CLIC3   0.000000e+00              0
## PRF1    0.000000e+00              0
## CTSW    0.000000e+00              0
## KLRD1   0.000000e+00              0

画图展示：

FeaturePlot(immune.combined, 
            features = c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A","CCL2", "PPBP"), 
            min.cutoff = "q9") # 分位数

重命名类群ID，重新画图：

# 现在还都是数字，没有注释亚群
table(Idents(immune.combined))

immune.combined <- RenameIdents(immune.combined, 
                                `0` = "CD14 Mono", 
                                `1` = "CD4 Naive T", 
                                `2` = "CD4 Memory T",
                                `3` = "CD16 Mono", 
                                `4` = "B", 
                                `5` = "CD8 T", 
                                `6` = "NK", 
                                `7` = "T activated", 
                                `8` = "DC", 
                                `9` = "B Activated",
                                `10` = "Mk", 
                                `11` = "pDC", 
                                `12` = "Eryth", 
                                `13` = "Mono/Mk Doublets", 
                                `14` = "HSPC"
                                )
DimPlot(immune.combined, label = TRUE)

气泡图展示：

Idents(immune.combined) <- factor(Idents(immune.combined), 
                                  levels = c("HSPC", "Mono/Mk Doublets","pDC", "Eryth", "Mk", "DC", "CD14 Mono", "CD16 Mono", "B Activated", "B", "CD8 T", "NK", "T activated","CD4 Naive T", "CD4 Memory T")
                                  )

# 选择要展示的基因
markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", "NKG7", "CCL5","CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", "VMO1", "CCL2", "S100A9", "HLA-DQA1","GPR183", "PPBP", "GNG11", "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ", "PRSS57")

DotPlot(immune.combined, features = markers.to.plot, cols = c("blue", "red"), dot.scale = 8, split.by = "stim") +
  RotatedAxis()

上面是找conserved gene，下面才是常见的找差异基因。

# 先展示下某些基因在两组中的表达量

library(ggplot2)
library(cowplot)
theme_set(theme_cowplot())

# 选择CD4 Naive T这个亚群继续探索
t.cells <- subset(immune.combined, idents = "CD4 Naive T")

Idents(t.cells) <- "stim" # 亚群名字改一下

# 表达矩阵预处理一下
avg.t.cells <- as.data.frame(log1p(AverageExpression(t.cells, verbose = FALSE)$RNA)) 

# 行名变成一列
avg.t.cells$gene <- rownames(avg.t.cells)

# 再选择CD14 Mono这个亚群
cd14.mono <- subset(immune.combined, idents = "CD14 Mono")
Idents(cd14.mono) <- "stim"
avg.cd14.mono <- as.data.frame(log1p(AverageExpression(cd14.mono, verbose = FALSE)$RNA))
avg.cd14.mono$gene <- rownames(avg.cd14.mono)

# 挑选要展示的基因，挑选的这几个都是对干扰素刺激有明显变化的基因（这需要平时自己积累）
genes.to.label = c("ISG15", "LY6E", "IFI6", "ISG20", "MX1", "IFIT2", "IFIT1", "CXCL10", "CCL8")

# 画图
p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) + 
  geom_point() + 
  ggtitle("CD4 Naive T Cells")

p1 <- LabelPoints(plot = p1, 
                  points = genes.to.label, 
                  repel = TRUE)

p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) + 
  geom_point() + 
  ggtitle("CD14 Monocytes")

p2 <- LabelPoints(plot = p2, 
                  points = genes.to.label, 
                  repel = TRUE)
p1 + p2

上面这幅图可以看出，我们挑选的这几个基因在stim组是高表达的，不管是CD4 Naive T亚群还是CD14 Mono都是如此。

假设现在我们已经确定在两组间的细胞类型都是一样的，现在我们想要找出在B细胞（B cell）中两组间（stim和ctrl）的差异基因。

# 首先增加一栏同时包含样本类型（stim和ctrl）和细胞类型的信息
immune.combined$celltype.stim <- paste(Idents(immune.combined), 
                                       immune.combined$stim, 
                                       sep = "_")

# 再增加一栏只包含细胞类型的信息
immune.combined$celltype <- Idents(immune.combined)

# 把idents换成celltype.stim这一栏信息
Idents(immune.combined) <- "celltype.stim"

# 这样就可以寻找任一细胞类型在两种组别（stim和ctrl）之间的差异基因了
b.interferon.response <- FindMarkers(immune.combined, 
                                     ident.1 = "B_STIM", 
                                     ident.2 = "B_CTRL", 
                                     verbose = FALSE)

head(b.interferon.response, n = 15)
##                  p_val avg_log2FC pct.1 pct.2     p_val_adj
## IFIT1    3.210958e-189   4.803316 1.000 0.093 4.512360e-185
## ISG15    3.942443e-178   5.231845 1.000 0.486 5.540315e-174
## IFIT3    4.460609e-177   3.903763 0.993 0.312 6.268494e-173
## ISG20    3.273682e-176   3.836903 1.000 0.446 4.600505e-172
## IFITM3   1.865615e-173   3.018934 1.000 0.647 2.621748e-169
## IFIT2    3.125506e-170   3.738296 0.973 0.167 4.392273e-166
## IFI6     7.326993e-170   2.992315 0.998 0.374 1.029662e-165
## LY6E     2.335359e-169   3.007485 1.000 0.417 3.281880e-165
## RSAD2    2.442390e-169   3.648253 0.955 0.081 3.432291e-165
## TNFSF10  6.793514e-168   3.187577 1.000 0.483 9.546925e-164
## MX1      6.173675e-166   3.232834 0.973 0.165 8.675865e-162
## APOBEC3A 2.987551e-164   3.483479 0.996 0.403 4.198406e-160
## CXCL10   4.695421e-160   4.444529 0.986 0.275 6.598475e-156
## OASL     1.975803e-158   3.065838 0.957 0.182 2.776596e-154
## IRF7     4.890417e-149   2.576510 0.980 0.463 6.872503e-145

这些基因和我们在上一步中挑选的基因是不是基本一样？

画图展示，可以看出CD3D和GNLY在两组间表达差异不大，而IF16差异很大。

FeaturePlot(immune.combined, 
            features = c("CD3D", "GNLY", "IFI6"), # CD3D和GNLY对干扰素没反应哦~
            split.by = "stim", 
            max.cutoff = 3,
            cols = c("grey", "red"))

小提琴图也可以展示类似的效果：

plots <- VlnPlot(immune.combined, features = c("LYZ", "ISG15", "CXCL10"), split.by = "stim", group.by = "celltype",
    pt.size = 0, combine = FALSE)
## The default behaviour of split.by has changed.
## Separate violin plots are now plotted side-by-side.
## To restore the old behaviour of a single split violin,
## set split.plot = TRUE.
##       
## This message will be shown once per session.
wrap_plots(plots = plots, ncol = 1)

多个单细胞数据集的整合之CCA+SCTransform

和上面的步骤基本一样，就是把NormalizeData/FindVariableFeatures/ScaleData合成一步：SCTransform。

rm(list = ls())

ifnb <- LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")

# 这一步代替了NormalizeData/FindVariableFeatures/ScaleData，非常费时间
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 12747 by 6548
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 5000 cells
##                                                                          
  |======================================================================| 100%
## Found 77 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 12747 genes
## 
  |======================================================================| 100%
## Computing corrected count matrix for 12747 genes
##                                                                           
  |======================================================================| 100%
## Calculating gene attributes
## Wall clock passed: Time difference of 1.038463 mins
## Determine variable features
## Place corrected count matrix in counts slot
## Centering data matrix
## Set default assay to SCT
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 12658 by 7451
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 5000 cells
## 
  |======================================================================| 100%
## Found 73 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 12658 genes
##                                                                        
  |======================================================================| 100%
## Computing corrected count matrix for 12658 genes
##                                                                         
  |======================================================================| 100%
## Calculating gene attributes
## Wall clock passed: Time difference of 1.075562 mins
## Determine variable features
## Place corrected count matrix in counts slot
## Centering data matrix
## Set default assay to SCT
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)

然后也是找锚点，整合数据：

immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
    anchor.features = features)
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 14831 anchors
## Filtering anchors
##  Retained 11318 anchors
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data

再往下就是一模一样的步骤了：

# SCTransform包含scale，这里就不用再次scale了
immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE) |> 
  FindNeighbors(dims = 1:15) |> 
  FindClusters(resolution = 0.5) |> 
  RunUMAP(dims=1:15) |> 
  RunTSNE(dims=1:15)
## 17:13:19 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:13:19 Read 13999 rows and found 30 numeric columns
## 17:13:19 Using Annoy for neighbor search, n_neighbors = 30
## 17:13:19 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:13:21 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a46a83352
## 17:13:21 Searching Annoy index using 1 thread, search_k = 3000
## 17:13:23 Annoy recall = 100%
## 17:13:23 Commencing smooth kNN distance calibration using 1 thread
## 17:13:24 Initializing from normalized Laplacian + noise
## 17:13:24 Commencing optimization for 200 epochs, with 601594 positive edges
## 17:13:37 Optimization finished

画图展示效果：

p1 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "seurat_annotations", label = TRUE,repel = TRUE)
p1 + p2

多个单细胞数据集的整合之harmony

rm(list = ls())

ifnb <- LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")

先使用merge()合并。

ifnb.merge <- merge(x=ifnb.list[[1]],
                    y=ifnb.list[-1],
                    add.cell.ids= names(ifnb.list),
                    project="ifnb.harmoy")
head(colnames(ifnb.merge))
## [1] "CTRL_AAACATACATTTCC.1" "CTRL_AAACATACCAGAAA.1" "CTRL_AAACATACCTCGCT.1"
## [4] "CTRL_AAACATACCTGGTA.1" "CTRL_AAACATACGATGAA.1" "CTRL_AAACATACGGCATT.1"

接下来也是NormalizeData/FindVariableFeatures/ScaleData，这几步用SCTransform也是可以的：

# 这一步用SCT代替也可以
ifnb.merge <- ifnb.merge |> 
  Seurat::NormalizeData() |>
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000) |>
  ScaleData() |>
  RunPCA()
## Centering and scaling data matrix

然后再进行harmony整合样本：

library(harmony)
## Loading required package: Rcpp
ifnb.merge <- RunHarmony(ifnb.merge, 
                         group.by.vars = "stim")
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony 6/10
## Harmony 7/10
## Harmony 8/10
## Harmony converged after 8 iterations

names(ifnb.merge@reductions)
## [1] "pca"     "harmony"

后面又是标准步骤，聚类，分群。

ifnb.merge <- RunUMAP(ifnb.merge,  dims = 1:15, reduction = "harmony") |> 
  FindNeighbors(reduction = "harmony", dims = 1:15) |> 
  FindClusters(resolution = 0.5) |> 
  RunTSNE(dims=1:15,reduction = "harmony")
## 17:14:09 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:14:09 Read 13999 rows and found 15 numeric columns
## 17:14:09 Using Annoy for neighbor search, n_neighbors = 30
## 17:14:09 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:14:10 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a4682a6add
## 17:14:10 Searching Annoy index using 1 thread, search_k = 3000
## 17:14:12 Annoy recall = 100%
## 17:14:13 Commencing smooth kNN distance calibration using 1 thread
## 17:14:13 Initializing from normalized Laplacian + noise
## 17:14:14 Commencing optimization for 200 epochs, with 597860 positive edges
## 17:14:26 Optimization finished

画图展示效果：

p1 <- DimPlot(ifnb.harmony, reduction = "tsne", group.by = "stim")
p2 <- DimPlot(ifnb.harmony, reduction = "tsne", group.by = "seurat_annotations", label = TRUE,repel = TRUE)
p1 + p2