【原】单细胞亚群合并与提取（2021公开课配套笔记）

健明 2021-07-14

展开全文

新课发布在B站了，马上有热心的粉丝看完后写了配套笔记：

下面是粉丝linbo的笔记投稿

前言

视频地址：https://www.bilibili.com/video/BV19Q4y1R7cu

一般来讲，不同亚群的细胞具有不同的功能，因此亚群分析对于研究十分重要，会影响后续分析结果。

进行细胞亚群的分群的依据：

细胞异质性（每个细胞都有独特的表达模式和功能，都有自己特有的基因）；
细胞共性（同一类型的细胞都有近似的表达模式）；
生物学基础知识（基于已有的知识，对细胞进行分类鉴定）

单细胞聚类很少能一次性得到符合预期的结果，需要对结果不断调整，如细胞亚群数目调整。如上文中所言，Seurat工具可通过调整FindClusters函数中的resolution参数进行调整细胞亚群数目，一般在0.1-1之间，值越大，亚群数目越多，但是亚群数目过多，后续分析越耗力耗神。另一种方法是对感兴趣的细胞亚群进行亚群细分或者对相近的细胞亚群进行合并，进而来增加或减少亚群数目。

然而在单细胞数据分析中，一般初步的亚群分类结果可能将同类细胞分成若亚类。基于分子标记，我们可以将其归为同类细胞的亚类重新合并。

下面以 seurat 官方教程为例：

## 魔幻清空
rm(list = ls())
## 加载R包
library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)

## 导入上节保存的数据
load(file = 'basic.sce.pbmc.Rdata')
levels(Idents(pbmc)) #查看细胞亚群 
# [1] "Naive CD4 T"  "CD14+ Mono"   "Memory CD4 T" "B"            "CD8 T"        "FCGR3A+ Mono"
# [7] "NK"           "DC"           "Platelet"  

## UMAP可视化
DimPlot(pbmc, reduction = 'umap', 
        label = TRUE, pt.size = 0.5) + NoLegend()

亚群合并

假如我们的生物学背景知识不够，不需要把T细胞分成 "Naive CD4 T" , "Memory CD4 T" , "CD8 T", "NK" 这些亚群，因此可以将这些亚群合并为T细胞这个大的亚群：

##  'PTPRC', 'CD3D', 'CD3E','CD4', 'CD8A','FOXP3','KLRD1', ## Tcells
##  'CD19', 'CD79A', 'MS4A1' , # B cells
##  'IGHG1', 'MZB1', 'SDC1',  # plasma 
##  'CD68', 'CD163', 'CD14', 'CD86','CCL22', 'S100A4', ## DC(belong to monocyte)
##  'CD68', 'CD163', 'MRC1', 'MSR1', 'S100A8', ## Macrophage (belong to monocyte)
##  'PRF1', 'NKG7', 'KLRD1', ## NKcells
##  'PPBP', ##Platelet

sce=pbmc  ##为防止影响pbmc本来的数据，接下来将以sce变量进行
genes_to_check = c('PTPRC', 'CD3D', 'CD3E',  'PRF1' , 'NKG7', 
                   'CD19', 'CD79A', 'MS4A1' ,
                   'CD68', 'CD163', 'CD14',
                   'FCER1A', 'PPBP' )
DotPlot(sce, group.by = 'seurat_clusters',
        features = unique(genes_to_check)) + RotatedAxis()

接下来我们对亚群进行合并，目的只能区分 B、DC、Mono、Platelet、T 这5个细胞亚群。

方法一：使用 RenameIdents 函数

levels(Idents(sce)) #查看细胞亚群，与上述结果一致
## 根据levels(Idents(sce)) 顺序重新赋予对应的 B、DC、Mono、Platelet、T 这5个细胞亚群名称，顺序不能乱
new.cluster.ids <- c("T", "Mono", "T", 
                     "B", "T", "Mono",
                     "T", "DC", "Platelet")
names(new.cluster.ids) <- levels(sce)
sce <- RenameIdents(sce, new.cluster.ids)
levels(sce) #查看是否已改名
# [1] "T"        "Mono"     "B"        "DC"       "Platelet"

DimPlot(sce, reduction = 'umap', 
        label = TRUE, pt.size = 0.5) + NoLegend()

方法二：使用unname函数配合向量：

cluster2celltype <- c("0"="T",
                      "1"="Mono", 
                      "2"="T", 
                      "3"= "B", 
                      "4"= "T", 
                      "5"= "Mono",
                      "6"= "T", 
                      "7"= "DC", 
                      "8"= "Platelet")
sce[['cell_type']] = unname(cluster2celltype[sce@meta.data$seurat_clusters])
DimPlot(sce, reduction = 'umap', group.by = 'cell_type',
        label = TRUE, pt.size = 0.5) + NoLegend()

方法三：使用数据框

(n=length(unique(sce@meta.data$seurat_clusters))) #获取亚群个数
celltype=data.frame(ClusterID=0:(n-1),
                    celltype='unkown')  #构建数据框
                    
## 判断亚群ID属于那类细胞
celltype[celltype$ClusterID %in% c(0,2,4,6),2]='T'
celltype[celltype$ClusterID %in% c(3),2]='B'
celltype[celltype$ClusterID %in% c(1,5),2]='Mono' 
celltype[celltype$ClusterID %in% c(7),2]='DC' 
celltype[celltype$ClusterID %in% c(8),2]='Platelet'

## 重新赋值
sce@meta.data$celltype = "NA"
for(i in 1:nrow(celltype)){
  sce@meta.data[which(sce@meta.data$seurat_clusters == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
table(sce@meta.data$celltype)
#        B       DC     Mono Platelet        T 
#      342       32      642       14     1608 

DimPlot(sce, reduction = 'umap', group.by = 'celltype',
        label = TRUE, pt.size = 0.5) + NoLegend()

## 查看多种方法结果
table(sce@meta.data$cell_type,
      sce@meta.data$celltype)
#              B   DC Mono Platelet    T
#  B         342    0    0        0    0
#  DC          0   32    0        0    0
#  Mono        0    0  642        0    0
#  Platelet    0    0    0       14    0
#  T           0    0    0        0 1608

## 可视化一下
p1=DimPlot(sce, reduction = 'umap', group.by = 'celltype',
           label = TRUE, pt.size = 0.5) + NoLegend()
p2=DotPlot(sce, group.by = 'celltype',
           features = unique(genes_to_check)) + RotatedAxis()
p1+p2

最后总结一下，合并亚群其实就是亚群重命名，实现的基本原理就是将分析过程聚类亚群0-8重新命名，核心技术是R语言中匹配替换功能。当然这仅仅是示例，实际分析过程中可能不会这么简单，会略微复杂，但是万变不离其宗，掌握好R语言基本功是硬道理。所谓基础不牢，地动山摇，希望大家在前进的道路上可以温故而知新。好了，不说了，我要去学R了。

亚群提取

在单细胞分析过程中，我们通常会对感兴趣的细胞亚群进行亚群细分，这样可以把一个亚群或者多个亚群提取出来，然后再进行亚群细分。亚群细分有两种方法：第一种，调整FindClusters函数中的resolution参数使亚群数目增多；第二种，将此亚群提取出来，再重新降维聚类分群。

上文我们将几个亚群合并为T细胞这个大亚群。接下来我们看看如何对这部分细胞亚群进行亚群细分。

## 重新读取数据
rm(list = ls())
library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)
load(file = 'basic.sce.pbmc.Rdata')
sce=pbmc

首先定位到感兴趣的亚群

levels(sce)
# [1] "Naive CD4 T"  "CD14+ Mono"   "Memory CD4 T" "B"            "CD8 T"        "FCGR3A+ Mono"
# [7] "NK"           "DC"           "Platelet" 

genes_to_check = c('PTPRC', 'CD3D', 'CD3E', 
                   'CD4','IL7R','NKG7','CD8A')

p1=DimPlot(sce, reduction = 'umap', group.by = 'seurat_clusters',
           label = TRUE, pt.size = 0.5) + NoLegend()
p2=DotPlot(sce, group.by = 'seurat_clusters',
           features = unique(genes_to_check)) + RotatedAxis()
p1+p2

假设这个时候，我们想提取CD4的T细胞，那么根据上文聚类0/2/4/6均为T细胞，其中0和2表达CD4相对4/6较高，但是其实示例里面的CD4的T细胞并不怎么高表达CD4，在此不深究，继续向下走。

提取指定单细胞亚群

cd4_sce1 = sce[,sce@meta.data$seurat_clusters %in% c(0,2)]
cd4_sce2 = sce[, Idents(sce) %in% c( "Naive CD4 T" ,  "Memory CD4 T" )]
cd4_sce3 = subset(sce,seurat_clusters %in% c(0,2))
## 较简单，核心原理就是R里取子集的3种策略：逻辑值，坐标，名字

重新降维聚类分群

sce=cd4_sce1
sce <- NormalizeData(sce, normalization.method = "LogNormalize", scale.factor = 1e4) 
sce <- FindVariableFeatures(sce, selection.method = 'vst', nfeatures = 2000)
sce <- ScaleData(sce, vars.to.regress = "percent.mt")
sce <- RunPCA(sce, features = VariableFeatures(object = sce)) 
sce <- FindNeighbors(sce, dims = 1:10)
sce <- FindClusters(sce, resolution = 1 )
head(Idents(sce), 5)
table(sce$seurat_clusters) 
sce <- RunUMAP(sce, dims = 1:10)
DimPlot(sce, reduction = 'umap')

genes_to_check = c('PTPRC', 'CD3D', 'CD3E', 'FOXP3',
                   'CD4','IL7R','NKG7','CD8A')
DotPlot(sce, group.by = 'seurat_clusters',
        features = unique(genes_to_check)) + RotatedAxis()
# 亚群水平 
p1=DimPlot(sce, reduction = 'umap', group.by = 'seurat_clusters',
           label = TRUE, pt.size = 0.5) + NoLegend()
p2=DotPlot(sce, group.by = 'seurat_clusters',
           features = unique(genes_to_check)) + RotatedAxis()
p1+p2

可以看到，这次重新降维聚类分群已经是非常的勉强，各细胞亚群之间的界限并不是很清晰。仅仅提取出来CD4的T细胞进行细分亚群，而结果又多出来了一个CD8 T细胞亚群，令人费解。尝试将resolution改成0.5再试一次，结果如下：

这次分群比上次要好点，但还是难以接受，掌握技能而已，不纠结，后面再深入探查。

单细胞分多少群合适

我的细胞到底分多少个群是合适的？这个问题是像我这样刚入门的菜鸡做容易产生的问题，每个细胞都是独一无二的，也就是最小可以以单细胞为单位，但是高通量的单细胞测序技术是数以万计的细胞，一个一个看太费眼了，关键是我要毕业，时间不等人。那么，我的细胞到底分多少个群是合适的？

那我们来试试clustree，首先依旧是读取我们的数据

## 重新读取数据
rm(list = ls())
library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)
load(file = 'basic.sce.pbmc.Rdata')
sce=pbmc
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5) #直接使用了官方resolution = 0.5 

## 实际上resolution 是可以多次调试的,执行不同resolution 下的分群
library(clustree)
sce <- FindClusters(
  object = sce,
  resolution = c(seq(.1,1.6,.2))
)
clustree(sce@meta.data, prefix = "RNA_snn_res.")

如图所示，可以看到不同的resolution ，分群的变化情况。红框圈出来的对应我们使用的 resolution = 0.5 ，聚类9个亚群。根据动态分群的树，可见3对应的B细胞亚群，无论怎么样调整参数，都很难细分了，同样还有7对应DC亚群，和8对应的Platelet亚群也是很难再细分啦。但是T细胞和monocyte还有进一步细分的可能性！

p1=DimPlot(sce, reduction = 'umap', group.by = 'RNA_snn_res.0.5',
           label = TRUE, pt.size = 0.5) + NoLegend()
p2=DimPlot(sce, reduction = 'umap',group.by = 'RNA_snn_res.1.5',
           label = TRUE, pt.size = 0.5) + NoLegend()
p1+p2

这个时候就需要根据研究目的判断一下了，即使有分群的可能性，但不代表一定要进行细分亚群，一味的追求细分亚群是没有意义的！

比如前面的CD4的T细胞亚群细分：

load(file = 'sce.cd4.subset.Rdata')

#先执行不同resolution 下的分群
library(Seurat)
library(clustree)
sce <- FindClusters(
  object = sce,
  resolution = c(seq(.1,1.6,.2))
)
clustree(sce@meta.data, prefix = "RNA_snn_res.")

这时候分群混乱，相互交错，个人觉得没啥意义。分群是为了下一步分析，切莫作茧自缚。

单细胞亚群抽样

单细胞的gsva, 细胞通讯，转录因子，拟时序, inferCNV这些分析，特别的消耗计算资源，因为实际项目中每个细胞亚群都是过万的细胞，希望可以将这些单细胞亚群进行抽样，使得其细胞数量一致。

rm(list = ls())
library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)
## 读取数据
load(file = 'basic.sce.pbmc.Rdata')
DimPlot(pbmc, reduction = 'umap', 
        label = TRUE, pt.size = 0.5) + NoLegend()
sce=pbmc

features= c('IL7R', 'CCR7','CD14', 'LYZ',  'IL7R', 'S100A4',"MS4A1", "CD8A",'FOXP3',
            'FCGR3A', 'MS4A7', 'GNLY', 'NKG7',
            'FCER1A', 'CST3','PPBP')

DoHeatmap(subset(sce ), 
          features = features, 
          size = 3
          )

如图，细胞数量较少的亚群在热图中占很少。

方法一：subset函数

table(Idents(sce))
#  Naive CD4 T   CD14+ Mono Memory CD4 T            B        CD8 T FCGR3A+ Mono           NK 
#          709          480          429          342          316          162          154 
#           DC     Platelet 
#           32           14 
## 可见最少的一个亚群Platelet细胞只有14个细胞，因此我们对每个亚群抽取15个细胞绘制热图
DoHeatmap(subset(sce, downsample = 15), 
          features = features, size = 3)

方法二：自定义函数

# 每个细胞亚群抽10 
allCells=names(Idents(sce))
allType = levels(Idents(sce))

choose_Cells = unlist(lapply(allType, function(x){
  cgCells = allCells[Idents(sce)== x ]
  cg=sample(cgCells,10)
  cg
}))

cg_sce = sce[, allCells %in% choose_Cells]
cg_sce
as.data.frame(table(Idents(cg_sce)))
#           Var1 Freq
# 1  Naive CD4 T   10
# 2   CD14+ Mono   10
# 3 Memory CD4 T   10
# 4            B   10
# 5        CD8 T   10
# 6 FCGR3A+ Mono   10
# 7           NK   10
# 8           DC   10
# 9     Platelet   10