学习单细胞转录组肯定先来一遍Seurat官网的标准流程。 数据来源于Peripheral Blood Mononuclear Cells (PBMC),共2700个单细胞, Illumina NextSeq 500平台。下载链接在这:https://cf./samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz 解压到目录下,有以下三个文件,也是10X的标准文件 barcodes.tsv ,genes.tsv, matrix.mtx 一 加载R包 数据集 Seurat可以直接用Read10X函数读取cellranger的结果数据,使用pbmc数据初始化Seurat对象 library(dplyr)library(Seurat) library(patchwork) # Load the PBMC dataset pbmc.data <- Read10X(data.dir = "./data/hg19/") #解压缩后的路径 # Initialize the Seurat object with the raw (non-normalized data). pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc 查看count matrix矩阵 #查看数据dim(pbmc.data) # 32738 2700 # Lets examine a few genes in the first thirty cells pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30] 3 x 30 sparse Matrix of class "dgCMatrix" [[ suppressing 30 column names 'AAACATACAACCAC-1’, 'AAACATTGAGCTAC-1’, 'AAACATTGATCAGC-1’ ... ]] CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5 TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . . MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . . 其中的.表示0,即no molecules detected。seurat使用一个稀疏矩阵来保存count matrix,节省存储空间。 二 数据预处理 2.1 QC 一般使用以下三个标准,也可以参考commonly used
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Show QC metrics for the first 5 cells head(pbmc@meta.data, 5)## orig.ident nCount_RNA nFeature_RNA percent.mt ## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759 ## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958 ## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363 ## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845 ## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898 注意PercentageFeatureSet函数可以计算每个CELL中的指定基因子集的计数百分比。 小提琴图:查看基因数目, UMI数目, 线粒体基因占比 # Visualize QC metrics as a violin plotVlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) 基因数目, 线粒体基因占比与UMI数目相关性 # FeatureScatter is typically used to visualize feature-feature relationships, but can be used# for anything calculated by the object, i.e. columns in object metadata, PC scores etc. plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 + plot2 过滤
2.2 数据标准化默认标准化方法为LogNormalize,标化后的数据存在pbmc[["RNA"]]@data中。 pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)#pbmc <- NormalizeData(pbmc) 如果我记不住这些[[ 和 @ 怎么办? 使用最“简单,通用”的方式,str(pbmc) 查看一下数据结构,然后根据结构对应使用@ 或者 $ 。 str(pbmc)#截取部分展示用 Formal class 'Seurat' [package "SeuratObject"] with 13 slots ..@ assays :List of 1 .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots .. .. .. .. .. ..@ i : int [1:2238732] 29 73 80 148 163 184 186 227 229 230 ... .. .. .. .. .. ..@ p : int [1:2639] 0 779 2131 3260 4220 4741 5522 6304 7094 7626 ... .. .. .. .. .. ..@ Dim : int [1:2] 13714 2638 .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ... .. .. .. .. .. .. ..$ : chr [1:2638] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ... .. .. .. .. .. ..@ x : num [1:2238732] 1 1 2 1 1 1 1 41 1 1 ... .. .. .. .. .. ..@ factors : list() .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots .. .. .. .. .. ..@ i : int [1:2238732] 29 73 80 148 163 184 186 227 229 230 ... .. .. .. .. .. ..@ p : int [1:2639] 0 779 2131 3260 4220 4741 5522 6304 7094 7626 ... .. .. .. .. .. ..@ Dim : int [1:2] 13714 2638 .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ... .. .. .. .. .. .. ..$ : chr [1:2638] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ... .. .. .. .. .. ..@ x : num [1:2238732] 1.64 1.64 2.23 1.64 1.64 ... .. .. .. .. .. ..@ factors : list() .. .. .. ..@ scale.data : num[0 , 0 ] .. .. .. ..@ key : chr "rna_" .. .. .. ..@ assay.orig : NULL .. .. .. ..@ var.features : chr(0) .. .. .. ..@ meta.features:'data.frame':13714 obs. of 0 variables .. .. .. ..@ misc : list() ..@ meta.data :'data.frame':2638 obs. of 5 variables: .. ..$ orig.ident : Factor w/ 1 level "pbmc3k": 1 1 1 1 1 1 1 1 1 1 ... .. ..$ nCount_RNA : num [1:2638] 2419 4903 3147 2639 980 ... .. ..$ nFeature_RNA: int [1:2638] 779 1352 1129 960 521 781 782 790 532 550 ... .. ..$ percent.mt : num [1:2638] 3.02 3.79 0.89 1.74 1.22 ... .. ..$ percent.HB : num [1:2638] 0 0 0 0 0 0 0 0 0 0 ... 根据层级结构找到 data 即可, pbmc@assays$RNA@data (标准化后的数据) 另:pbmc@assays$RNA@counts为原始count数据 简单看一下标准化前后的数据par(mfrow = c(1,2))hist(colSums(pbmc$RNA@counts),breaks = 50) hist(colSums(pbmc$RNA@data),breaks = 50) 2.3 高变基因(特征选择)选择数据集中高变异的特征子集(在某些细胞中高表达,在其他细胞中低表达)。通过Seurat内置的FindVariableFeatures()函数,计算每一个基因的均值和方差,默认选择高变的2000个基因用于下游分析。 标示Top10的基因 以及 标示目标基因 pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)# Identify the 10 most highly variable genes top10 <- head(VariableFeatures(pbmc), 10) # plot variable features with and without labels plot1 <- VariableFeaturePlot(pbmc) #标示TOP10的基因 plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) #标示感兴趣的基因 plot3 <- LabelPoints(plot = plot1, points = c(top10,"HLA-DPB1","CCL4"), repel = TRUE) plot2 + plot3 2.4 归一化使用ScaleData()函数线性转换("归一化")数据,使得每一个基因在所有cell中的表达均值为0,方差为1. 结果在pbmc[["RNA"]]@scale.data #同样可以通过str的方式根据数据结构获取。 all.genes <- rownames(pbmc)pbmc <- ScaleData(pbmc, features = all.genes) 有需要的话也可以移出不必要的来源数据? pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")三 PCA 降维 print(pbmc[["pca"]], dims = 1:5, nfeatures = 5) PC_ 1 Positive: CST3, TYROBP, LST1, AIF1, FTL Negative: MALAT1, LTB, IL32, IL7R, CD2 PC_ 2 Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 Negative: NKG7, PRF1, CST7, GZMA, GZMB PC_ 3 Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1 Negative: PPBP, PF4, SDPR, SPARC, GNG11 PC_ 4 Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 Negative: VIM, IL7R, S100A6, S100A8, IL32 PC_ 5 Positive: GZMB, FGFBP2, S100A8, NKG7, GNLY Negative: LTB, IL7R, CKB, MS4A7, RP11-290F20.3 3.2 可视化展示 1)dims指定展示几个PCs , nfeatures指定每个PC展示多少基因 VizDimLoadings(pbmc, dims = 1:2,nfeatures = 20, reduction = "pca") 2) PCA点图 DimPlot(pbmc, reduction = "pca")3) PCA热图 DimHeatmap(pbmc,dims = 1:4, #几个PCs cells = 600, #多少cell ncol = 2, #图形展示几列 balanced = TRUE) 四 确定维度 如何确定使用多少PCA呢?以下2种方法科研作为参考 4.1 JackStrawPlot使用JackStrawPlot()函数对PCA结果进行可视化,"重要"的PC 将会显示在虚线上方且P值较低,本示例中PC10-PC12后显著性下降较明显。较耗时。 # NOTE: This process can take a long time for big datasets, comment out for expediency. More# approximate techniques such as those implemented in ElbowPlot() can be used to reduce # computation time pbmc <- JackStraw(pbmc, num.replicate = 100) pbmc <- ScoreJackStraw(pbmc, dims = 1:20) JackStrawPlot(pbmc, dims = 1:15) 4.2 Elbow图展示每个主成分对数据方差的解释情况(百分比表示),并进行排序。发现第9个主成分是一个拐点,后续的主成分(PC)变化不大。 ElbowPlot(pbmc)A:以上结果“供参考”; B:Seurat官网鼓励用户使用不同数量的PC(10、15,甚至50)重复下游分析,虽然结果通常没有显著差异。 C:建议设置此参数时偏高一些,较少维度进行下游分析可能会对结果产生一些负面影响。 4.3 显著相关基因这个也可以作为后面分析选择基因的一个参考。 #Returns a set of genes, based on the JackStraw analysis, that have statistically significant associations with a set of PCs.?PCASigGenes head(PCASigGenes(pbmc,pcs.use=2,pval.cut = 0.7)) [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" 五 聚类 基于PCA结果进行聚类; 大约3K细胞的单细胞数据集,将resolution参数设置在0.4-1.2之间,数据集增加resolution 值对应增加。这个参数的设置目前没有找到标准,有清楚的小伙伴欢迎后台告知,谢谢。 pbmc <- FindNeighbors(pbmc, dims = 1:10)pbmc <- FindClusters(pbmc, resolution = 0.5) # Look at cluster IDs of the first 5 cells head(Idents(pbmc), 5) head(pbmc@active.ident,5) #同上 AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1 0 3 2 1 6 Levels: 0 1 2 3 4 5 6 7 8 5.1 查看指定cluster的cellhead(subset(as.data.frame(pbmc@active.ident),pbmc@active.ident=="2"))pbmc@active.identAAACATTGATCAGC-1 2AAACGCACTGGTAC-1 2 AAAGCCTGTATGCG-1 2 AAATCAACTCGCAA-1 2 AAATGTTGCCACAA-1 2 AACACGTGCAGAGG-1 2 5.2 提取某一cluster细胞subpbmc<-subset(x = pbmc,idents="2")subpbmc head(subpbmc@active.ident,5) AAACATTGATCAGC-1 AAACGCACTGGTAC-1 AAAGCCTGTATGCG-1 AAATCAACTCGCAA-1 AAATGTTGCCACAA-1 2 2 2 2 2 Levels: 2 六 非线性降维聚类 建议使用与聚类分析相同的pc维度进行非线性降维度分析,如tSNE和UMAP,并可视化展示。 6.1 UMAP# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =# 'umap-learn') pbmc <- RunUMAP(pbmc, dims = 1:10) # note that you can set `label = TRUE` or use the LabelClusters function to help label # individual clusters DimPlot(pbmc, reduction = "umap") 6.2 tSNEpbmc <- RunTSNE(pbmc, dims = 1:10)head(pbmc@reductions$tsne@cell.embeddings) DimPlot(pbmc, reduction = "tsne") 6.3 比较# note that you can set `label = TRUE` or use the LabelClusters function to help label# individual clusters plot1<-DimPlot(pbmc, reduction = "umap",label = TRUE) plot2<-DimPlot(pbmc, reduction = "tsne",label = TRUE) plot1 + plot2 可以看出,两者的降维可视化的结构是一致的,UMAP方法相对更加紧凑。 七 差异表达基因 Seurat可以通过FindMarkers函数 和 FindAllMarkers函数寻找不同cluster的差异表达基因。 min.pct参数:设定在两个细胞群中任何一个被检测到的百分比,通过此设定不检测很少表达基因来缩短程序运行时间。默认0.1 thresh.test参数:设定在两个细胞群中基因差异表达量。可以设置为0 ,程序运行时间会更长。 max.cells.per.ident参数:每个类群细胞抽样设置;也可以缩短程序运行时间。 7.1 特定cluster是one-others的差异分析方法,由ident.1来制定cluster,本例就是cluster2与其余的cluster来做比较。 # find all markers of cluster 2cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25) head(cluster2.markers, n = 5)## p_val avg_log2FC pct.1 pct.2 p_val_adj ## IL32 2.593535e-91 1.2154360 0.949 0.466 3.556774e-87 ## LTB 7.994465e-87 1.2828597 0.981 0.644 1.096361e-82 ## CD3D 3.922451e-70 0.9359210 0.922 0.433 5.379250e-66 ## IL7R 1.130870e-66 1.1776027 0.748 0.327 1.550876e-62 ## LDHB 4.082189e-65 0.8837324 0.953 0.614 5.598314e-61 7.2 指定cluster本例为cluster5 和 cluster0_cluster3的差异 # find all markers distinguishing cluster 5 from clusters 0 and 3cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) head(cluster5.markers, n = 5) p_val avg_log2FC pct.1 pct.2 p_val_adjFCGR3A 8.331882e-208 4.261784 0.975 0.040 1.142634e-203 CFD 1.932644e-198 3.423863 0.938 0.036 2.650429e-194 IFITM3 2.710023e-198 3.876058 0.975 0.049 3.716525e-194 CD68 1.069778e-193 3.013656 0.926 0.035 1.467094e-189 RP11-290F20.3 4.218926e-190 2.722303 0.840 0.016 5.785835e-186 7.3 FindAllMarkers每个cluster分别与其他所有cluster进行比较,展示各个cluster的前2个基因 # find markers for every cluster compared to all remaining cells, report only the positive onespbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC) # A tibble: 18 x 7# Groups: cluster [9]p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> 1 1.88e-117 1.08 0.913 0.588 2.57e-113 0 LDHB 2 5.01e- 85 1.34 0.437 0.108 6.88e- 81 0 CCR7 3 0 5.57 0.996 0.215 0 1 S100A9 4 0 5.48 0.975 0.121 0 1 S100A8 5 1.93e- 80 1.27 0.981 0.65 2.65e- 76 2 LTB 6 2.91e- 58 1.27 0.667 0.251 3.98e- 54 2 CD2 7 0 4.31 0.939 0.042 0 3 CD79A 8 1.06e-269 3.59 0.623 0.022 1.45e-265 3 TCL1A 9 3.60e-221 3.21 0.984 0.226 4.93e-217 4 CCL5 10 4.27e-176 3.01 0.573 0.05 5.85e-172 4 GZMK 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1 13 3.17e-267 4.83 0.961 0.068 4.35e-263 6 GZMB 14 1.04e-189 5.28 0.961 0.132 1.43e-185 6 GNLY 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1 17 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4 18 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP ?FindAllMarkers查看更多参数。 cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25,test.use = "roc", # 检验的方式 only.pos = TRUE) #只输出pos的基因 其中test.use 可选参数有:wilcox(默认),bimod,roc,t,negbinom,poisson,LR,MAST,DESeq2。 7.4 可视化可以通过seurat的内置函数查看重点基因的基本情况: 1)VlnPlot: 基因表达概率分布 VlnPlot(pbmc, features = c("MS4A1", "CD79A"))# you can plot raw counts as well VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
"CD8A")) 3)DoHeatmap()查看重点基因细胞和cluster的表达热图 top10 <- pbmc.markers %>%group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) DoHeatmap(pbmc, features = top10$gene) + NoLegend() 此外,还建议使用RidgePlot()、CellScatter()和DotPlot()查看数据集的情况。 这也可以作为后续手动注释的一些参考。 参考资料:https:///seurat/articles/pbmc3k_tutorial.html#standard-pre-processing-workflow-1 ◆ ◆ ◆ ◆ ◆ 精心整理(含图PLUS版)|R语言生信分析,可视化(R统计,ggplot2绘图,生信图形可视化汇总)
|
|