[Original] Exploring 10X Genomics single-cell datasets

Jianming 2021-07-15

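The 10X Genomics website publishes a large number of datasets: https://support./single-cell-gene-expression/datasets You only need to enter an email address to download them, for example: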

Cell Ranger 2.1.0

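Every dataset comes with the raw FASTQ data, the aligned BAM file, the quantified expression matrix, and the clustering results, all produced by 10X Genomics' own bioinformatics pipeline: Single Cell Gene Expression Dataset by Cell Ranger 2.1.0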

1k Brain Cells from an E18 Mouse

Cells from a combined cortex, hippocampus and sub ventricular zone of an E18 mouse.

931 cells detected
Sequenced on Illumina HiSeq2500 with approximately 56,000 reads per cell
26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode
Analysis run with --cells=2000
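
As a rough back-of-the-envelope check (a sketch, assuming "reads per cell" means total reads divided by detected cells), the figures above imply about 52 million reads in total, which helps explain the size of the raw data:

```shell
# Figures copied from the dataset description above.
cells=931
reads_per_cell=56000   # approximate, as stated on the dataset page

# Assumption: total reads ~= detected cells x mean reads per cell.
total_reads=$((cells * reads_per_cell))
echo "approx. total reads: ${total_reads}"
```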

The raw data are very large, so I chose the dataset above, with close to 1,000 mouse brain cells, to test the Cell Ranger 2.1.0 pipeline.

Downloading the processed results directly

Because these datasets contain many cells, even the bare expression matrices are large. I downloaded a few to explore, as follows:

├── [ 76M]  neuron_9k_analysis.tar
├── [112M]  neuron_9k_cloupe.cloupe
├── [284M]  neuron_9k_filtered_gene_bc_matrices.tar
├── [507M]  neuron_9k_raw_gene_bc_matrices.tar
├── [ 67M]  pbmc4k_analysis.tar
├── [ 35M]  pbmc4k_cloupe.cloupe
├── [ 69M]  pbmc4k_filtered_gene_bc_matrices.tar
├── [133M]  pbmc4k_raw_gene_bc_matrices.tar
├── [ 78M]  pbmc8k_analysis.tar
├── [ 63M]  pbmc8k_cloupe.cloupe
├── [143M]  pbmc8k_filtered_gene_bc_matrices.tar
├── [253M]  pbmc8k_raw_gene_bc_matrices.tar
├── [ 58M]  t_4k_analysis.tar
├── [ 29M]  t_4k_cloupe.cloupe
├── [ 60M]  t_4k_filtered_gene_bc_matrices.tar
└── [131M]  t_4k_raw_gene_bc_matrices.tar

Downloading the raw sequencing data in FASTQ format

Here I again use 1k Brain Cells from an E18 Mouse, the smallest dataset, for testing:

├── [237M]  neurons_900_S1_L001_I1_001.fastq.gz
├── [642M]  neurons_900_S1_L001_R1_001.fastq.gz
├── [1.8G]  neurons_900_S1_L001_R2_001.fastq.gz
├── [238M]  neurons_900_S1_L002_I1_001.fastq.gz
├── [646M]  neurons_900_S1_L002_R1_001.fastq.gz
└── [1.8G]  neurons_900_S1_L002_R2_001.fastq.gz

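As you can see, the files for the two reads differ in size, and each lane has three files. This follows from the read layout: 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode. Only the read 2 FASTQ contains actual transcript sequence; the other two files hold barcodes (read 1 also carries the UMI).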
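
To make the read 1 layout concrete, here is a minimal sketch that splits one read into its barcode and UMI. The sequence is made up for illustration and is not taken from the dataset; only the 16 + 10 split comes from the dataset description.

```shell
# Hypothetical 26 bp read 1 sequence (invented for this example).
read1="AAACCTGAGCATCATCTTGCCGTACG"

# Layout from the dataset description: 16 bp Chromium barcode, then 10 bp UMI.
barcode=$(printf '%s' "$read1" | cut -c1-16)
umi=$(printf '%s' "$read1" | cut -c17-26)

echo "barcode=${barcode} (${#barcode} bp)"
echo "umi=${umi} (${#umi} bp)"
```

Cell Ranger does this parsing itself; the sketch is only meant to show why read 1 carries no transcript sequence.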

Alignment and quantification

You can run the analysis directly with Cell Ranger, using the following command:

/home/jianmingzeng/biosoft/10xgenomic/cellranger-2.1.0/cellranger count --id=neurons \
--localcores 5 \
--transcriptome=/home/jianmingzeng/biosoft/10xgenomic/db/refdata-cellranger-mm10-1.2.0 \
--fastqs=/home/jianmingzeng/data/public/10x/neurons_900_fastqs   \
--expect-cells=900

The output looks like this:

├── [ 18M]  cloupe.cloupe
├── [  17]  filtered_gene_bc_matrices
│   └── [  58]  mm10
│       ├── [ 15K]  barcodes.tsv
│       ├── [723K]  genes.tsv
│       └── [ 29M]  matrix.mtx
├── [4.1M]  filtered_gene_bc_matrices_h5.h5
├── [ 680]  metrics_summary.csv
├── [ 96M]  molecule_info.h5
├── [5.4G]  possorted_genome_bam.bam
├── [3.5M]  possorted_genome_bam.bam.bai
├── [  17]  raw_gene_bc_matrices
│   └── [  58]  mm10
│       ├── [ 13M]  barcodes.tsv
│       ├── [723K]  genes.tsv
│       └── [ 70M]  matrix.mtx
├── [ 10M]  raw_gene_bc_matrices_h5.h5
└── [3.2M]  web_summary.html

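The analysis folder contains many files, so it is not listed here. The aligned BAM file is what takes up most of the space; everything else is small enough to download to a local computer for viewing.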

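The most important output is the expression matrix under the filtered_gene_bc_matrices folder, which can be read directly by the R package Seurat for downstream processing: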

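library(Seurat)
library(dplyr)
library(Matrix)
# Read the filtered 10X matrix (barcodes.tsv, genes.tsv, matrix.mtx) as a sparse matrix
neurons.data <- Read10X(data.dir = "~/outs/filtered_gene_bc_matrices/mm10/")
# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = neurons.data))
dense.size
sparse.size <- object.size(x = neurons.data)
sparse.size
dense.size / sparse.size
# Seurat v2 argument names; Seurat v3 and later use counts = and min.features = instead
# Keep genes detected in at least 3 cells and cells with at least 200 detected genes
neurons <- CreateSeuratObject(raw.data = neurons.data, min.cells = 3, min.genes = 200,
    project = "10X_neurons")
neurons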
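
The sparse representation mirrors how matrix.mtx is stored on disk. It uses the Matrix Market coordinate format: lines starting with % are header or comment lines, the next line gives the number of genes, barcodes, and non-zero entries, and each remaining line is a triplet of gene index, barcode index, and count. The sketch below builds a tiny made-up file in that format (the values are invented, not from this dataset) and sums the counts per barcode with awk:

```shell
# Build a tiny, made-up matrix.mtx: 3 genes x 2 barcodes, 4 non-zero entries.
tmpdir=$(mktemp -d)
cat > "$tmpdir/matrix.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
%
3 2 4
1 1 5
3 1 2
2 2 7
3 2 1
EOF

# Skip '%' lines, take the barcode count from the size line,
# then sum the counts (column 3) per barcode index (column 2).
totals=$(awk '!/^%/ {
                if (!ncol) { ncol = $2; next }
                sum[$2] += $3
              }
              END { for (b = 1; b <= ncol; b++) printf "barcode %d: %d\n", b, sum[b] }' \
         "$tmpdir/matrix.mtx")
echo "$totals"
rm -r "$tmpdir"
```

Only the non-zero entries are stored, which is why the sparse object is so much smaller than the dense matrix when most gene and cell combinations have zero counts.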

For the full notes, see: Seurat, one of the three major R packages for single-cell transcriptomics