很多粉丝留言想听miRNA-seq数据分析流程,主要是上游分析流程,因为下游其实就是表达矩阵分析。跟RNA-seq拿到的counts矩阵是类似的分析策略,只不过是miRNA-seq热度已经过去了,我也仅仅是五年前接触过一次。其实miRNA-seq数据上游分析有两个方案,一个是仅仅是针对已知的miRNA进行定量,这样的话无需比对到物种参考基因组,仅仅是比对到miRNA序列合集即可。另外一个挖掘新的miRNA,主要是考虑到人类对miRNA的认知的不停的进步。当然,如果你想 系统性学习,可以参考我五年前在生信菜鸟团自学miRNA-seq分析系列笔记:- 自学miRNA-seq分析第八讲~miRNA-mRNA表达相关下游分析 | 生信菜鸟团
- 自学miRNA-seq分析第七讲~miRNA样本配对mRNA表达量获取 | 生信菜鸟团
- 自学miRNA-seq分析第六讲~miRNA表达量差异分析 | 生信菜鸟团
- 自学miRNA-seq分析第五讲~miRNA表达量获取 | 生信菜鸟团
- 自学miRNA-seq分析第四讲~测序数据比对 | 生信菜鸟团
- 自学miRNA-seq分析第三讲~公共测序数据下载 | 生信菜鸟团
- 自学miRNA-seq分析第二讲~学习资料的搜集 | 生信菜鸟团
- 自学miRNA-seq分析第一讲~文献选择与解读 | 生信菜鸟团
miRBase是miRNA研究领域最权威的数据库,提供miRNAs序列以及注释查询、定位、发卡序列查询,以及提供命名服务。截止到现在(2020-04-28 ),miRBase 22.1 共收录了271个物种的总共38589 条miRNA信息。其中人类的共收录了1917条pre-miRNA(前体),以及2656条成熟的miRNAs。见:http://www./前体miRNA和成熟的miRNA的区别,前体miRNA长度一般是70–120 碱基,一般是茎环结果,也就是发夹结构,所以叫做hairpin。成熟之后,一般22 个碱基。在 http://www./ 对应的ftp网站下载如下所示文件:mkdir -p ~/reference/miRNA cd ~/reference/miRNA
wget ftp:///pub/mirbase/CURRENT/hairpin.fa.gz ## 38589 reads wget ftp:///pub/mirbase/CURRENT/mature.fa.zip ## 48885 reads wget ftp:///pub/mirbase/CURRENT/hairpin.fa.zip wget ftp:///pub/mirbase/CURRENT/genomes/hsa.gff3
unzip hairpin.fa.zip unzip mature.fa.zip
grep sapiens mature.fa |wc -l # 2656 grep sapiens hairpin.fa |wc # 1917
## Homo sapiens perl -alne '{if(/^>/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;s/U/T/g if !/>/;print }' hairpin.fa > hairpin.human.fa perl -alne '{if(/^>/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;s/U/T/g if !/>/;print }' mature.fa > mature.human.fa # 这里值得一提的是miRBase数据库下载的序列,居然都是用U表示的,也就是说就是miRNA序列,而不是转录成该miRNA的基因序列,而我们测序的都是基因序列。 5.9M Mar 12 2018 hairpin.fa 1.5M Apr 29 15:09 hairpin.fa.gz 1.5M Apr 29 15:11 hairpin.fa.zip 263K Apr 29 15:13 hairpin.human.fa 523K Apr 29 15:11 hsa.gff3 3.7M Mar 12 2018 mature.fa 786K Apr 29 15:09 mature.fa.zip 196K Apr 29 15:13 mature.human.fa 构建miRNA-seq数据分析环境只需要使用上面的的文件即可。仍然是参考我五年前整理的流程,使用bowtie软件和SHRiMP软件进行比对,然后fastx_toolkit 和fastqc进行质量控制。conda create -n miRNA conda activate miRNA conda install -y -c bioconda hisat2 bowtie samtools fastp bowtie2 fastx_toolkit fastqc # sra-toolkits,subreads 也可以一并下载 # conda search fastx_toolkit # 耗费约1.2G的磁盘空间 因为SHRiMP软件太多年没有更新,所以放弃它,反正比对这个过程,有bowtie就ok了。针对miRBase数据库下载的序列构建bowtie索引需要注意的是bowtie和bowtie2不一样哦:- http://bowtie-bio./index.shtml
- http://bowtie-bio./bowtie2/index.shtml
bowtie-build mature.human.fa hsa-mature-bowtie-index bowtie-build hairpin.human.fa hsa-hairpin-bowtie-index 4.2M Apr 29 15:42 hsa-hairpin-bowtie-index.1.ebwt 20K Apr 29 15:42 hsa-hairpin-bowtie-index.2.ebwt 17K Apr 29 15:42 hsa-hairpin-bowtie-index.3.ebwt 39K Apr 29 15:42 hsa-hairpin-bowtie-index.4.ebwt 4.2M Apr 29 15:42 hsa-hairpin-bowtie-index.rev.1.ebwt 20K Apr 29 15:42 hsa-hairpin-bowtie-index.rev.2.ebwt 4.2M Apr 29 15:41 hsa-mature-bowtie-index.1.ebwt 7.1K Apr 29 15:41 hsa-mature-bowtie-index.2.ebwt 24K Apr 29 15:41 hsa-mature-bowtie-index.3.ebwt 15K Apr 29 15:41 hsa-mature-bowtie-index.4.ebwt 4.2M Apr 29 15:41 hsa-mature-bowtie-index.rev.1.ebwt 7.1K Apr 29 15:41 hsa-mature-bowtie-index.rev.2.ebwt 这里我们仍然是使用 RNA expression profiling of human iPSC-derived cardiomyocytes in a cardiac hypertrophy model. PLoS One 2014;9(9):e108051. PMID: 25255322 文章里面的 https://www.ncbi.nlm./geo/query/acc.cgi?acc=GSE60292 数据集,如下:GSM1470353 control-CM, experiment1 GSM1470354 ET1-CM, experiment1 GSM1470355 control-CM, experiment2 GSM1470356 ET1-CM, experiment2 GSM1470357 control-CM, experiment3 GSM1470358 ET1-CM, experiment3 SRR1542714 SRR1542715 SRR1542716 SRR1542717 SRR1542718 SRR1542719 ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR154/004/SRR1542714;ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1542714/SRR1542714.fastq.gz # 从14到19 # 可以对这6个样本,分开prefetch ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch SRR1542714 mkdir -p ~/reference/miRNA cd ~/reference/miRNA # step1 : download raw data mkdir miRNA_test && cd miRNA_test echo {14..19} |sed 's/ /\n/g' |while read id; \ do (~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch SRR15427${id} ) ;\ done # 下面是另外一个课题,可以参考比较 echo {44..59} |sed 's/ /\n/g' |while read id; \ do (~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch SRR77077${id} ) ;\ done # 另外,这样的下载方式,在中国大陆会非常慢!!! # 建议换成aspera哈 33M Apr 29 16:07 SRR1542714.sra 31M Apr 29 16:07 SRR1542715.sra 50M Apr 29 16:07 SRR1542716.sra 24M Apr 29 16:07 SRR1542717.sra 52M Apr 29 16:07 SRR1542718.sra 34M Apr 29 16:07 SRR1542719.sra dump=/home/yb77613/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump echo {14..19} |sed 's/ /\n/g' |while read id; \ do ( $dump -O ./ --gzip --split-3 SRR15427${id} ) ;\ done 50M Apr 29 16:14 SRR1542714.fastq.gz 47M Apr 29 16:15 SRR1542715.fastq.gz 75M Apr 29 16:15 SRR1542716.fastq.gz 35M Apr 29 16:15 SRR1542717.fastq.gz 81M Apr 29 16:16 SRR1542718.fastq.gz 50M Apr 29 16:16 SRR1542719.fastq.gz 清洗前后,都走一下fastqc图表,清洗主要是fastx_toolkit修剪,代码如下:ls *gz |while read id do echo $id # fastqc $id zcat $id| fastq_quality_filter -v -q 20 -p 80 -Q33 -i - -o tmp ; fastx_trimmer -v -f 1 -l 27 -m 15 -i tmp -Q33 -z -o ${id%%.*}_clean.fq.gz ; # fastqc ${id%%.*}_clean.fq.gz done fastq_quality_filter和fastx_trimmer两个命令,都是来自于fastx_toolkit软件包:- fastq_quality_filter 代表根据质量过滤序列
- fastx_trimmer 代表缩短FASTQ或FASTQ文件中的读数,根据质量修剪(剪切)序列。
#运行一次查看是否可以成功运行 #fastq_quality_filter 代表根据质量过滤序列 #-v,即verbose,报告序列的数目 #-q,即quality,保留碱基所要求的最小质量评分,低于此值将会去除 #-p,即percent,即序列中超过最小质量评分的碱基数目的最小百分率,低于这个比率将删除 #-Q33,即phred33,默认是按照phred64作为参考的 #-i,即infile,代表输入文件,注意不能是压缩文件,可以是FASTA/FASTQ都行。默认是STDIN,即标准输入 #-o,即outfile,代表输出文件(注意不是output dir即输出目录,此处输出是一个文件而不是文件夹),默认是STDOUT即标准输出,指定则输出到指定文件 fastq_quality_filter -v -q 20 -p 80 -Q33 -i SRR1542714.1.fastq -o tmp #生成第一步处理的临时文件
# 中间文件是 tmp ,到时候可以删除掉。
#fastx_trimmer 代表缩短FASTQ或FASTQ文件中的读数,根据质量修剪(剪切)序列。 #-v,即verbose,报告序列的数目 #-f,即first,代表保留第几个碱基,默认是保留第一个 #-l,即last,代表保留最后的碱基,默认是整个reads #-i,即infile,代表输入文件,注意不能是压缩文件,可以是FASTA/FASTQ都行。默认是STDIN,即标准输入 #-z,即compress,代表的是使用Gzip压缩输出 #-o,即outfile,代表输出文件(注意不是output dir即输出目录,此处输出是一个文件而不是文件夹),默认是STDOUT即标准输出,指定则输出到指定文件 fastx_trimmer -v -f 1 -l 27 -i tmp -Q33 -z -o SRR1542714.1_clean.fq.gz #生成最后的文件 所以 fastq_quality_filter 和 fastx_trimmer命令是有替代品的,就是需要去自行搜索,如果你是要建立好用的miRNA-seq数据分析环境,就需要自己耗费大量时间在里面哦。39M Apr 29 16:28 SRR1542714_clean.fq.gz 37M Apr 29 16:28 SRR1542715_clean.fq.gz 49M Apr 29 16:29 SRR1542716_clean.fq.gz 18M Apr 29 16:29 SRR1542717_clean.fq.gz 68M Apr 29 16:30 SRR1542718_clean.fq.gz 30M Apr 29 16:30 SRR1542719_clean.fq.gz mature=/home/yb77613/reference/miRNA/hsa-mature-bowtie-index hairpin=/home/yb77613/reference/miRNA/hsa-hairpin-bowtie-index ls *_clean.fq.gz |while read id do echo $id bowtie -n 0 -m1 --best --strata $mature $id -S ${id}_matrue.sam bowtie -n 0 -m1 --best --strata $hairpin $id -S ${id}_hairpin.sam done
ls *.sam|while read id ;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done rm *.sam 使用miRNA序列比对的推荐参数:-n 0 -m1 --best --strata ,理由参考:SRR1542714_clean.fq.gz # reads processed: 1520320 # reads with at least one reported alignment: 331645 (21.81%) # reads that failed to align: 1145386 (75.34%) # reads with alignments suppressed due to -m: 43289 (2.85%) Reported 331645 alignments # reads processed: 1520320 # reads with at least one reported alignment: 331482 (21.80%) # reads that failed to align: 1008271 (66.32%) # reads with alignments suppressed due to -m: 180567 (11.88%) Reported 331482 alignments 可以看到,比对率还是蛮低的,但是可以调整参数(允许错配碱基)的数量。29M Apr 29 20:15 SRR1542714_clean.fq.gz_hairpin.bam 28M Apr 29 20:15 SRR1542714_clean.fq.gz_matrue.bam 27M Apr 29 20:15 SRR1542715_clean.fq.gz_hairpin.bam 26M Apr 29 20:15 SRR1542715_clean.fq.gz_matrue.bam 36M Apr 29 20:15 SRR1542716_clean.fq.gz_hairpin.bam 35M Apr 29 20:15 SRR1542716_clean.fq.gz_matrue.bam 13M Apr 29 20:15 SRR1542717_clean.fq.gz_hairpin.bam 13M Apr 29 20:15 SRR1542717_clean.fq.gz_matrue.bam 49M Apr 29 20:15 SRR1542718_clean.fq.gz_hairpin.bam 48M Apr 29 20:15 SRR1542718_clean.fq.gz_matrue.bam 22M Apr 29 20:15 SRR1542719_clean.fq.gz_hairpin.bam 22M Apr 29 20:15 SRR1542719_clean.fq.gz_matrue.bam 批量走 samtools idxstats 得到表达量矩阵:ls *.bam |xargs -i samtools index {} ls *.bam|while read id ;do (samtools idxstats ${id} > ${id}.txt );done # samtools view matrue.bam |cut -f 3 |sort |uniq -c index=/home/yb77613/reference/index/bowtie/hg38 ls *_clean.fq.gz |while read id do echo $id bowtie2 -p 4 -x $index -U $id | samtools sort -@ 4 -o ${id}_genome.bam - done 使用默认参数,对其中一个样本的比对的log日志如下:SRR1542715_clean.fq.gz 1520320 reads; of these: 1520320 (100.00%) were unpaired; of these: 123178 (8.10%) aligned 0 times 333700 (21.95%) aligned exactly 1 time 1063442 (69.95%) aligned >1 times 91.90% overall alignment rate 可以看到,这个时候的比对率不得了,得到的bam文件如下:34M Apr 29 20:30 SRR1542714_clean.fq.gz_genome.bam 31M Apr 29 20:30 SRR1542715_clean.fq.gz_genome.bam 42M Apr 29 20:31 SRR1542716_clean.fq.gz_genome.bam 15M Apr 29 20:31 SRR1542717_clean.fq.gz_genome.bam 57M Apr 29 20:32 SRR1542718_clean.fq.gz_genome.bam 25M Apr 29 20:32 SRR1542719_clean.fq.gz_genome.bam 然后定量,需要使用下面的命令和参数,都是需要看软件文档摸索的。gtf=/home/yb77613/reference/miRNA/hsa.gff3 bin_featureCounts="/home/yb77613/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts";
$bin_featureCounts -T 4 -F gff -M -t miRNA -g Name -a $gtf -o all.counts.mature.txt *genome* 1>counts.mature.log 2>&1
$bin_featureCounts -T 4 -F gff -M -t miRNA_primary_transcript -g Name -a $gtf -o all.counts.hairpin.txt *genome* 1>counts.hairpin.log 2>&1 这样我们就有不同的定量流程拿到的不同的表达矩阵啦。在featureCounts软件的结果文件 all.counts.id.txt 里面的信息如下:hsa-miR-12136 chr1 632668 632685 - 18 26 21 50 6 78 19 hsa-miR-34a-5p chr1 9151735 9151756 - 22 3009 9494 3467 4136 5299 7097 hsa-miR-34a-3p chr1 9151693 9151714 - 22 88 153 100 86 126 132 查询其中一个,发现featureCounts软件定量少了80%,比如hsa-miR-12136在featureCounts流程,是 18 26 21 50 6 78 19,但是在miRBase数据库流程,都是五倍以上的表达量。SRR1542714_clean.fq.gz_matrue.bam.txt:hsa-miR-12136 18 104 0 SRR1542715_clean.fq.gz_matrue.bam.txt:hsa-miR-12136 18 182 0 SRR1542716_clean.fq.gz_matrue.bam.txt:hsa-miR-12136 18 109 0 SRR1542717_clean.fq.gz_matrue.bam.txt:hsa-miR-12136 18 44 0 SRR1542718_clean.fq.gz_matrue.bam.txt:hsa-miR-12136 18 235 0 SRR1542719_clean.fq.gz_matrue.bam.txt:hsa-miR-12136 18 92 0 查询另外2个,有意思的是其中一个发现featureCounts软件定量多了1倍,另外一个多了2倍。SRR1542714_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-5p 22 1745 0 SRR1542715_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-5p 22 5825 0 SRR1542716_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-5p 22 2015 0 SRR1542717_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-5p 22 2605 0 SRR1542718_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-5p 22 3081 0 SRR1542719_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-5p 22 4413 0
SRR1542714_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-3p 22 25 0 SRR1542715_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-3p 22 69 0 SRR1542716_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-3p 22 21 0 SRR1542717_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-3p 22 27 0 SRR1542718_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-3p 22 37 0 SRR1542719_clean.fq.gz_matrue.bam.txt:hsa-miR-34a-3p 22 57 0 比如,hsa-miR-34a-5p在featureCounts流程,是3009 9494 3467 4136 5299 7097,但是在miRBase数据库流程,都只有一半的表达量。对hsa-miR-34a-3p来说,在featureCounts流程,是88 153 100 86 126 132,但是在miRBase数据库流程,都是两倍的表达量。这样的定量差异,理论上是不会改变该miRNA的差异分析结果,因为是同样的膨胀系数。但是,这样的不确定性,让我们的miRNA-seq数据分析,蒙上了一层阴影。老规矩,我们的拉群小助手会协助大家进入一个miRNA-seq数据分析交流群哈, 跟我们之前的其它群类似:
还是老规矩,18.8元进群,一个简单的门槛,隔绝那些营销号!同时,我们也会在群里共享一些miRNA-seq数据分析相关资料,仅此而已,考虑清楚哦!
|