[Original] Finding variant sites in microbial DNA sequencing data

健明 2022-04-19


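Any microbe, whether archaea, bacteria, fungi, or a virus, can be analyzed this way as long as it has its own reference genome: align the reads against the reference and look for the positions where the sample differs from it. Those differences are what we commonly call variant sites.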

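Take SARS-CoV-2, the pathogen behind the pandemic of the past three years, as an example. The paper "Genomic Diversity of Severe Acute Respiratory Syndrome–Coronavirus 2 in Patients With Coronavirus Disease 2019" lays out a workflow for finding variant sites in microbial sequencing data. It relies on four main tools, with the following steps: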

Clean reads were mapped to the reference genome of SARS-CoV-2 (GenBank MN908947.3) using BWA mem software (version 0.7.12).
Duplicate reads were removed using Picard software (http://broadinstitute./picard; version 2.18.22).
The mpileup file was generated using samtools software (version 1.8).
Intrahost variants were identified using VarScan software (version 2.3.9).
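The four steps above could be chained roughly as follows. This is only a sketch under several assumptions: paired-end FASTQ input, placeholder file names, `PICARD_JAR` and `VARSCAN_JAR` pointing at locally downloaded jar files, and the VarScan `mpileup2snp` subcommand with VCF output (the paper does not state which subcommand or thresholds were used). The `run` and `run_to` helpers are hypothetical conveniences that print each command and skip execution when `DRY_RUN=1`.

```shell
# Hypothetical helpers: print each command, and execute it unless DRY_RUN=1.
run()    { printf '%s\n' "+ $*"; [ "${DRY_RUN:-0}" = "1" ] || "$@"; }
run_to() { dest=$1; shift; printf '%s\n' "+ $* > $dest"; [ "${DRY_RUN:-0}" = "1" ] || "$@" > "$dest"; }

# Usage: four_step_pipeline ref.fasta reads_1.fq.gz reads_2.fq.gz sample
four_step_pipeline() {
    ref=$1; r1=$2; r2=$3; s=$4
    # 1. Index the reference, map the clean reads, and coordinate-sort.
    run bwa index "$ref" &&
    run_to "$s.sam" bwa mem "$ref" "$r1" "$r2" &&
    run samtools sort -o "$s.sorted.bam" "$s.sam" &&
    # 2. Remove duplicate reads with Picard (legacy KEY=value syntax).
    run java -jar "${PICARD_JAR:-picard.jar}" MarkDuplicates \
        I="$s.sorted.bam" O="$s.dedup.bam" M="$s.dup_metrics.txt" \
        REMOVE_DUPLICATES=true &&
    # 3. Generate the text pileup that VarScan reads.
    run_to "$s.mpileup" samtools mpileup -f "$ref" "$s.dedup.bam" &&
    # 4. Call SNVs; the subcommand and output format are assumptions.
    run_to "$s.varscan.vcf" java -jar "${VARSCAN_JAR:-VarScan.jar}" \
        mpileup2snp "$s.mpileup" --output-vcf 1
}
```

Running `DRY_RUN=1 four_step_pipeline MN908947.3.fasta S1_1.fq.gz S1_2.fq.gz S1` prints the six commands without executing them, which is a cheap way to check paths before committing compute time.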

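For these four steps the exact choice of software matters little: alignment can be done with BWA, bowtie, or similar tools, there are a dozen or more tools for removing PCR duplicates, and just as many variant callers. In fact, the steps can simply be merged into one script, with the variant calling itself reduced to a single pipeline: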

samtools mpileup -ugf ref.fasta test.bam | bcftools call -vmO z -o test.bcftools.vcf.gz

To understand the command above, you will need to read the documentation of the tools yourself:

http://www./doc/samtools.html
https://samtools./bcftools/bcftools.html
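As a rough reading aid for the samtools/bcftools one-liner: `-f` names the reference FASTA, `-g` and `-u` ask `samtools mpileup` for genotype likelihoods as uncompressed BCF, and on the `bcftools call` side `-m` selects the multiallelic caller, `-v` keeps variant sites only, and `-O z -o` writes a compressed VCF. Newer samtools releases moved BCF output from `samtools mpileup` to `bcftools mpileup`, so an equivalent pipeline might look like the sketch below, assuming a recent bcftools and placeholder file names. The `count_variant_types` helper is a hypothetical addition for a quick sanity check and is not part of either tool.

```shell
# Call variants for one BAM using bcftools only (sketch; names are placeholders).
# Usage: call_variants ref.fasta test.bam test.bcftools.vcf.gz
call_variants() {
    bcftools mpileup -Ou -f "$1" "$2" |
        bcftools call -m -v -Oz -o "$3"
}

# Hypothetical helper: read plain-text VCF on stdin and tally records.
# A record counts as "indel" if any ALT allele differs in length from REF;
# everything else (including same-length substitutions) counts as "snp".
count_variant_types() {
    awk -F'\t' '
        /^#/ { next }
        {
            kind = "snp"
            n = split($5, alt, ",")
            for (i = 1; i <= n; i++)
                if (length(alt[i]) != length($4)) kind = "indel"
            count[kind]++
        }
        END { printf "snp=%d indel=%d\n", count["snp"], count["indel"] }
    '
}
```

After calling, something like `gzip -dc test.bcftools.vcf.gz | count_variant_types` gives a one-line summary. For haploid genomes such as bacteria or viruses, `bcftools call` also has a `--ploidy` option; check the bcftools documentation for the values your version accepts.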

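Finding variant sites, however, is only the first step of NGS data analysis. The reference genome is large, and mutations fall in different functional regions of different genes, so what really matters is deciding whether a given mutation is biologically meaningful, followed by downstream statistics and visualization. This is especially true for microbes, where there are often thousands of sequenced samples and what we care about is the population-level picture: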

Figure: an overall view of the mutations
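One simple way to start on a population-level view is to count how many samples carry each variant. The sketch below assumes one uncompressed VCF file per sample and identifies a variant by chromosome, position, REF, and ALT; the function name `variant_recurrence` is hypothetical.

```shell
# For every variant, count how many of the given per-sample VCF files contain it.
# Output: "<number of samples><TAB><chrom>:<pos>:<ref>><alt>", most frequent first.
variant_recurrence() {
    awk -F'\t' '
        /^#/ { next }                       # skip VCF header lines
        {
            key = $1 ":" $2 ":" $4 ">" $5
            # count each variant at most once per file (i.e. per sample)
            if (!((FILENAME, key) in seen)) {
                seen[FILENAME, key] = 1
                samples[key]++
            }
        }
        END { for (k in samples) print samples[k] "\t" k }
    ' "$@" | sort -k1,1nr -k2,2
}
```

Called as `variant_recurrence sample1.vcf sample2.vcf ...`, it lists the most widely shared variants first, which is usually the table you want before any plotting. Compressed files would need to be decompressed beforehand.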

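For human DNA or RNA sequencing data, mutations are usually called with the GATK workflow. The upstream BAM file depends on your aligner. Once you have a BAM file, for example from aligning RNA-seq data with hisat2 or STAR, the following steps are needed: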

sambamba markdup -
$gatk SplitNCigarReads -
$gatk SplitNCigarReads 
$gatk AddOrReplaceReadGroups
$gatk HaplotypeCaller
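The abbreviated steps above might be fleshed out as in the sketch below. It assumes GATK4-style arguments, placeholder file names and read-group values, and a reference FASTA that already has its `.fai` index and `.dict` sequence dictionary. The `samtools index` step and the `--dont-use-soft-clipped-bases` flag are additions based on common RNA-seq practice, not something stated in the original steps, and `run` is a hypothetical helper that prints each command and skips execution when `DRY_RUN=1`.

```shell
gatk=${gatk:-gatk}   # path to the GATK4 launcher, matching $gatk in the steps above
run() { printf '%s\n' "+ $*"; [ "${DRY_RUN:-0}" = "1" ] || "$@"; }

# Usage: rnaseq_variants aligned.bam ref.fasta sample
rnaseq_variants() {
    bam=$1; ref=$2; s=$3
    # Mark duplicate reads.
    run sambamba markdup "$bam" "$s.markdup.bam" &&
    # Split reads that span splice junctions (N in the CIGAR string).
    run "$gatk" SplitNCigarReads -R "$ref" -I "$s.markdup.bam" -O "$s.split.bam" &&
    # Attach read-group tags, which HaplotypeCaller requires (values are placeholders).
    run "$gatk" AddOrReplaceReadGroups -I "$s.split.bam" -O "$s.rg.bam" \
        --RGID "$s" --RGLB lib1 --RGPL ILLUMINA --RGPU unit1 --RGSM "$s" &&
    run samtools index "$s.rg.bam" &&
    # Call germline variants.
    run "$gatk" HaplotypeCaller -R "$ref" -I "$s.rg.bam" -O "$s.vcf.gz" \
        --dont-use-soft-clipped-bases
}
```

`DRY_RUN=1 rnaseq_variants sample.bam hg38.fa sample` prints the five commands so the arguments can be reviewed first.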

For tumor DNA sequencing data, the variants of interest are usually somatic mutations, and the workflow is:

Sequencing reads were aligned to the human genome (hg19) using BWA.
Duplications were marked using Picard Tools.
Insertion–deletion realignment and base recalibration were achieved using GATK.
Somatic variants were detected using an ensemble approach with two variant callers: MuTect2
Variant annotation was performed using ANNOVAR.
Variants below 4% (presumably variant allele frequency) and likely artifacts were removed, keeping only nonsynonymous single nucleotide variants, stop gain mutations, and frameshift mutations.
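The final filtering step can be sketched with awk. The input layout here is an assumption made for illustration: a tab-separated table with a header row, a variant ID in column 1, an ANNOVAR-style exonic function label in column 2 (values such as `nonsynonymous SNV`, `stopgain`, `frameshift deletion`), and the variant allele fraction in column 3. The function name `filter_somatic` and the `MIN_VAF` variable are hypothetical, and artifact removal is not covered.

```shell
# Keep the header plus rows that pass both filters:
#   - allele fraction of at least MIN_VAF (default 0.04, i.e. 4%)
#   - exonic function is nonsynonymous SNV, stopgain, or a frameshift indel
# Usage: filter_somatic annotated.tsv > filtered.tsv
filter_somatic() {
    awk -F'\t' -v min_vaf="${MIN_VAF:-0.04}" '
        NR == 1 { print; next }
        $3 + 0 >= min_vaf &&
        ($2 == "nonsynonymous SNV" || $2 == "stopgain" || $2 ~ /^frameshift /)
    ' "$1"
}
```

Anchoring the pattern at `^frameshift ` keeps `frameshift insertion` and `frameshift deletion` while excluding the `nonframeshift` labels. A real ANNOVAR output has many more columns, so the column numbers would need adjusting.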

Reference: "Genomic profiling reveals heterogeneous populations of ductal carcinoma in situ of the breast"