【原】最新最全的mutect2教程

健明 2021-07-14

展开全文

GATK的Mutect2流程一直在变动，主要是GATK本身也更新频率有点高，所以基本上大家看到的教程很快就过时了，follow起来都是错误连连。现在这个教程的时间是：2020-09-22 （只能保证说未来半年内可能是ok的）

也就是说我搜索到了一个4小时前的教程，取代了之前的一个月前的教程，这，生活太苦了。

https://gatk./hc/en-us/articles/360035889791?id=11136
https://gatk./hc/en-us/articles/360035531132

其它参考资料包括：

https://gatk./hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2
https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
http://rstudio-pubs-static.s3./491640_6c12de260e734538b01777c0cc4ff92b.html
https://gatk./hc/en-us/articles/360035531852-Intervals-and-interval-lists

这样的资料是谷歌云数据文件（大约几百个G的磁盘空间消耗）：

https://console.cloud.google.com/storage/browser/gatk-best-practices
https://console.cloud.google.com/storage/browser/genomics-public-data

上来就摆这么多参考资料，可想而知最新最全的mutect2教程是多么的难写！

背景知识

体细胞突变（somatic mutation）是指患者某些组织或者器官后天性地发生了体细胞变异，虽然它不会遗传给后代个体，却可以通过细胞分裂，遗传给子代细胞。体细胞突变对肿瘤的发生发展有关键性的作用，并且它也是制定肿瘤癌症靶向治疗措施的关键所在。

NGS使体细胞变异的检测更加全面，成本更低，在检测多种体细胞变异上具有很大的优势，但在使用过程中还存在着挑战：如**样品降解、覆盖度不足、遗传异质性和组织污染（杂质)等问题。为应对以上挑战，降低错误率，科学家采取了不同的算法和统计模型用于检测体细胞突变。目前最受欢迎的有Varscan、SomaticSniper、 Strelka 和MuTect2 **。这些软件大都是直接对肿瘤-正常样本的每个位点进行比较，对肿瘤样本中明显高于正常样本的次等位基因进行标记，作为体细胞变异，同时排除种系突变和杂合性丢失（LOH）情况。虽然这些软件具有较高的引用率，并在不断地更新，但仍存在不足：

a 、缺乏完整可靠的实验来评估检测结果；
b、缺乏金标准，不能保证检测到的灵敏度和特异性最高；
c、在实际应用中，各软件的相对优缺点在很大程度上是未知的。

下面是TCGA计划采取的软件：

MuSE
varscan
MuTect
SomaticSniper

大家可以去下载到TCGA计划的这4个软件输出的maf文件格式的somatic突变信息文件哦。

第一步：构建PoN

First, run gatk Mutect2 in tumor-only mode on each normal sample to call all detectable variants: Needs an interval list. 可以使用gatk 的 BedToIntervalList和PreprocessIntervals命令来根据bed制作interval list.，注意一下wgs和wes的差异即可。

最新版的GATK里面的CreateSomaticPanelOfNormals命令更新了，需要下面3个步骤完成 panel of normal (PoN) 流程

Step 1: Run Mutect2 in tumor-only mode for each normal sample:

把所有的normal的bam文件的路径整理好，然后批量运行，脚本如下：

## $1 for the config file, such as : normal_bam_for_pon.txt ## $2 and $3 for submit jobs.# for i in {0..5};do sbatch run_PoN-step1.sh normal_bam_for_pon.txt 6 $i;done# conda activate qcGENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta GATK=/home/jianmingzeng/biosoft/gatk/gatk-4.1.8.1/gatk cat $1 |while read normal_bamdoecho $normal_bamif((i%$2==$3))then id=$(basename "$normal_bam" _recal.bam) $GATK Mutect2 \ -R ${GENOME} \ -I $normal_bam \ --max-mnp-distance 0 \ -O ${id}_mutect2_normal.vcf.gz \ 1> ${id}_mutect2_normal.log 2>&1

fii=$((i+1))done

一般来说，批量运行结束后输出如下的文件，如果你的测序数据很稳定，那么基本上每个normal的wes数据走完流程的文件大小是类似的：

71M Sep 15 16:30 N1092_mutect2_normal.vcf.gz61M Sep 15 15:40 N1146_mutect2_normal.vcf.gz50M Sep 15 15:41 N1254_mutect2_normal.vcf.gz57M Sep 15 15:30 N1257_mutect2_normal.vcf.gz54M Sep 15 15:18 N1268_mutect2_normal.vcf.gz52M Sep 15 22:10 N1271_mutect2_normal.vcf.gz55M Sep 15 20:09 N1281_mutect2_normal.vcf.gz

Step 2: Create a GenomicsDB from the normal Mutect2 calls:

这个时候需要合并上述每个单独normal的wes数据得到的vcf文件啦，需要指定wes的范围文件（bed格式），使用GenomicsDBImport命令进行合并操作，代码如下：

GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta GATK=/home/jianmingzeng/biosoft/gatk/gatk-4.1.8.1/gatk

interval=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list

# normal_bam_for_pon.txt${GATK} GenomicsDBImport \-R ${GENOME} \-L ${interval} \--merge-input-intervals TRUE \$(ls *.vcf.gz | awk '{print "-V "$0" "}') \--genomicsdb-workspace-path pon_db \1> GenomicsDBImport.log 2>&1

Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:

GenomicsDBImport命令的合并输出的是临时文件夹pon_db，接下来仍然是需要使用衔接命令CreateSomaticPanelOfNormals来输出真正的pon.vcf.gz 文件。

这个时候还加入了somatic-hg38_af-only-gnomad文件，来源于GATK的官方谷歌云下载中心。

gnomad=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/somatic-hg38_af-only-gnomad.hg38.vcf ${GATK} CreateSomaticPanelOfNormals \-R ${GENOME} \--germline-resource ${gnomad} \-V gendb://pon_db \-O pon.vcf.gz \1> CreateSomaticPanelOfNormals.log 2>&1

In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).

走这3个步骤，就是为了生成 pon.vcf.gz 文件，供后续使用！

第二步：call mutations in the tumor in somatic mode.

主要是 Mutect2 命令，以及 FilterMutectCalls 命令。

首先看Mutect2 命令的代码，前面步骤生成 pon.vcf.gz 文件这个时候就利用上来了，需要制作一个config文件，配合下面的脚本，主要是3列信息：

第一列是肿瘤命名
第二列是肿瘤病人的normal组织的bam文件地址
第三列是肿瘤病人的肿瘤组织的bam文件地址。

代码如下：

GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta reference=$GENOMEGATK=/home/jianmingzeng/biosoft/gatk/gatk-4.1.8.1/gatk cat $config_file |while read iddoarr=($id)normal_bam=${arr[1]}tumor_bam=${arr[2]}sample=${arr[0]} # 上面是解析配置文件config time $GATK Mutect2 -R $reference \ -I $tumor_bam -tumor $(basename "$tumor_bam" _recal.bam) \ -I $normal_bam -normal $(basename "$normal_bam" _recal.bam) \ -pon pon.vcf.gz \ -O ${sample}_mutect2.vcf

done ## end for $config_file

这样的话，每个样本大约可以拿到四五千左右的可能的somatic位点，需要进行一定程度的过滤操作。（去年我的教程过滤很简单，就是FilterMutectCalls 命令而已，现在它已经成为了辣鸡，不用它了。）

这个时候的过滤需要两个文件的配合，首先是read-orientation-model.tar.gz，其次是 contamination.table和segments.table。

真恶心，辣鸡软件！不用了···

后面我看了看教程，整理了就放弃了，浪费时间，浪费生命！真恶心，辣鸡软件！不用了···

制作contamination.table和segments.table

真恶心，辣鸡软件！不用了···

参考：https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf

需要 GetPileupSummaries以及CalculateContamination命令：

https://gatk./hc/en-us/articles/360037593451-GetPileupSummaries
https://gatk./hc/en-us/articles/360036888972-CalculateContamination

同样的解析配置文件config，走下面的代码：

done ## end for $config_file

制作read-orientation-model.tar.gz文件

真恶心，辣鸡软件！不用了···

执行 FilterMutectCalls 命令

真恶心，辣鸡软件！不用了···

GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta reference=$GENOMEGATK=/home/jianmingzeng/biosoft/gatk/gatk-4.1.8.1/gatk GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatkls *_mutect2.vcf |while read iddosample=$(basename "$id" _mutect2.vcf)$GATK FilterMutectCalls -R $reference \ -V $id \ -O ${sample}_filtered.vcf done

会报错，gatk官方论坛的意思是，在集群运行的过程中，会丢失后缀为.vcf.stats的文件，所以FilterMutectCalls 命令失败。如下：

A USER ERROR has occurred: Mutect stats table _mutect2.vcf.stats not found. When Mutect2 outputs a file calls.vcf it also creates a calls.vcf.stats file. Perhaps this file was not moved along with the vcf, or perhaps it was not delocalized from a virtual machine while running in the cloud.

真恶心，辣鸡软件！不用了···

我们我敢抛弃mutect2呢

虽然说，GATK是人类数据的首选工具，但是GATK的Mutect2流程并不是唯一的找somatic mutation选择。简单搜索几个关于的综述或者测评工具：

2016年文章：Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data，链接是：https://journals./plosone/article?id=10.1371/journal.pone.0151664

2018年文章：A review of somatic single nucleotide variant calling algorithms fornext-generation sequencing data，链接是https://www./science/article/pii/S2001037017300946