GATK4 对于体细胞突变和生殖细胞突变的检测分别给出了对应的pipeline:
本篇主要关注生殖细胞突变的分析流程 图中红色方框部分的从Analysis-Ready Bam 到,主要包括以下4步
官网给出了6套用于参考的pipeline 主要分析步骤都差不多,这里我选择第4个通用的流程 ,网址如下
1. HaplotyperCaller in GVCF mode对于每个样本,采用 gatk --java-options "-Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" HaplotypeCaller -R ${ref_fasta} -I ${input_bam} -L ${interval_list} -O ${output_filename} -contamination 0 -ERC GVCF
2. ImportGenomicsDB Consolidate GVCFs将所有样本的gvcf文件合并,产生一个总的gvcf文件,命令如下: gatk --java-options -Xmx2G MergeVcfs --INPUT ${sep=' --INPUT ' input_vcfs} --OUTPUT ${output_filename} 3. GenotypeGVCFs包括两个步骤,第一步,导入 gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport --genomicsdb-workspace-path ${workspace_dir_name} --batch-size ${batch_size} -L ${interval} --sample-name-map ${sample_name_map} --reader-threads 5 -ip 500
第二步, 运行 gatk --java-options "-Xmx5g -Xms5g" GenotypeGVCFs -R ${ref_fasta} -O ${output_vcf_filename} -D ${dbsnp_vcf} -G StandardAnnotation --only-output-calls-starting-in-intervals --use-new-qual-calculator -V gendb://$WORKSPACE -L ${interval} 需要注意-V 参数,指定的是 4. Filter Variants by Variabt Recalibration第一步,过滤vcf文件,条件为 gatk --java-options "-Xmx3g -Xms3g" VariantFiltration --filter-expression "ExcessHet > ${excess_het_threshold}" --filter-name ExcessHet -O ${variant_filtered_vcf_filename} -V ${vcf}
第二步,删除vcf文件中的基因型信息,命令如下 gatk --java-options "-Xmx3g -Xms3g" MakeSitesOnlyVcf --INPUT ${variant_filtered_vcf_filename} --OUTPUT ${sites_only_vcf_filename} 第三步,合并不同区间的vcf文件,并建立索引 gatk --java-options "-Xmx6g -Xms6g" GatherVcfsCloud --ignore-safety-checks --gather-type BLOCK --input ${inputs.list} --output ${output_vcf_name}
gatk --java-options "-Xmx6g -Xms6g" IndexFeatureFile --feature-file ${output_vcf_name}
第四步,分别评估SNP和INDEL突变位点的质量, 命令如下 gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator -V ${sites_only_variant_filtered_vcf} -O ${recalibration_filename} --tranches-file ${tranches_filename} --trust-all-polymorphic -tranche ${sep=' -tranche ' recalibration_tranche_values} -an ${sep=' -an ' recalibration_annotation_values} -mode INDEL --max-gaussians 4 -resource mills,known=false,training=true,truth=true,prior=12:${mills_resource_vcf} -resource axiomPoly,known=false,training=true,truth=false,prior=10:${axiomPoly_resource_vcf} -resource dbsnp,known=true,training=false,truth=false,prior=2:${dbsnp_resource_vcf}
gatk --java-options "-Xmx100g -Xms100g" VariantRecalibrator -V ${sites_only_variant_filtered_vcf} -O ${recalibration_filename} --tranches-file ${tranches_filename} --trust-all-polymorphic -tranche ${sep=' -tranche ' recalibration_tranche_values} -an ${sep=' -an ' recalibration_annotation_values} -mode SNP --sample-every-Nth-variant ${downsampleFactor} --output-model ${model_report_filename} --max-gaussians 6 -resource hapmap,known=false,training=true,truth=true,prior=15:${hapmap_resource_vcf} -resource omni,known=false,training=true,truth=true,prior=12:${omni_resource_vcf} -resource 1000G,known=false,training=true,truth=false,prior=10:${one_thousand_genomes_resource_vcf} -resource dbsnp,known=true,training=false,truth=false,prior=7:${dbsnp_resource_vcf}
gatk --java-options "-Xmx5g -Xms5g" ApplyVQSR -O tmp.indel.recalibrated.vcf -V ${input_vcf} --recal-file ${indels_recalibration} --tranches-file ${indels_tranches} --truth-sensitivity-filter-level ${indel_filter_level} --create-output-variant-index true -mode INDEL
gatk_path --java-options "-Xmx5g -Xms5g" ApplyVQSR -O ${recalibrated_vcf_filename} -V tmp.indel.recalibrated.vcf --recal-file ${snps_recalibration} --tranches-file ${snps_tranches} --truth-sensitivity-filter-level ${snp_filter_level} --create-output-variant-index true -mode SNP
|
|