【原】留下还是放弃，是一个问题。

健明 2021-07-14

展开全文

关于如何 call variant 的问题之前在博客已经陆续整理过一些文章，但每次分析需要call variant的数据都会在如何 filter 的问题上陷入短暂的无助和恐慌。

涉及到 call variant 的问题，不管后续分析目的是什么，GATK 和 bcftools 都有几个比较通用的标准，也就是若干和质量相关的筛选标准。但是随着后续分析用途的不同，筛选的标准又有差别，很难说有一个非常合适的金标准。

常用标准

GATK filter

#gatk snps
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test
Chr1 55398 . A G 56.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.37;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:84,6,0
Chr1 158267 . C G 41.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;ClippingRankSum=0.000;DP=5;ExcessHet=3.0103;FS=3.979;MLEAC=1;MLEAF=0.500;MQ=50.22;MQRankSum=-1.645;QD=8.35;ReadPosRankSum=0.524;SOR=2.245 GT:AD:DP:GQ:PL 0/1:3,2:5:70:70,0,113

以下是GATK论坛官方推荐

For SNPs:

QD<2.0
MQ<40.0
FS>60.0
SOR>3.0
MQRankSum<-12.5
ReadPosRankSum<-8.0

For indels:

QD<2.0
ReadPosRankSum<-20.0
InbreedingCoeff<-0.8
FS>200.0
SOR>10.0

一般会再上面的基础加一个 QUAL<30.0，有些公司使用的参数则是 QUAL<30.0||QD<2.0||FS>60.0||MQ<40.0，另外在全基因组测序数据中，通常会根据基因覆盖度增加测序深度的筛选条件。比如 DP>10

关于一些参数的官方解释和解读如下

QUAL - quality

Phred格式(Phred_scaled)的质量值，该位点存在variant的可能性；该值越高，则variant的可能性越大。Phred值 = -10 * log (1-p)，其中p为variant存在的概率; 值为10的表示错误概率为0.1，该位点为variant的概率为90%。 Q=20时，错误率就控制在了0.01。

QualByDepth (QD)

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

之所以要用总的qual除以AD，是因为qual和AD成一定程度正比关系，这样可以消除深度对于分数的影响。

FisherStrand (FS)

Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

GATK中这个数据对应samtools的DP4，主要是为了消除正负链的偏好性。samtools 中的 Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)" 和该参数也相关。

RMSMappingQuality (MQ)

This is the Root Mean Square of the mapping quality of the reads across all samples.

MappingQualityRankSumTest (MQRankSum)

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls. 只针对杂合

ReadPosRankSumTest (ReadPosRankSum)

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

StrandOddsRatio (SOR)

The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. Higher values indicate more strand bias indel类型数据中去除链bias

samtools bcftools

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R02
Chr1 31071 . A G 225 . DP=38;VDB=0.823895;SGB=-0.693143;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,16,22;MQ=60 GT:PL 1/1:255,114,0

DP4[2]+DP4[3]>=5&&MQ>=40&&QUAL>=30

需要注意的是，GATK 相比于samtools 更容易找到杂合突变。

可选过滤条件

目前在一些文献和介绍中，已经见得过滤筛选条件还有以下几种

MAF和minor allele count

对于二倍体样本：基因型GT 0/1 表示样本为杂合子，Allele（AC）为1（二倍体样本在该位点只有1个等位基因发生突变），Allele频率（AF）为0.5（二倍体的样本在该位点只有50%等位基因发生突变)，总Allele（AN）为2；基因型 1/1 表示样本为纯合，Allele（AC）为2，Allele频率（AF）为1，总Allele（AN）为2。等位基因频率的计算方法是 the number of times an allele appears over all individuals at that site, divided by the total number of non-missing alleles at that site.

Minor allele frequency (MAF) 定义：refers to the frequency at which the second most common allele occurs in a given population.

例如

MAF/MinorAlleleCount: C=0.1506/754 (1000 Genomes)
a. C, is the minor allele for that particular locus
b. 0.1506, is the frequency of the C allele (MAF), it means 15% within the 1000 Genomes database.
c. 754, is the number of times this SNP has been observed in the population of the study.

These criteria include filtering for mean read depth across all samples covering a SNP to be at least four reads, for minor allele frequency greater than 0.35, for SNP quality >30, and for SNPs that are not within 35 bp of another SNP.
VCF files were first filtered for depth (four reads per genotype covering base of interest) using a custom python script. Then vcftools (Danecek et al. 2011) was run with the following parameters: "–maf 0.35 –minQ 30 –remove-filtered-all –remove-indels –recode." Finally,

BSA 分析中SNP过滤的方法

SNP 间距

实例一

SNPs within 35 bp of another called SNP were filtered out using a custom python script.

实例二

Unigenes that showed apparent SNP densities of greater than 5 SNPs/kb were considered artefactual (or paralogous) and excluded from the analysis (the SNP density between LDN and RSL65 had been experimentally determined as 2.2 SNPs/kb).

SNP和indel距离

通常indel附近的SNP不准确，会考虑滤掉和indel相近的snp。

实例一

false positive SNPs due to local read mis-alignments due to indels. It will remove SNP calls at or within 1 bp of an indel's position (as reported in mpileup)

实例二

一篇文献中的RNAseq和call snp 参数

samtools mpileup -R -d 1000000 -t DP,AN -Q 0 -Bug -f tomato.SL2.50.fa ejmt.sorted.redup.bam | bcftools call -mv -O z > ejmt.samtools.raw.vcf
bcftools filter -g 100 -e "MIN(MQ<50)" ejmt.samtools.raw.vcf | bcftools view -m 2 -M 2 -v snps -o ejmt.samtools.filtered.vcf

针对多倍体的过滤方式

针对多倍体数据，有人开发了Sliding Window Extraction of Explicit Polymorphisms 过滤方式

文献 SWEEP: Sliding Window Extraction of Explicit Polymorphisms.

如果处理多倍体数据可以参考。

One more thing

除了上述过滤思路，以下内容也需要注意！

测序深度不均一

带来假阳性的原因可能是来自与测序覆盖程度不均一，所以需要evaluate whole genome population based variant calling。

低复杂度区域过滤

根据文献 , 所谓**低复杂区域主要是由那些基因组上的局部重复区域组成，这些区域可以使用RepeatMasker或者mdust 来寻找，另外寻找重复区域的突变可以考虑使用类似于lobSTR 这类专门设计的工具。

根据已有的一些测试结果，删除重复局域对于snp的影响不是非常明显，但是对于indel非常明显。重复区域影响indel的原因有两个，有一些1bp的indel位于 long poly-A or poly-T runs，而这可能是由于Potential PCR amplification errors 引起的；另外，基因组的重复区域在mapping过程中有multi-alignment会导致假阳性产生。如下图所示：

高深度位点过滤

对于一个正常的思维，我们要剔除的是那些覆盖度很低的突变位点，为什么深度非常高的位点也要剔除呢？

因为很多情况下，这些假阳性是由于参考基因组里面本来没有的拷贝数变异和旁系同源序列引起的。已有的相关测试表示，通过删除深度位点，可以剔除大量的假阳性杂合子，尤其是在使用BWA-MEM进行mapping的时候。

看到这里就需要注意使用什么方法对数据进行mapping的问题，使用bowtie2进行mapping时会给那些多位置比对的序列更低的分数，有助于降低假阳性，但是因为其对mapping quality 的打分过于保守，很可能会带来假阴性。而且不同的比对软件和找突变的方法会对indel结果有较大的影响。如下文引用部分所述。

HaplotypeCaller always called more ≥15bp INDELs from the BWA-MEM alignment (data not shown). Other callers made three times as many ≥15bp deletion calls from the BWA-MEM alignment, either in LCRs or not, and called 40% more insertions outside LCRs. Interestingly, except HaplotypeCaller, others called more ≥15bp insertions from the Bowtie2 alignment in LCRs instead.

基于以上原因，应该对那些深度很深的突变位点进行删除，但是如果本身测序深度不高而且覆盖度不均一的时候最好不要这么做，例如外显子测序数据。为了更合理地删除那些假阳性的高深度数据，我们可以结合QUAL进行删除。

深度阈值 depth-cutoff = $average_depth+4*\sqrt{average_depth}$

筛选条件 QUAL<(depth-cutoff*2.0)&&DP>depth-cutoff

以水稻举例，日本晴材料基因组大小为370M左右，如果测序样本的read数在60M，处理后read长度120bp，则平均测序深度约为20X。拟南芥基因组大小约为110M。

另外，针对旁系同源和多拷贝基因座带来的干扰问题，samtools 中引入了 Base Alignment Quality (BAQ)，如果不想使用这个参数可以添加 -B参数，目前看到有BSA相关文章就使用了该参数，不进行BAQ校正。

参考资料

http:///2014/05/12/wgs-trio-variant-evaluation/

http:////filtering/

Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples

在生信技能树公众号历史消息中搜索 “变异”, 查看更多相关文章。