科研蘑菇头, posted 2022-11-19 from Shanghai

The Importance of Coverage Uniformity Over On-Target Rate for Efficient Targeted NGS

Personal study notes; corrections to any translation errors are welcome.

Background

Next-generation sequencing (NGS) has become the technique of choice for variant detection in both research and clinical settings. Although the cost of sequencing is steadily decreasing, large-scale, whole genome sequencing is still prohibitively expensive, so investigations often focus on specific genes and loci using targeted sequencing (Dillon et al. 2018).

Targeted sequencing relies on enrichment of genomic regions of interest prior to sequencing. In exome sequencing, for example, biotinylated synthetic DNA probes are designed to hybridize to exon regions. Following hybridization with a genomic DNA sample, the probe-bound fragments are purified to produce a sample that is enriched for the exon regions. Although target enrichment can reduce sequencing costs and make experiments more feasible and focused, it also introduces biases that compromise the efficiency of the sequencing effort (Goldfeder et al. 2016, Meynert et al. 2013, 2014).

While some inefficiency is unavoidable due to the stochastic nature of targeted NGS, much of it is inherent to the design and production of target enrichment probe panels (Warr et al. 2015). Some probes cross-hybridize to non-target regions, leading to “off-target” (non-specific) capture. Probe panels may also have imbalances in capture efficiency (lack of uniformity) that lead to over-enrichment of some targets and under-enrichment of others. To ensure high-confidence data, researchers must increase the amount of sequencing to boost coverage of areas with low read depth. This strategy, however, leads to over-sequencing of otherwise adequately covered regions, which in turn results in higher sequencing costs and reduced efficiency.

The extent of this “wasted sequencing” is reflected in the uniformity and on-target rate, two metrics that describe the overall efficiency of targeted sequencing. In this paper, we use ranges of on-target rate and uniformity typical of commercial exome kits to mathematically model the relative impacts of both metrics on overall efficiency. We demonstrate that, though most commercial probe panels cite only on-target rate in their specifications, uniformity has a more significant contribution to the efficiency of targeted sequencing.

Parameters for Evaluating Sequencing

When designing a sequencing experiment, a fundamental task is to determine how many reads are required per sample for actionable data (read coverage). The answer determines the costs, feasibility, number of samples to include, and ultimately the study power to reach meaningful conclusions. Different applications require different read coverage: for example, whereas information from ten reads that align over a given position (10x coverage) may suffice for a call of germline variation in a research setting, this number would be inadequate for a confident call of somatic mutation in a clinical setting. We refer to the desired coverage as CD and the mean coverage actually observed in the experiment as CM.

An ideal sequencing experiment would generate reads that are distributed equally and exclusively across target regions (perfect uniformity and on-target capture, respectively). The rest of the genome would be devoid of reads (Figure 1A). In this ideal scenario, sequencing efficiency would be 100%, and CM would equal CD. Non-uniform and off-target capture are inevitable, however, and they lead to variable coverage (Figure 1B).

Figure 1. Read distribution. A. Read distribution in an ideal experiment, where all targets have specific and equal read depth, and non-target regions are free of reads. In this situation, CM=CD. B. Representation of a realistic distribution of coverage, where some targets are under-sequenced, others are over-sequenced, and off-target regions are also captured. 

To ensure coverage of most targeted regions reaches CD, the amount of sequencing is often increased such that CM >> CD (Figure 1B). This strategy, however, wastes a considerable fraction of sequencing reads. The CM/CD ratio represents the amount of over-sequencing needed to ensure a certain percentage of targets reach CD: the larger the ratio, the more over-sequencing will be required to get enough usable data. Optimizing the efficiency of targeted NGS, therefore, involves minimizing the CM/CD ratio without compromising results.

Uniformity and Fold-80

Uniformity describes the read distribution along target regions of the genome. Uniform coverage reduces the amount of sequencing required to reach a sufficient depth of coverage for all regions of interest. Uniformity is a measure of the spread around the CM and is estimated from the mean and quantiles of the read distribution (Figure 2).

Figure 2. Uniformity reflects distribution shape. Two different hypothetical read distribution profiles showing low (green) and high (gray) fold-80 scores and the relative abundance of reads mapping back to over- and under-sequenced regions. Lowering the fold-80 score (gray curve to green curve) both rescues under-sequenced regions and reduces the fraction of over-sequenced regions for more efficient read utilization. In reality, poor uniformity often shows less symmetric distributions.
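
As a minimal sketch of how a fold-80 score can be computed from per-base depths, the Python snippet below divides the mean depth by the depth that at least 80% of target bases reach (the 20th percentile). The function name `fold_80`, the nearest-rank percentile rule, and the example depth lists are illustrative assumptions, not part of the original analysis.

```python
import math

def fold_80(depths):
    """Return mean depth divided by the 20th-percentile depth (nearest-rank)."""
    if not depths:
        raise ValueError("depths must not be empty")
    ordered = sorted(depths)
    n = len(ordered)
    # Nearest-rank 20th percentile: at least 80% of bases reach this depth.
    rank = max(1, (20 * n + 99) // 100)  # integer form of ceil(0.2 * n)
    p20 = ordered[rank - 1]
    if p20 == 0:
        # Scaling up sequencing cannot lift a zero depth to the mean.
        return math.inf
    return (sum(ordered) / n) / p20

# Hypothetical per-base depths over a small target (both have a mean of 30).
uniform = [30] * 10
skewed = [10, 15, 20, 25, 30, 30, 35, 40, 45, 50]

print(fold_80(uniform))  # 1.0
print(fold_80(skewed))   # 2.0
```

Both example profiles have the same mean depth, so the difference in score comes only from the spread of coverage.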

On-Target Rate

On-target rate describes the percentage of sequencing data that maps to target regions; conversely, off-target rate refers to the sequencing data that maps to other regions (Figure 1B). On-target rate is typically expressed as the ratio of the number of sequenced bases covering the target regions to the total number of mapped bases output by the sequencer (Figure 3). Some off-target sequencing is inevitable; a considerable proportion of it is probe panel-specific and can be due to promiscuous hybridization.

Figure 3. On-target rate is the proportion of the sequencing effort that maps to targeted regions. In calculating on-target rate, the entire sequencing effort (∑ALL) is represented by the area under the sequencing curve, and the on-target area (∑On-Target) is represented by the green area. Here, off-target sequencing is indicated by the arrow.
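
The ratio described above can be sketched in a few lines of Python. The snippet counts how many mapped bases fall inside target intervals and divides by all mapped bases. The interval representation (half-open `(chrom, start, end)` tuples), the function names, and the example coordinates are assumptions made for illustration; real pipelines read alignments and a BED file instead.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Length of the overlap between two half-open intervals [start, end)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))

def on_target_rate(alignments, targets):
    """Fraction of mapped bases that fall inside target intervals.

    Both arguments are lists of (chrom, start, end) half-open intervals;
    target intervals are assumed not to overlap one another.
    """
    total_bases = 0
    on_target_bases = 0
    for chrom, start, end in alignments:
        total_bases += end - start
        for t_chrom, t_start, t_end in targets:
            if chrom == t_chrom:
                on_target_bases += overlap_length(start, end, t_start, t_end)
    if total_bases == 0:
        raise ValueError("no mapped bases")
    return on_target_bases / total_bases

# Hypothetical example: one 100 bp target and three 50 bp alignments.
targets = [("chr1", 100, 200)]
alignments = [("chr1", 90, 140), ("chr1", 150, 200), ("chr2", 100, 150)]

print(on_target_rate(alignments, targets))  # 0.6
```

Here 90 of the 150 mapped bases overlap the target, so the on-target rate is 60%.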

Interplay of Optimizing Uniformity and On-Target Rate

Uniformity (fold-80) and on-target rate both define the efficiency of targeted sequencing. But how much impact does each metric have? As long as library preparation conditions for the probe panel are consistent, on-target rates tend to vary only a little and can be considered a “tax” on the sequencing effort (Chilamakuri et al. 2014). When uniformity is perfect (fold-80 is 1.0), the on-target rate and CM are inversely proportional. For example, assuming a desired coverage (CD) of 10x and perfect uniformity, an on-target rate of 80% would mean one should aim for a CM of 12.5x:

CM = CD / on-target rate = 10x / 0.8 = 12.5x
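
The same arithmetic can be written as a small Python helper. This is a sketch under the perfect-uniformity assumption stated above; the function names and the 1 Mb panel size are hypothetical, and the conversion to raw output uses the relation mean depth = data volume / panel size.

```python
def required_mean_coverage(desired_coverage, on_target_rate):
    """Mean coverage (CM) to aim for so on-target depth reaches CD.

    Assumes perfect uniformity (fold-80 of 1.0).
    """
    if not 0 < on_target_rate <= 1:
        raise ValueError("on_target_rate must be in (0, 1]")
    return desired_coverage / on_target_rate

def required_output_bases(desired_coverage, on_target_rate, panel_size_bases):
    """Raw sequencing output, in bases, for a panel of the given size."""
    return required_mean_coverage(desired_coverage, on_target_rate) * panel_size_bases

print(required_mean_coverage(10, 0.8))            # 12.5
print(required_output_bases(10, 0.8, 1_000_000))  # 12500000.0
```

At a 100% on-target rate the same call returns 10, which is why the on-target rate acts like a fixed surcharge on the sequencing effort.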

Conversely, even small improvements in fold-80 can significantly improve efficiency. Improving uniformity reduces coverage of over-sequenced targets and increases coverage of under-sequenced targets.
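
One way to see how fold-80 drives the CM/CD ratio is the following Python sketch. It assumes coverage depth is normally distributed and that fold-80 equals the mean divided by the 20th-percentile depth; both the model and the function name `cm_cd_ratio` are illustrative assumptions rather than the original method.

```python
import math
from statistics import NormalDist

def cm_cd_ratio(fold80, fraction_covered):
    """CM/CD ratio needed so that fraction_covered of bases reach CD.

    Assumes normally distributed depth with fold-80 = mean / 20th percentile.
    """
    standard = NormalDist()
    # Coefficient of variation (sigma / mean) implied by the fold-80 score.
    cv = (1.0 - 1.0 / fold80) / -standard.inv_cdf(0.20)
    # z-score below which (1 - fraction_covered) of the bases fall.
    z_tail = standard.inv_cdf(1.0 - fraction_covered)
    # CD = mean * (1 + z_tail * cv), so CM/CD = 1 / (1 + z_tail * cv).
    denominator = 1.0 + z_tail * cv
    if denominator <= 0:
        return math.inf  # not reachable under this simple model
    return 1.0 / denominator

for score in (1.2, 1.4, 1.7):
    print(score, round(cm_cd_ratio(score, 0.90), 2))
```

As a consistency check, asking for 80% of bases to reach CD returns the fold-80 score itself, which matches the interpretation of fold-80 as the fold of extra sequencing needed to bring 80% of bases up to the mean.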

To examine the relative effects of on-target rate and uniformity, we simulated 3,003 normal distributions with varying uniformity, mean coverage, and on-target rates. Improving the on-target rate while maintaining constant uniformity (Figure 4A) shifts the coverage distribution toward higher mean (CM) values, increasing the proportion of bases covered above the desired coverage (CD). Improving fold-80 scores, as stated earlier, improves read utilization by both rescuing under-sequenced regions and reducing the fraction of over-sequenced regions (Figure 4B). In this case, although mean coverage (CM) values remain constant, the proportion of bases covered above the desired coverage (CD) increases. In both panels, the differences in the number of actionable bases are represented by the areas between the curves, below the CD.

Figure 4C illustrates the combined impacts of changing on-target rates, fold-80 scores, and mean coverage. Each colored curve represents a different fold-80, and the width of the curve represents the percentage of actionable bases recovered when on-target rates are between 80% (bottom limit of each curve) and 100% (top). In each curve, when CM is 30x, improving on-target rate from 80% to 100% (essentially eliminating all off-target sequencing) increases the fraction of actionable bases by 1–2%. In contrast, improving fold-80 from 1.7 to 1.4 increases this number more dramatically, by 5–6%.

The data demonstrate that improvements to fold-80 scores (uniformity) have a much more significant impact on the efficiency of targeted NGS than do improvements to on-target rates, even if the off-target rate could be reduced to zero.
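
The comparison can be approximated with a short Python sketch. It is not the original simulation: it assumes a normal depth distribution, derives the standard deviation from fold-80 (mean divided by the 20th-percentile depth), and treats the on-target rate as a simple scaling of the mean. The function name `fraction_covered` and these modeling choices are assumptions.

```python
from statistics import NormalDist

# z-score of the 20th percentile of a standard normal (about -0.8416).
Z20 = NormalDist().inv_cdf(0.20)

def fraction_covered(desired_coverage, mean_coverage, fold80, on_target_rate):
    """Fraction of target bases at or above desired_coverage (normal model)."""
    # Assumption: off-target reads scale the on-target mean down proportionally.
    effective_mean = mean_coverage * on_target_rate
    if fold80 <= 1.0:
        # Perfect uniformity: every base sits exactly at the mean.
        return 1.0 if effective_mean >= desired_coverage else 0.0
    # fold-80 = mean / 20th percentile, so sigma follows from mean and fold-80.
    sigma = effective_mean * (1.0 - 1.0 / fold80) / -Z20
    return 1.0 - NormalDist(effective_mean, sigma).cdf(desired_coverage)

CD, CM = 10, 30
best = fraction_covered(CD, CM, 1.4, 1.0)
gain_on_target = best - fraction_covered(CD, CM, 1.4, 0.8)
gain_uniformity = best - fraction_covered(CD, CM, 1.7, 1.0)

print(f"on-target 0.8 -> 1.0 at fold-80 1.4: +{gain_on_target:.3f}")
print(f"fold-80 1.7 -> 1.4 at on-target 1.0: +{gain_uniformity:.3f}")
```

Under these assumptions the gain from improving fold-80 (about 0.06) is several times the gain from removing all off-target reads (about 0.02), the same direction as the result reported above.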

Figure 4. Effect of uniformity versus on-target rate on required depth of sequencing. Simulation results assuming desired coverage (CD) = 10x, normal distribution of coverage depth, and varying mean coverage (CM), on-target rate (0.8–1.0) and fold-80 (1.1–2.0). A. Simulated depth of coverage distributions with changing on-target rates but constant fold-80 and CM (1.4 and 30, respectively). Improvements in on-target rate increase the mean coverage, shifting the distribution to the right. B. Simulated depth of coverage distributions with changing fold-80 and constant on-target rate and CM (0.9 and 30, respectively). Improving (reducing) fold-80 scores reduces coverage of over-sequenced targets and increases coverage of under-sequenced targets. C. Proportion of target bases covered at 10x or higher for changing on-target rates, fold-80 scores, and mean coverage.

Conclusions and Perspectives

In targeted NGS, uniformity (fold-80) and on-target rate are important metrics for evaluating efficiency of the sequencing effort. These two metrics are largely intrinsic properties of the probe panels themselves, and optimizing them can reduce the amount of sequencing needed to obtain high-confidence data.

Choosing the most efficient target enrichment system requires carefully weighing the actual range of uniformity and on-target rate offered. While on-target rate is important, we demonstrate here that improvements to fold-80 scores (uniformity) have a much more significant impact on the efficiency of targeted NGS.

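On-Target Rate

On-target rate is a percentage that indicates how much of the sequencing data aligns to the target regions. The genome contains many regions that share homology with exons (for example, introns and intergenic regions), and in practice fragments from these non-target regions are also captured during hybridization. Capture of non-target fragments by the probes is called off-target capture. Off-target data cannot be used in downstream analysis, so that share of the sequencing output is wasted. All else being equal, the higher the on-target rate, the less sequencing is wasted on off-target reads and the better the probe panel.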

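Uniformity

Positions within the target regions are not all covered to the same extent. For example, a WES run with a mean depth of 60x may well contain positions at 10x, 40x, and 90x. The better the uniformity, the closer the depth at each position is to the mean depth. In practice, the amount of data is allocated according to the desired target depth, which fixes the mean depth of the run (mean depth = data volume / probe panel size). Where the actual depth of a region exceeds the target depth, data is wasted; where it falls below the target depth, the data for that region may be judged unreliable and discarded, leaving the region effectively without sequencing data. A probe panel with good uniformity reduces both situations. Fold-80 is the metric used to assess uniformity. It is defined as fold-80 = mean depth / (depth reached or exceeded by 80% of target bases), that is, the fold of additional sequencing needed to raise 80% of bases to the mean coverage. The ideal fold-80 is 1. The lower the fold-80, the better the uniformity, the higher the capture efficiency, the less sequencing is wasted, and the better the probe panel.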

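Coverage

Coverage is usually quoted together with a depth, as in "10x coverage" or "30x coverage". For example, "10x coverage of 90%" means that, after the sequencing data is aligned to the target regions, 90% of the target is covered at least 10 times, that is, by at least 10 reads. When coverage is quoted without a depth, it can be read as "1x coverage". For example, "coverage of 95%" means that 95% of the target regions are covered by at least one read. In other words, 5% of the target regions have no reads at all and were missed entirely in that run. All else being equal, the higher the coverage, the smaller the share of target regions that is missed and the better the probe panel.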
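
The "N x coverage" figure described above can be computed from per-base depths with a short Python sketch. The function name `breadth_of_coverage` and the example depths are hypothetical.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Fraction of target positions covered by at least min_depth reads."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)

# Hypothetical per-base depths for a 10-base target.
depths = [0, 5, 10, 12, 30, 8, 10, 25, 40, 11]

print(breadth_of_coverage(depths, 10))  # 0.7  ("10x coverage of 70%")
print(breadth_of_coverage(depths))      # 0.9  (1x coverage; one base is missed)
```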

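Duplication Rate

Duplication rate (dup rate) is the proportion of duplicate reads among all sequenced reads. Duplicate reads add no new information and can compromise the accuracy of variant calling, so they need to be removed in downstream bioinformatic analysis. The higher the dup rate, the lower the data utilization and the more sequencing cost is wasted. All else being equal, the lower the duplication rate, the more sequencing cost is saved and the better the probe panel.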
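
A simplified Python sketch of the duplication rate follows. It treats reads that share the same chromosome, start position, and strand as duplicates of the first such read; this key, the function name, and the example reads are assumptions, and dedicated duplicate-marking tools use more information (for example mate positions).

```python
def duplication_rate(reads):
    """Fraction of reads that repeat an already-seen (chrom, start, strand) key."""
    seen = set()
    total = 0
    duplicates = 0
    for key in reads:
        total += 1
        if key in seen:
            duplicates += 1  # every copy after the first counts as a duplicate
        else:
            seen.add(key)
    if total == 0:
        raise ValueError("no reads")
    return duplicates / total

# Hypothetical reads: three share the same position and strand.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr2", 50, "+"),
    ("chr1", 100, "+"),
]

print(duplication_rate(reads))  # 0.4
```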

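Sequencing Depth

Sequencing depth is the ratio of the total number of sequenced bases to the genome size (for targeted sequencing, the size of the probe panel), that is, the average number of times each base is sequenced. Sequencing depth is positively correlated with the detection rate of SNVs. 20× ROI (%) is the percentage of the region of interest sequenced to a depth of 20× or more.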

References

  1. Chilamakuri CSR, Lorenz S, Madoui M-A, Vodák D, Sun J, Hovig E, Myklebost O, Meza-Zepeda LA (2014) Performance comparison of four exome capture systems for deep sequencing. BMC Genomics 15(1): 449.

  2. Dillon OJ, Lunke S, Stark Z, Yeung A, Thorne N, Gaff C, White SM, Tan TY (2018) Exome sequencing has higher diagnostic yield compared to simulated disease-specific panels in children with suspected monogenic disorders. Eur J Hum Genet 26(5): 644–651.

  3. Goldfeder RL, Priest JR, Zook JM, Grove ME, Waggott D, Wheeler MT, Salit M, Ashley EA (2016) Medical implications of technical accuracy in genome sequencing. Genome Med 8(1): 24.

  4. Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS (2013) Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinformatics 14: 195.

  5. Meynert AM, Ansari M, FitzPatrick DR, Taylor MS (2014) Variant detection sensitivity and biases in whole genome and exome sequencing. BMC Bioinformatics 15: 247.

  6. Warr A, Robert C, Hume D, Archibald A, Deeb N, Watson M (2015) Exome sequencing: current and future perspectives. G3: Genes|Genomes|Genetics 5(8): 1543–1550.
