First Standard for Clinical NGS Bioinformatics Pipelines Released

teszsz 2017-11-27


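Bioinformatics analysis is the process of analyzing and processing the raw reads produced by sequencing, and it is a prerequisite for genetic analysis and clinical decision-making in clinical genetic diagnosis. Bioinformatics pipelines currently lack norms and standards. To address this, the Association for Molecular Pathology (AMP) and the College of American Pathologists (CAP) convened stakeholders to develop and publish "Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines". 基因慧 has summarized the relevant content below, for reference only (reply "AMP" on the official account's home page to view the full guideline).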

Figure 1. The article was published in The Journal of Molecular Diagnostics (j.jmoldx.2017.11.003).

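Key points

1 Establishes the clinical standard for bioinformatics pipelines in gene sequencing

2 Clarifies the basic definitions, workflow, and detailed steps of bioinformatics

3 Discloses the PubMed search strategy used to review this standard and consensus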

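Scope of the guideline

The guideline consists mainly of 17 consensus recommendations for validating clinical NGS bioinformatics pipelines. They cover a laboratory's NGS bioinformatics pipeline from design and development through to operation, and stress that properly trained molecular professionals are the key factor in ensuring the quality of NGS testing.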

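Bioinformatics pipelines and modules

The article notes that once NGS has generated massive amounts of genetic data, bioinformatics builds on concepts from molecular biology and uses computational methods and models (applied mathematics, computer science, statistics, and so on) to analyze and process the data and extract useful information. A workflow that processes NGS data in a defined logic and order is called an NGS bioinformatics pipeline. It comprises: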

A collection of software and tools
Databases
The operating environment (hardware and software)

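Bioinformatics pipelines are characterized by automation and require quality control (QC) to ensure that the data they generate are stable, accurate, reproducible, and traceable. As with all other hardware and software in clinical use, every step of a clinical NGS pipeline needs quality control. This is critical not only for patient health but also for troubleshooting and regulatory compliance. A typical NGS bioinformatics analysis pipeline is shown in Figure 2: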

Figure 2. A typical bioinformatics analysis pipeline. Image from The Journal of Molecular Diagnostics.

Basic steps of a bioinformatics pipeline

In brief, a complete bioinformatics pipeline consists of the following basic steps:


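1. Sequence generation

This step comprises signal processing and base calling. Starting from the sequencer's optical or other signals, it generates the base sequence (mainly the four bases A, C, G, and T) together with a quality value for each base (that is, the confidence of the base call).

Step 1 takes the raw sequencing signal and yields the raw base sequence (usually written with nucleotide abbreviations such as A, C, G, and T).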
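As a rough illustration of what "a quality value for each base" means in practice, the sketch below decodes a quality string into Phred scores and error probabilities. It assumes the reads are stored as FASTQ with Phred+33 (Sanger) encoding, which the article does not state; the record shown is hypothetical.

```python
def phred33_to_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33 encoding) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Hypothetical four-line FASTQ record: read name, bases, separator, qualities.
record = ["@read_1", "ACGT", "+", "II5+"]
bases = record[1]
scores = phred33_to_scores(record[3])

for base, q in zip(bases, scores):
    print(base, q, error_probability(q))
```

A score of 20 corresponds to a 1-in-100 chance of a wrong call, and a score of 40 to 1 in 10,000, which is why downstream steps depend so strongly on base quality.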

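2. Sequence alignment

Sequence alignment maps the DNA reads obtained from the sample to a reference genome so that sequence variants and differences in the sample can be found. The process is computationally intensive. Each short read is assigned a Phred-scaled mapping quality score (indicating the confidence of the alignment) and a physical position in the reference genome (used to compute depth and coverage). Alignment results are usually stored as BAM files, the binary version of the SAM alignment format. The article also mentions the compressed format CRAM and its encrypted compressed counterpart SECRAM, which save considerable space (BAM files are usually large). CRAM format specification (version 3.0): http://samtools./hts-specs/CRAMv3.pdf

Step 2 takes the raw base sequence and yields a file of alignments against the reference genome for downstream analysis.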
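To make "depth and coverage" concrete, here is a minimal sketch that computes per-base depth, mean depth, and breadth of coverage from read positions. It is a simplification under stated assumptions: reads are given as hypothetical (start, length) pairs with 0-based starts, and every alignment is treated as ungapped, whereas a real pipeline would read these from BAM/CRAM records and honor the CIGAR string.

```python
def per_base_depth(reads, region_start, region_end):
    """Count how many reads cover each position in [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, length in reads:
        lo = max(start, region_start)
        hi = min(start + length, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def summarize(depth, min_depth=1):
    """Return (mean depth, fraction of positions covered at >= min_depth)."""
    mean_depth = sum(depth) / len(depth)
    breadth = sum(1 for d in depth if d >= min_depth) / len(depth)
    return mean_depth, breadth


# Hypothetical ungapped alignments as (0-based start, read length).
reads = [(0, 5), (3, 5), (4, 4)]
depth = per_base_depth(reads, 0, 10)
mean_depth, breadth = summarize(depth)
print(depth)                 # [1, 1, 1, 2, 3, 2, 2, 2, 0, 0]
print(mean_depth, breadth)   # 1.4 0.8
```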

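3. Variant calling

Following sequence alignment, variant calling applies basic molecular biology knowledge to assess and screen the sequence differences between the sample and the reference genome: which are pathogenic mutations and which are normal polymorphisms (loosely speaking, harmless ones).

The input to this step is a BAM file or a similar format. Analysis is organized around the variant types currently recognized by the scientific community: single-nucleotide variants (SNVs), small insertions and deletions (indels), copy number alterations (CNAs, sometimes called CNVs), and large structural variants (SVs: insertions, inversions, translocations, and so on). A heterogeneous set of algorithmic strategies is assembled, and the corresponding set of sequence variants is then derived computationally. The accuracy of this step depends heavily on base quality (step 1) and alignment quality (step 2). For SNVs and indels, commonly used file formats include: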

1) VCF: https://samtools./hts-specs/VCFv4.3.pdf

2) gVCF: https://sites.google.com/site/gvcftools/home/about-gvcf/gvcf-conventions, https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md

3) HGVS format: http://varnomen./bg-material/simple/

4) GA4GH format: http:///working-groups/our-work

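Step 3 uses the raw sequence and the alignment results to detect variants (mutations) in the sample that may be associated with disease and other phenotypes.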
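The sketch below shows how one VCF data line can be split into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and how REF/ALT lengths separate SNVs from small indels. The example line is hypothetical, the length-based classification is a simplification, and symbolic alleles such as `<DEL>` are only flagged, not interpreted.

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alts": alt.split(","),  # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info,
    }


def classify(ref, alt):
    """Classify a REF/ALT pair by allele length (simplified)."""
    if alt.startswith("<"):
        return "symbolic"  # e.g. structural variant placeholders
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNV"  # same-length substitution longer than one base


# Hypothetical data line with two ALT alleles.
line = "chr1\t12345\t.\tA\tG,AT\t50\tPASS\tDP=100"
variant = parse_vcf_line(line)
kinds = [classify(variant["ref"], alt) for alt in variant["alts"]]
print(kinds)  # ['SNV', 'insertion']
```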

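4. Variant filtering

Filtering at this stage is data-driven, based on sequencing data quality, alignment rate, depth, coverage, and similar metrics. It performs a first pass over the thousands to tens of thousands of sequence variants.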
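A data-level filter of this kind can be sketched as a set of minimum thresholds applied to each call. The metric names, the threshold values, and the example calls below are all hypothetical placeholders, not values recommended by the guideline; a laboratory would set and validate its own. Recording which filters failed, rather than only dropping the call, keeps the decision traceable.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    name: str
    depth: int    # read depth at the variant position
    qual: float   # Phred-scaled variant quality
    mapq: float   # mean mapping quality of supporting reads


# Placeholder minimums for illustration only.
THRESHOLDS = {"depth": 20, "qual": 30.0, "mapq": 20.0}


def failed_filters(call, thresholds=THRESHOLDS):
    """Return the names of the metrics that fall below their minimum."""
    return [metric for metric, minimum in thresholds.items()
            if getattr(call, metric) < minimum]


calls = [
    VariantCall("var_1", depth=150, qual=60.0, mapq=60.0),
    VariantCall("var_2", depth=8, qual=45.0, mapq=60.0),
    VariantCall("var_3", depth=40, qual=12.0, mapq=5.0),
]

kept = [c.name for c in calls if not failed_filters(c)]
rejected = {c.name: failed_filters(c) for c in calls if failed_filters(c)}
print(kept)      # ['var_1']
print(rejected)  # {'var_2': ['depth'], 'var_3': ['qual', 'mapq']}
```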

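5. Variant annotation

Starting from the filtered variants, this step annotates each variant using known databases related to gene function, such as COSMIC, TCGA, dbSNP, and ClinVar, along with other information such as the protein the variant maps to. The annotations support further screening and interpretation of the variants.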
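At its simplest, annotation is a lookup keyed by the variant's coordinates and alleles. The sketch below uses a toy in-memory table; every entry, gene name, and record ID in it is a made-up placeholder and does not come from COSMIC, TCGA, dbSNP, ClinVar, or any real database, whose actual access methods are not shown here.

```python
# Toy annotation table keyed by (chromosome, position, ref, alt).
# Every entry is a made-up placeholder, not a real database record.
TOY_ANNOTATIONS = {
    ("chr1", 12345, "A", "G"): {
        "gene": "GENE_A",
        "source": "toy_db",
        "record_id": "EXAMPLE_0001",
    },
}


def annotate(variant, table):
    """Return a copy of the variant with any matching annotation attached."""
    key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
    annotated = dict(variant)
    annotated["annotation"] = table.get(key)  # None when the variant is unknown
    return annotated


known = annotate({"chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G"}, TOY_ANNOTATIONS)
novel = annotate({"chrom": "chr2", "pos": 500, "ref": "C", "alt": "T"}, TOY_ANNOTATIONS)
print(known["annotation"]["gene"])  # GENE_A
print(novel["annotation"])          # None
```

Keeping unmatched variants with an explicit empty annotation, instead of dropping them, leaves the decision about novel variants to the later interpretation step.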

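6. Variant interpretation (Variant prioritization)

Once all basic data analysis is complete, the results must be combined with the clinical phenotype and clinical knowledge bases. With the participation of geneticists, genetic counselors, and clinicians, the remaining variants are analyzed further to rank their association with disease and phenotype.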

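17 consensus recommendations

These 17 consensus recommendations matter greatly for standardizing bioinformatics pipelines and for regulating their clinical use. To preserve their accuracy and original wording, they are quoted verbatim from the original article: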

Consensus Recommendation Statements for NGS Bioinformatics Pipeline Validation.
#	Statement
1	Clinical laboratories offering NGS-based testing should perform their own validation of the bioinformatics pipeline
2	A qualified medical professional with appropriate training in NGS interpretation and certification must oversee and be involved in the validation process
3	Validation must be performed only after completion of design, development, optimization, and familiarization of the bioinformatics pipeline and its components
4	Bioinformatics pipeline validation should closely emulate the real world environment of the laboratory in which the test is performed
5	Validation should include all individual components of the bioinformatics pipeline used in the analysis, and each component must be reviewed and approved by an appropriately qualified medical molecular professional and the laboratory director
6	The design and implementation of the bioinformatics pipeline must ensure the security of identifiable patient information and be compliant with all applicable laws at the local, state, and national levels
7	Validation of the NGS bioinformatics pipeline must be appropriate and applicable for the intended clinical use, specimen, and variant types detected of the NGS test
8	Laboratories must ensure that the design, implementation, and validation of the bioinformatics pipeline are compliant with applicable laboratory accreditation standards and regulations
9	The bioinformatics pipeline is part of the test procedure, and its components and processes must be documented according to laboratory accreditation standards and regulations
10	The identity of the sample must be preserved throughout each step of the NGS bioinformatics pipeline with a minimum of four unique identifiers including a unique location identifier within the content of each data file read and/or generated by the pipeline
11	Specific quality control and quality assurance parameters must be evaluated during validation and used to determine satisfactory performance of the bioinformatics pipeline
12	The methods used to alter or filter sequence reads at any point in the bioinformatics pipeline prior to interpretation must be validated to ensure that the data presented for interpretation accurately and reproducibly represent the sequence in the specimen, and full documentation of these methods must be kept as part of the test documentation according to laboratory accreditation standards and regulations
13	Laboratories must include specific measures to ensure that each data file generated in the bioinformatics pipeline maintains its integrity and provides alerts for or prevents the use of data files which have been altered in an unauthorized or unintended manner
14	In silico validation can be used to supplement the validation of the bioinformatics pipeline but shall not be used in lieu of end-to-end validation of the bioinformatics pipelines using human samples
15	Validation of the bioinformatics pipeline must include confirmation of a representative set of variants with high quality independent data; appropriate validation metrics by variant type should be reported
16	Clinical laboratories must ensure the accuracy of software-generated HGVS variant nomenclature and annotations and have an alert in place to indicate when the software-generated nomenclature and annotations need to be manually reviewed and/or corrected, and documentation of any corrections must be maintained
17	Supplemental validation is required whenever a significant change is made to any component of the bioinformatics pipeline
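Recommendation 13 asks that altered data files be detected or blocked. One common way to approach this, shown here only as a sketch and not as the method the guideline prescribes, is to record a SHA-256 checksum for each file when it is written and to compare it again before the file is used. The file name and the manifest layout are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest):
    """Return the paths whose current checksum differs from the recorded one."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]


with tempfile.TemporaryDirectory() as tmp:
    vcf = Path(tmp) / "sample.vcf"  # hypothetical pipeline output
    vcf.write_text("##fileformat=VCFv4.3\n")

    # Record the checksum at the moment the pipeline writes the file.
    manifest = {str(vcf): sha256_of(vcf)}
    unchanged = verify(manifest)

    # Simulate an unintended modification after the fact.
    vcf.write_text("##fileformat=VCFv4.3\n#edited\n")
    altered = verify(manifest)

print(unchanged)     # []
print(len(altered))  # 1
```

A checksum only detects change. Preventing the use of an altered file, as the recommendation also allows, would require the pipeline to refuse to proceed when `verify` returns a non-empty list.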