This document describes "regular" VCF files produced for GERMLINE short variant (SNP and indel) calls (e.g. by HaplotypeCaller in "normal" mode and by GenotypeGVCFs). For information on the special kind of VCF called GVCF produced by HaplotypeCaller in Contents
1. OverviewVCF, or Variant Call Format, It is a standardized text file format used for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and further development has been taken over by the Genomic Data Toolkit team of the Global Alliance for Genomics and Health. The full format spec can be found in the Samtools/Hts-specs repository, along with other useful specifications like SAM/BAM/CRAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we do not go over in this document. VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation. That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools. Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:
2. Structure of a VCF fileA valid VCF file is composed of two main parts: the header, and the variant call records. The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section. The actual data lines will look something like this: [HEADER LINES] #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 10001019 . T G 364.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.699;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=3.064;MLEAC=1;MLEAF=0.500;MQ=42.48;MQRankSum=-3.219e+00;QD=11.05;ReadPosRankSum=-6.450e-01;SOR=0.537 GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480 20 10001298 . T A 884.77 . AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.49;SOR=1.765 GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0 20 10001436 . A AAGGCT 1222.73 . AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.836 GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0 20 10001474 . C T 843.77 . AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=1.302 GT:AD:DP:GQ:PL 1/1:0,27:27:81:872,81,0 20 10001617 . C A 493.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.63;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=1.323;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=12.99;ReadPosRankSum=0.170;SOR=1.179 GT:AD:DP:GQ:PL 0/1:19,19:38:99:522,0,480 After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs and indels, but other variation types could be described (see the VCF specification for details). Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified. You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header. 3. Interpreting the header informationThe following is a valid VCF header produced by GenotypeGVCFs on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself. ##fileformat=VCFv4.2 ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)"> ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> ##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Fri Jan 20 11:14:15 EST 2017",Epoch=1484928855435,CommandLineOptions="[command-line goes here]"> ##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="[command-line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 5:45:56 PM EST"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> ##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity"> ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> ##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality"> ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> ##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias"> ##contig=<ID=20,length=63025520> ##reference=file:///data/ref/ref.fasta ##source=GenotypeGVCFs This is a lot of lines, so let us break it down into digestible bits. Note that the header lines are always listed in alphabetical order. VCF spec versionThe first line: ##fileformat=VCFv4.2 tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes. FILTER linesThe FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied: ##FILTER=<ID=LowQual,Description="Low quality"> Records that fail any of the filters listed here will contain the ID of the filter (here, FORMAT and INFO linesThese lines define the annotations contained in the GATKCommandLineThe GATKCommandLine lines contain all the parameters that went used by the tool that generated the file. Here, Contig lines and ReferenceThese contain the contig names, lengths, and which reference assembly was used with the input BAM file. This can come in handy when someone gives you a callset but does not tell you which reference it was derived from -- remember that for many organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one! For more information on genome references, see the corresponding Dictionary entry. 4. Structure of variant call recordsFor each site record, the information is structured into columns (also called fields) as follows: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...] The first 8 columns of the VCF records (up to and including Sample-specific information such as genotype and individual sample-level annotation values are contained in the Site-level properties and annotationsThese first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie CHROM and POSThe contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event. IDAn optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP. A typical identifier is the dbSNP ID, which in human data would look like rs28548431, for example. REF and ALTThe reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). The REF and ALT alleles are the only required elements of a VCF record that tell us whether the variant is a SNP or an indel (or in complex cases, a mixed-type variant). If we look at the following two sites, we see the first is a SNP, the second is an insertion and the third is a deletion: 20 10001298 . T A 884.77 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0 20 10001436 . A AAGGCT 1222.73 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0 20 10004769 . TAAAACTATGC T 622.73 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,17:35:99:660,0,704 Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion. QUALThe Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the Technical Documentation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used. FILTERThis field contains the name(s) of any filter(s) that the variant fails to pass, or the value INFOVarious site-level annotations. This field is not required to be present in the VCF. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie Sample-level annotationsAt this point you have met all the fields up to INFO in this lineup: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...] All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the 5. Interpreting genotype and other sample-level informationThe sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they are actually not that hard to interpret once you understand that they are just sets of tags and values. Let us take a look at three of the records shown earlier, simplified to just show the key genotype annotations: 20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480 20 10001298 . T A 884.77 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0 20 10001436 . A AAGGCT 1222.73 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0 Looking at that last column, here is what the tags mean: GTThe genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there is a single ALT allele (by far the more common case), GT will be either: - 0/0 : the sample is homozygous reference - 0/1 : the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles - 1/1 : the sample is homozygous alternate In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, A/A and AAGGCT/AAGGCT respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT (e.g. AD and DPAllele depth (AD) and depth of coverage (DP). These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP. See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage) for more details. PL"Normalized" Phred-scaled likelihoods of the possible genotypes. For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities. We set the most likely genotype PL to 0 for easy reading purpose.The other values are scaled relative to this most likely genotype. Keep in mind, if you are not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated. GQThe Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low. Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used. A few examplesWith all the definitions out of the way, let us interpret the genotype information for a few records from our NA12878 callset, starting with at position 10001019 on chromosome 20: 20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480 At this site, the called genotype is Now let us look at a site where our confidence is quite a bit lower: 20 10024300 . C CTT 43.52 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,4:6:20:73,0,20 Here we have an indel -- specifically an insertion of Finally, let us look at a more complicated example: 20 10009875 . A G,AGGGAGG 1128.77 . [CLIPPED] GT:AD:DP:GQ:PL 1/2:0,11,5:16:99:1157,230,161,487,0,434 This site is a doozy; two credible ALT alleles were observed, but the REF allele was not -- so technically this is a biallelic site in our sample, but will be considered multiallelic because there are more than two alleles notated in the record. It is also a mixed-type record, since one of the ALTs by itself would make it an 6. Basic operations: validating, subsetting and exporting from a VCFThese are a few common things you may want to do with your VCFs that do not deserve their own tutorial. Let us know if there are other operations you think we should cover here. Validate your VCFValidation, or checking that the format of the file is correct, follows the specification, and will therefore not break any well-behave tool you choose to run on it. You can do this very simply with ValidateVariants. Note that ValidateVariants can also be used on GVCFs if you use the Subset records from your VCFSometimes you want to subset just one or a few samples from a big cohort. Sometimes you want to subset to just a genomic region. Sometimes you want to do both at the same time! Well, the same tool can do both, and more; it is called SelectVariants and has a lot of options for doing this in that way (including operating over intervals in the usual way). There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions. Other VCF files can also be used to modify the selection based on concordance or discordance between different callsets (see --discordance / --concordance arguments in the Tool Doc. Important notes about subsetting operations
Extract information from a VCF in a sane, (mostly) straightforward wayUse VariantsToTable. No, really, do not write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it is a comment on how annoyingly obtuse and convoluted the VCF format is. Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it could not handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal according to the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers. (Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.) 7. Merging VCF filesThere are three main reasons why you might want to combine variants from different files into one, and the tool you use depends on what you are trying to achieve.
There is actually one more reason why you might want to combine variants from different files into one, but we do not recommend doing it: you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we do not recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation. |
|
来自: BIOINFO_J > 《variation_call》