GTF2: Mouse/Human Annotation Collaboration: Submission Format

panhoy 2014-06-12


NOTE: The GTF specification has been updated to GTF2.2. The new format is backward compatible with GTF2.

GTF2 format (Revised Ensembl GTF)

Gene transfer format. This borrows from GFF, but has additional structure that warrants a separate definition and format name.

NEW! Validating Parser for GTF

Structure is as GFF, so the fields are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here is a simple example with 3 translated exons. Order of rows is not important.

AB000381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";

The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.
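As an illustration of the tab-separated layout, here is a minimal Python sketch that splits one record into its fields. The names `GtfRecord` and `parse_gtf_line` are hypothetical and not part of the specification. The sketch assumes that the attributes column is present and leaves it as an unparsed string.

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    """One GTF2 row; field names follow the column list above."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the column holds "."
    strand: str
    frame: Optional[int]     # None when the column holds "."
    attributes: str          # raw attribute text, not yet parsed


def parse_gtf_line(line: str) -> GtfRecord:
    """Split a single TAB-delimited GTF2 line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 TAB-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return GtfRecord(
        seqname=seqname,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        frame=None if frame == "." else int(frame),
        attributes=fields[8],
    )
```

A line copied from the example above would fail this parser, since its columns are aligned with spaces rather than separated by TABs.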

<seqname>
The FPC contig ID from the Golden Path.

<source>
The source column should be a unique label indicating where the annotations came from --- typically the name of either a prediction program or a public database.

<feature>
The following feature types are required: "CDS", "start_codon", "stop_codon". The feature "exon" is optional, since this project will not evaluate predicted splice sites outside of protein coding regions. All other features will be ignored.

CDS represents the coding sequence starting with the first translated codon and proceeding to the last translated codon. Unlike Genbank annotation, the stop codon is not included in the CDS for the terminal exon.

<start> <end>
Integer start and end coordinates of the feature relative to the beginning of the sequence named in <seqname>. <start> must be less than or equal to <end>. Sequence numbering starts at 1. Values of <start> and <end> that extend outside the reference sequence are technically acceptable, but they are discouraged for purposes of this project.

<score>
The score field will not be used for this project, so you can either provide a meaningful float or replace it with a dot.

<frame>
0 indicates that the first whole codon of the reading frame is located at the 5'-most base. 1 means that there is one extra base before the first codon and 2 means that there are two extra bases before the first codon. Note that the frame is not the length of the CDS mod 3.

Here are the details excised from the GFF spec. Important: Note comment on reverse strand.

'0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is the value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand.
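The frame rule can be restated as a small calculation: given a CDS region, its strand and its frame, find the coordinate of the first base of the first whole codon. The Python sketch below is only an illustration, and `first_codon_base` is a hypothetical helper name.

```python
def first_codon_base(start: int, end: int, strand: str, frame: int) -> int:
    """Return the coordinate of the first base of the first whole codon.

    On '+' the region is read from <start> upward; on '-' it is read
    from <end> downward, so the frame offset is subtracted from <end>.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    if strand == "+":
        return start + frame
    if strand == "-":
        return end - frame
    raise ValueError("strand must be '+' or '-'")
```

For a forward-strand CDS at 501..650 with frame 2, the first whole codon starts at 503. For a reverse-strand CDS at 200369..200508 with frame 1, it starts at 200507 and runs toward smaller coordinates.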

[attributes]
All four features have the same two mandatory attributes at the end of the record:

gene_id value; A globally unique identifier for the genomic source of the transcript.
transcript_id value; A globally unique identifier for the predicted transcript.

These attributes are designed for handling multiple transcripts from the same genomic region. Any other attributes or comments must appear after these two and will be ignored.

Attributes must end in a semicolon which must then be separated from the start of any subsequent attribute by exactly one space character (NOT a tab character).

Textual attributes should be surrounded by double quotes.
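The attribute rules above can be sketched in Python as follows. `parse_attributes` is a hypothetical name, and the sketch makes two simplifying assumptions: values never contain a semicolon, and any trailing comment has already been removed. It also checks that gene_id and transcript_id come first, as required above.

```python
def parse_attributes(text: str) -> dict:
    """Parse 'key value; key value;' pairs into a dict, stripping quotes.

    Assumption: no attribute value contains a ';' character.
    """
    attrs = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final semicolon
        key, _, value = chunk.partition(" ")
        attrs[key] = value.strip().strip('"')
    # dicts preserve insertion order, so the first two keys are the first two attributes
    if list(attrs)[:2] != ["gene_id", "transcript_id"]:
        raise ValueError("gene_id and transcript_id must be the first two attributes")
    return attrs
```

Extra attributes after the two mandatory ones are kept in the returned dict here, although the project itself ignores them.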

Here is an example of a gene on the negative strand. Larger coordinates are 5' of smaller coordinates. Thus, the start codon is the 3 bp with the largest coordinates among all the bp that fall within the CDS regions. Similarly, the stop codon is the 3 bp with coordinates just below the smallest coordinate within the CDS regions.

AB000123    Twinscan     CDS    193817    194022    .    -    2    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    199645    199752    .    -    2    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    200369    200508    .    -    1    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    215991    216028    .    -    0    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     start_codon   216026    216028    .    -    .    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     stop_codon    193814    193816    .    -    .    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
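The start_codon and stop_codon coordinates in the example above follow from the CDS coordinates and the strand. The Python sketch below shows one way to derive them. `codon_positions` is a hypothetical name, and the sketch assumes that neither codon is split by an intron, which holds for the examples in this document but not for every gene.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def codon_positions(cds: List[Interval], strand: str) -> Tuple[Interval, Interval]:
    """Return (start_codon, stop_codon) intervals for one transcript.

    Assumption: both codons are contiguous, i.e. not split by an intron.
    The stop codon lies just outside the CDS, as the format requires.
    """
    lowest = min(start for start, _ in cds)
    highest = max(end for _, end in cds)
    if strand == "+":
        return (lowest, lowest + 2), (highest + 1, highest + 3)
    if strand == "-":
        # on the reverse strand the 5' end has the largest coordinates
        return (highest - 2, highest), (lowest - 3, lowest - 1)
    raise ValueError("strand must be '+' or '-'")
```

Applied to the four CDS rows above with strand '-', this gives 216026..216028 for the start codon and 193814..193816 for the stop codon, matching the example.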

Note the frames of the coding exons. For example:

The first CDS (from 216028 to 215991) always has frame zero.
Frame of the 1st CDS = 0, length = 38. (frame - length) % 3 = 1, the frame of the 2nd CDS.
Frame of the 2nd CDS = 1, length = 140. (frame - length) % 3 = 2, the frame of the 3rd CDS.
Frame of the 3rd CDS = 2, length = 108. (frame - length) % 3 = 2, the frame of the terminal CDS.
Here % denotes the non-negative remainder, so (0 - 38) % 3 is 1, not -2.
Alternatively, the frame of the terminal CDS can be calculated without the rest of the gene. Length of the terminal CDS = 206. length % 3 = 2, the frame of the terminal CDS.
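The chain of frame calculations above can be written as a short Python sketch. `cds_frames` is a hypothetical name, and the sketch assumes a complete transcript whose first CDS begins at the start codon, so that its frame is zero. Python's `%` operator returns a non-negative result for a positive modulus, which is what the formula needs; in languages where `%` can return a negative value, the result would have to be adjusted.

```python
from typing import List, Tuple


def cds_frames(cds: List[Tuple[int, int]], strand: str) -> List[int]:
    """Return the frame of each CDS segment in 5'-to-3' order.

    Assumption: the first CDS starts at the start codon (frame 0).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # 5'-to-3' order is ascending on '+', descending on '-'
    ordered = sorted(cds, reverse=(strand == "-"))
    frames = []
    frame = 0
    for start, end in ordered:
        frames.append(frame)
        length = end - start + 1
        frame = (frame - length) % 3  # frame of the next segment
    return frames
```

For the negative-strand example this gives frames 0, 1, 2, 2 for the segments starting at 215991, 200369, 199645 and 193817, matching the frame column in the example rows.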

Here is an example in which the "exon" feature is used. It is a 5-exon gene with 3 translated exons.

AB000381 Twinscan exon         150   200   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         300   401   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          380   401   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         501   650   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          501   650   .   +   2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         700   800   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          700   707   .   +   2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         900 1000   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan start_codon 380   382   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan stop_codon   708   710   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";