NOTE: The GTF specification has been updated to GTF2.2. The new format is backward compatible with GTF2.
GTF2 format (Revised Ensembl GTF)
Gene transfer format. This borrows from GFF,
but has additional structure that warrants a separate definition and format
name.
The structure is the same as GFF,
so the fields are:
<seqname> <source> <feature> <start> <end> <score>
<strand> <frame> [attributes] [comments]
Here is a simple example with 3 translated exons. The order of rows is not
important.
AB000381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";
The whitespace in this example is provided only for readability. In GTF,
fields must be separated by a single TAB character and no other whitespace.
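As an illustration of the field layout, the following minimal Python sketch splits one TAB-separated record into the columns listed above. The names `GtfRecord` and `parse_gtf_line` are hypothetical and not part of the specification. The sketch assumes that a dot in <score> or <frame> means "no value" and that anything after the ninth column is a comment that can be dropped.

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    """One GTF row; field names follow the column names in the spec."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the column holds "."
    strand: str
    frame: Optional[int]     # None when the column holds "."
    attributes: str          # raw attribute text, not yet parsed


def parse_gtf_line(line: str) -> GtfRecord:
    """Split a single TAB-separated GTF line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 TAB-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return GtfRecord(
        seqname=seqname,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        frame=None if frame == "." else int(frame),
        attributes=fields[8],  # any further columns are treated as comments (assumption)
    )
```

For example, `parse_gtf_line('AB000381\tTwinscan\tCDS\t380\t401\t.\t+\t0\tgene_id "001"; transcript_id "001.1";')` yields a record whose `start` is 380 and whose `score` is `None`.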
<seqname>
The FPC contig ID from the Golden Path.
<source>
The source column should be a unique label indicating where the annotations
came from --- typically the name of either a prediction program or a public
database.
<feature>
The following feature types are required: "CDS", "start_codon", "stop_codon".
The feature "exon" is optional, since this project will not evaluate predicted
splice sites outside of protein coding regions. All other features will
be ignored.
CDS represents the coding sequence starting with the first translated
codon and proceeding to the last translated codon. Unlike Genbank annotation,
the stop codon is not included in the CDS for the terminal exon.
<start> <end>
Integer start and end coordinates of the feature relative to the beginning
of the sequence named in <seqname>. <start> must be less than
or equal to <end>. Sequence numbering starts at 1. Values of <start>
and <end> that extend outside the reference sequence are technically
acceptable, but they are discouraged for purposes of this project.
<score>
The score field will not be used for this project, so you can either
provide a meaningful float or replace it by a dot.
<frame>
0 indicates that the first whole codon of the reading frame is located
at the 5'-most base. 1 means that there is one extra base before the first
codon and 2 means that there are two extra bases before the first codon.
Note that the frame is not the length of the CDS mod 3.
Here are the details excised from the GFF
spec. Important: Note the comment on the reverse strand.
'0' indicates that the specified region is in frame, i.e. that
its first base corresponds to the first base of a codon. '1' indicates
that there is one extra base, i.e. that the second base of the region corresponds
to the first base of a codon, and '2' means that the third base of the
region is the first base of a codon. If the strand is '-', then the
first base of the region is the value of <end>, because the corresponding
coding region will run from <end> to <start> on the reverse strand.
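The strand rule can be made concrete with a small Python sketch that returns the coordinate of the first base of the first whole codon in a region. The function name `first_codon_base` is hypothetical; the logic only restates the definition above, counting extra bases from <start> on the '+' strand and from <end> on the '-' strand.

```python
def first_codon_base(start: int, end: int, strand: str, frame: int) -> int:
    """Return the coordinate of the first base of the first whole codon."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    if strand == "+":
        # The region is read from <start> upward, so skip `frame` bases forward.
        return start + frame
    if strand == "-":
        # The region is read from <end> downward, so skip `frame` bases backward.
        return end - frame
    raise ValueError("strand must be '+' or '-'")
```

For a '+' strand CDS at 501..650 with frame 2, the first whole codon starts at 503. For a '-' strand CDS at 200369..200508 with frame 1, it starts at 200507.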
[attributes]
All four features have the same two mandatory attributes at the end
of the record:
- gene_id value; A globally unique identifier
  for the genomic source of the transcript.
- transcript_id value; A globally unique identifier
  for the predicted transcript.
These attributes are designed for handling multiple transcripts from the
same genomic region. Any other attributes or comments must appear after
these two and will be ignored.
Attributes must end in a semicolon which must then be separated from
the start of any subsequent attribute by exactly one space character (NOT
a tab character).
Textual attributes should be surrounded by double quotes.
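A minimal Python sketch of reading the attribute column under these rules is shown here. The name `parse_attributes` and the regular expression are illustrative assumptions, not part of the specification. The sketch accepts both quoted and unquoted values and rejects records that lack either mandatory attribute. It does not enforce the exact single-space separator or the required attribute order.

```python
import re

# key, then either a double-quoted value or a bare token, then a semicolon
_ATTR_RE = re.compile(r'\s*(\w+)\s+("([^"]*)"|[^;\s]+)\s*;')

MANDATORY = ("gene_id", "transcript_id")


def parse_attributes(text: str) -> dict:
    """Return the attributes of one record as a dict of key -> string value."""
    attrs = {}
    for match in _ATTR_RE.finditer(text):
        key = match.group(1)
        quoted = match.group(3)
        # Use the text inside the quotes when present, else the bare token.
        attrs[key] = quoted if quoted is not None else match.group(2)
    missing = [key for key in MANDATORY if key not in attrs]
    if missing:
        raise ValueError(f"missing mandatory attribute(s): {', '.join(missing)}")
    return attrs
```

Applied to `gene_id "AB000381.000"; transcript_id "AB000381.000.1";`, this returns a dict with the two identifiers stripped of their quotes.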
Here is an example of a gene on the negative strand. Larger coordinates
are 5' of smaller coordinates. Thus, the start codon is the 3 bp with the largest
coordinates among all those bp that fall within the CDS regions. Similarly,
the stop codon is the 3 bp with coordinates just less than the
smallest coordinates within the CDS regions.
AB000123 Twinscan CDS 193817 194022 . - 2 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan CDS 199645 199752 . - 2 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan CDS 200369 200508 . - 1 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan CDS 215991 216028 . - 0 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan start_codon 216026 216028 . - . gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan stop_codon 193814 193816 . - . gene_id "AB000123.1"; transcript_id "AB00123.1.2";
Note the frames of the coding exons. For example:
- The first CDS (from 216028 to 215991) always has frame zero.
- Frame of the 1st CDS = 0, length = 38. (frame - length) % 3 = 1,
  the frame of the 2nd CDS.
- Frame of the 2nd CDS = 1, length = 140. (frame - length) % 3 = 2,
  the frame of the 3rd CDS.
- Frame of the 3rd CDS = 2, length = 108. (frame - length) % 3 = 2,
  the frame of the terminal CDS.
- Alternatively, the frame of the terminal CDS can be calculated without the
  rest of the gene. Length of the terminal CDS = 206. length % 3 = 2, the frame
  of the terminal CDS.
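The frame arithmetic above can be sketched in Python as follows. The function name `cds_frames` is hypothetical. The sketch assumes a complete gene whose first CDS begins with a whole codon (frame 0) and relies on Python's `%` operator returning a non-negative result for a positive modulus, which matches the (frame - length) % 3 values listed above.

```python
def cds_frames(cds_intervals, strand):
    """Return (start, end, frame) for each CDS segment in 5'-to-3' order.

    cds_intervals: iterable of (start, end) pairs, 1-based and inclusive.
    strand: '+' or '-'.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On the '-' strand the 5'-most segment has the largest coordinates.
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    frame = 0  # assumption: the first CDS starts with a complete codon
    result = []
    for start, end in ordered:
        result.append((start, end, frame))
        length = end - start + 1
        # Frame of the next segment, as in the worked example above.
        frame = (frame - length) % 3
    return result
```

Called with the four CDS intervals of the negative-strand example and strand '-', it yields frames 0, 1, 2, 2 in 5'-to-3' order, matching the records shown.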
Here is an example in which the "exon" feature is used. It is a 5 exon
gene with 3 translated exons.
AB000381 Twinscan exon 150 200 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 300 401 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS 380 401 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 501 650 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS 501 650 . + 2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 700 800 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS 700 707 . + 2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 900 1000 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan start_codon 380 382 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";