Vega gene and transcript types
Vega shows annotation from different sources and classifies genes and
transcripts into different classes. Since a single gene often has more
than one transcript, and these transcripts can be of different classes,
the classification of the gene as a whole is defined by the transcipt
with the 'highest' level of classification.
Annotation sources
There are three distinct sets of sources for the annotation of genes and transcripts in Vega. These are
shown as different tracks and have different colour schemes. - Havana Core Genes have been annotated in depth to identify
alternative transcripts, and are present for all species. They have
been annotated by the Havana group at the WTSI(桑格研究院).
- LoF genes
show the consequences of variations in sequence on the functional
properties of human transcripts.
For each of these sets, genes and transcripts are classified as shown
below.
2.Gene Classification
Genes can be classified according to their status,
which indicates the type of evidence that supports the annotation, and
their biotype, an indicator of biological
significance. For simplicity of display, genes are coloured according to
their biotype only, so for example 'Known Protein coding' and 'Novel Protein
coding' genes are both shown in the same shade of blue.
Status:
- Known.
- Identical to known cDNAs or proteins from the same
species and has an entry in species specific model databases: EntrezGene for human dog and pig, MGI for mouse, RGD for rat, and Zfin for Zebrafish.
- Novel. Identical or homologous to cDNAs from the same
species, or proteins from all species.
- Putative. Identical or homologous to spliced ESTs from
the same species.
- Predicted. Based on ab initio prediction and for
which at least one exon is supported by biological data (unspliced
ESTs, protein sequence similarity with mouse or tetraodon genomes or
expression data from Rosetta).
- Genes may have no status shown where this is not applicable, as for example
with the majority of pseudogenes.
Biotype:
- Protein coding. Contains an open reading frame (ORF).
- Polymorphic. A protein coding gene that has at least one transcript with a valid
ORF and one or more coding transcripts that contain a polymorphism.
- Processed transcripts. Doesn't contain an ORF. Divided into three major
categories.
- Long non-coding RNA (lncRNA). Subclassified into one of the following types:
- Non coding. Contains transcripts which are
known from the literature to not be protein coding.
- 3prime_overlapping_ncRNA. Has transcripts where
ditag and/or published experimental data strongly supports the
existence of long (>200bp) non-coding transcripts that overlap the
3'UTR of a protein-coding locus on the same strand.
- Antisense. Has transcripts that overlap the
genomic span (i.e. exon or introns) of a protein-coding locus on
the opposite strand.
- lincRNA (long interspersed ncRNA). Has transcripts that are long
intergenic non-coding RNA locus with a length >200bp. Requires
lack of coding potential and may not be conserved between
species.
- Retained_intron. Has an alternatively spliced transcript
believed to contain intronic sequence relative to other, coding,
variants.
- Sense_intronic. Has a long non-coding transcript
in introns of a coding gene that does not overlap any exons.
- Sense_overlapping. Has a long non-coding
transcript that contains a coding gene in its intron on the
same strand.
- Macro_lncRNA. Unspliced lncRNAs that are
several kb in size.
- Bidirectional lncRNA. A non-coding locus
that originates from within the promoter region of a
protein-coding gene, with transcription proceeding in the
opposite direction on the other strand.
- ncRNA
- miRNA. microRNA
- piRNA. piwi-interacting RNA
- rRNA. ribsosomal RNA
- siRNA. small interfering RNA
- snRNA. small nuclear RNA
- snoRNA. small nucleolar RNA
- tRNA. transfer RNA
- vaultRNA. Short non coding RNA genes that
form part of the vault ribonucleoprotein complex.
- Unclassified processed transcript. Cannot be
placed in one of the other categories.
- Pseudogene. Similar to known proteins but contain a
frameshift and/or stop codon(s) which disrupts the ORF. These can be
classified into the following:
- Processed pseudogene. Pseudogene that lack introns and is
thought to arise from reverse transcription of mRNA followed by
reinsertion of DNA into the genome.
- Unprocessed pseudogene. Pseudogene that can contain
introns since produced by gene duplication.
- Transcribed pseudogene. Pseudogene where protein homology
or genomic structure indicates a pseudogene, but the presence of
locus-specific transcripts indicates expression. These
can be classified into 'Processed',
'Unprocessed' and 'Unitary'.
- Translated pseudogene. Pseudogenes that have
mass spec data suggesting that they are also translated. These
can be classified into 'Processed',
'Unprocessed'
- Polymorphic pseudogene. Pseudogene owing to a SNP/DIP
but in other individuals/haplotypes/strains the gene is
translated.
- Unitary pseudogene. A species specific unprocessed pseudogene without a parent
gene, as it has an active orthologue in another species.
- IG Pseudogene. Inactivated immunoglobulin gene.
- IG Gene. Immunoglobulin gene.
- TR Gene. T cell receptor gene.
- TEC (To be Experimentally Confirmed). This is used for
non-spliced EST clusters that have polyA features. This category has
been specifically created for the ENCODE project to highlight
regions that could indicate the presence of protein coding
genes that require experimental validation, either by 5' RACE or
RT-PCR to extend the transcripts, or by confirming expression of the
putatively-encoded peptide with specific antibodies.
Transcript Classification
Transcripts are classified according to their class.
- Protein coding. Protein coding transcripts. These can be
further classified as follows:
- Known Protein coding. 100% Identical to RefSeq NP or Swiss-Prot entry.
- Novel Protein coding. Shares >60% length with known
coding sequence from RefSeq or Swiss-Prot or has cross-species/family support or domain evidence.
- Putative Protein coding. Shares <60% length with known
coding sequence from RefSeq or Swiss-Prot, or has an alternative
first or last coding exon.
- Nonsense mediated decay. If the coding sequence
(following the appropriate reference) of a transcript finishes >50bp
from a downstream splice site then it is tagged as NMD. If the
variant does not cover the full reference coding sequence then it
is annotated as NMD if NMD is unavoidable i.e. no matter what the
exon structure of the missing portion is the transcript will be
subject to NMD.
- Nonstop decay. Transcripts that have polyA
features (including signal) without a prior stop codon in the CDS,
i.e. a non-genomic polyA tail attached directly to the CDS without
3' UTR. These transcripts are subject to degradation.
- Processed transcripts. Doesn't contain an ORF. Divided into three major
categories.
- Long non-coding RNA (lncRNA). Subclassified into one of the following types:
- Non coding. Contains transcripts which are
known from the literature to not be protein coding.
- 3prime_overlapping_ncRNA. Has transcripts where
ditag and/or published experimental data strongly supports the
existence of long (>200bp) non-coding transcripts that overlap the
3'UTR of a protein-coding locus on the same strand.
- Antisense. Has transcripts that overlap the
genomic span (i.e. exon or introns) of a protein-coding locus on
the opposite strand.
- lincRNA (long interspersed ncRNA). Has transcripts that are long
intergenic non-coding RNA locus with a length >200bp. Requires
lack of coding potential and may not be conserved between
species.
- Retained_intron. Has an alternatively spliced transcript
believed to contain intronic sequence relative to other, coding,
variants.
- Sense_intronic. Has a long non-coding transcript
in introns of a coding gene that does not overlap any exons.
- Sense_overlapping. Has a long non-coding
transcript that contains a coding gene in its intron on the
same strand.
- macro_lncRNA. unspliced lncRNAs that are several kb in size.
- ncRNA
- miRNA. microRNA
- piRNA. piwi-interacting RNA
- rRNA. ribsosomal RNA
- siRNA. small interfering RNA
- snRNA. small nuclear RNA
- snoRNA. small nucleolar RNA
- tRNA. transfer RNA
- vaultRNA. Short non coding RNA genes that
form part of the vault ribonucleoprotein complex.
- Unclassified processed transcript. Cannot be
placed in one of the other categories.
- Pseudogene. Have homology to proteins but generally
suffer from a disrupted coding sequence and an active homologous gene can be
found at another locus. Sometimes these entries have an intact
coding sequence or an open but truncated ORF, in which case there is other evidence
used (for example genomic polyA stretches at the 3' end) to classify
them as a pseudogene. Can be further classified as follows:
- Processed pseudogene. Pseudogene that appears to have
been produced by integration of a reverse transcribed mRNA into
the genome.
- Unprocessed pseudogene. Pseudogene that shows evidence
of loss of function, but has exon-intron structure.
- Transcribed pseudogene. Pseudogene where protein homology
or genomic structure indicates a pseudogene, but the presence of
locus-specific transcripts indicates expression. These
can be classified into 'Processed' and
'Unprocessed'.
- Translated pseudogene. Pseudogenes that have
mass spec data suggesting that they are also translated. These
can be classified into 'Processed' and
'Unprocessed'
- Polymorphic pseudogene. Pseudogene owing to a SNP/DIP
but in other individuals/haplotypes/strains the gene is
translated.
- Unitary pseudogene. A species specific unprocessed pseudogene without a parent
gene, as it has an active orthologue in another species.
- IG Pseudogene. Inactivated immunoglobulin gene.
- TEC (To be Experimentally Confirmed). This is used for
non-spliced EST clusters that have polyA features. This category has
been specifically created for the ENCODE project to highlight
regions that could indicate the presence of novel protein coding
genes that require experimental validation, either by 5' RACE or
RT-PCR to extend the transcripts, or by confirming expression of the
putatively-encoded peptide with specific antibodies.
- Artifact. Used to tag mistakes in the public databases
(Ensembl/SwissProt/ Trembl). Usually these arise from
high-throughput cDNA sequencing projects, which submit automatic
annotation sometimes resulting in erroneous coding sequences that
are, for example, 3' UTRs.
- IG Gene. Immunoglobulin gene.
- TR Gene. T cell receptor gene.
|