 追着天使拔毛 2020-07-19

MG-RAST user manual

Motivation

MG-RAST provides Science as a Service for environmental DNA (“metagenomic sequences”) at https://.

The National Human Genome Research Institute (NHGRI), a division of the National Institutes of Health, publishes information (see Figure [fig:cost_per_megabase]) describing the development of computing costs and DNA sequencing costs over time (Institute 2012). The dramatic gap between the shrinking costs of sequencing and the more or less stable costs of computing is a major challenge for biomedical researchers trying to use next-generation DNA sequencing platforms to obtain information on microbial communities. Wilkening et al. (Wilkening et al. 2009) provide a real currency cost for the analysis of 100 gigabasepairs of DNA sequence data using BLASTX on Amazon’s EC2 service: $300,000. [1] A more recent study by University of Maryland researchers (Angiuoli et al. 2011) estimates the computation for a terabase of DNA shotgun data using their CLOVR metagenome analysis pipeline at over $5 million per terabase.

Chart showing shrinking cost for DNA sequencing. This comparison with Moore’s law roughly describing the development of computing costs highlights the growing gap between sequence data and the available analysis resources. Source: NHGRI (Institute 2012)


Nevertheless, the growth in data enabled by next-generation sequencing platforms also provides an exciting opportunity for studying microbial communities, in which 99% of the microbes have not yet been cultured (Riesenfeld, Schloss, and Handelsman 2004). Cultivation-free methods (often summarized as metagenomics) offer novel insights into the biology of the vast majority of life on Earth (Thomas, Gilbert, and Meyer 2012).

Several types of studies use DNA for environmental analyses:

  1. Environmental clone libraries (functional metagenomics): frequent use of Sanger sequencing instead of more cost-efficient next-generation sequencing
  2. Amplicon metagenomics (single gene studies, 16S rDNA): next-generation sequencing of PCR amplified ribosomal genes providing a single reference gene–based view of microbial community ecology
  3. Shotgun metagenomics: use of next-generation technology applied directly to environmental samples
  4. Metatranscriptomics: use of cDNA transcribed from mRNA

Each of these methods has strengths and weaknesses (see (Thomas, Gilbert, and Meyer 2012)), as do the various sequencing technologies (see (Loman et al. 2012)).

To support user-driven analysis of all types of metagenomic data, we have provided MG-RAST (Meyer et al. 2008) to enable researchers to study the function and composition of microbial communities. The MG-RAST portal offers automated quality control, annotation, comparative analysis, and archiving services. At the time of writing MG-RAST has completed the analysis of over 100 terabasepairs of DNA data in over 250,000 datasets contributed by thousands of researchers worldwide.
The MG-RAST system provides answers to the following scientific questions:
  • Who is out there? Identifying the composition of a microbial community either by using amplicon data for single genes or by deriving community composition from shotgun metagenomic data using sequence similarities.
  • What are they doing? Using shotgun data (or metatranscriptomic data) to derive the functional complement of a microbial community using similarity searches against a number of databases.
  • Who is doing what? Based on sequence similarity searches, identifying the organisms encoding specific functions.

The system supports the analysis of the prokaryotic content of samples; analysis of viruses and eukaryotic sequences is not currently supported because of software limitations.

MG-RAST users can upload raw sequence data in FASTQ, FASTA, and SFF format; the sequences will be normalized (quality controlled) and processed, and summaries are generated automatically. The server provides several methods to access the different data types, including phylogenetic and metabolic reconstructions, and the ability to compare the metabolism and annotations of one or more metagenomes, individually or in groups. Access to the data is password protected unless the owner has made it public, and all data generated by the automated pipeline is available for download in a variety of common formats.

Brief description

The MG-RAST pipeline performs quality control, protein prediction, clustering, and similarity-based annotation on nucleic acid sequence datasets using a number of bioinformatics tools (see Section 13.2.1). MG-RAST was built to analyze large shotgun metagenomic datasets ranging in size from megabases to terabases. We also support amplicon (16S, 18S, and ITS) sequence datasets and metatranscriptome (RNA-seq) sequence datasets. The current MG-RAST pipeline is not capable of predicting coding regions from eukaryotes and thus will be of limited use for eukaryotic shotgun metagenomes and/or the eukaryotic subsets of shotgun metagenomes.

Data on MG-RAST is private to the submitting user unless shared with other users or made public by the user. We strongly encourage the eventual release of data and require metadata (“data describing data”) for data sharing or publication. Data submitted with metadata will be given priority for the computational queue.

You need to provide (raw or assembled) nucleotide sequence data and sample descriptions (“metadata”). The system accepts sequence data in FASTA, FASTQ, and SFF format and metadata in the form of GSC ( http:/// ) standard-compliant checklists (see Yilmaz et al., Nature Biotech, 2011). Data can be uploaded to the system via either the web interface or a command line tool. Data and metadata are validated after upload.

You must choose quality control filtering options at the time you submit your job. MG-RAST provides several options for quality control (QC) filtering for nucleotide sequence data, including removal of artificial duplicate reads, quality-based read trimming, length-based read trimming, and screening for DNA of model organisms (or humans). These filters are applied before the data are submitted for annotation.

The MG-RAST pipeline assigns an accession number and puts the data in a queue for computation. The similarity search step is computationally expensive. Small jobs can complete as fast as hours, while large jobs can spend a week waiting in line for computational resources.

MG-RAST performs a protein similarity search between predicted proteins and database proteins (for shotgun) and a nucleic-acid similarity search (for reads similar to 16S and 18S sequences).

MG-RAST presents the annotations via the tools on the analysis page which prepare, compare, display, and export the results on the website. The download page offers the input data, data at intermediate stages of filtering, the similarity search output, and summary tables of functions and organisms detected.

MG-RAST can compare thousands of datasets run through a consistent annotation pipeline. We also provide a means to view annotations in multiple namespaces (e.g., SEED functions, KEGG Orthology (KO) terms, COG classes, eggNOGs) via the M5nr.

The publication “Metagenomics-a guide from sampling to data analysis” (PMID 22587947) in Microbial Informatics and Experimentation, 2012 is a good review of best practices for experiment design for further reading.

License

Citing MG-RAST

A significant number of papers have been published about MG-RAST itself and the supporting platform; however, if you use the system, we ask that you cite:
The Metagenomics RAST server — A public resource for the automatic phylogenetic and functional analysis of metagenomes
F. Meyer, D. Paarmann, M. D’Souza, R. Olson , E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards
BMC Bioinformatics 2008, 9:386

http://www./1471-2105/9/386

In addition if you also use the API please cite:
A RESTful API for Accessing Microbial Community Data for MG-RAST
A. Wilke, J. Bischof, T. Harrison, T. Brettin, M. D’Souza, W. Gerlach, H. Matthews, T. Paczian, J. Wilkening, E. M. Glass, N. Desai, F. Meyer
PLOS Comp Biology 2015, DOI: 10.1371/journal.pcbi.1004008

http://journals./ploscompbiol/article?id=10.1371/journal.pcbi.1004008

Version history

Version 1

The original version of MG-RAST was developed in 2007 by Folker Meyer, Andreas Wilke, Daniel Paarmann, Bob Olson, and Rob Edwards. It relied heavily on the SEED (Overbeek et al. 2005) environment and allowed upload of preprocessed 454 and Sanger data.

Version 2

Version 2, released in 2008, had numerous improvements. It was optimized to handle full-sized 454 datasets and was the first version of MG-RAST that was not fully SEED based. Version 2.0 used BLASTX analysis for both gene prediction and functional classification (Meyer et al. 2008).

Version 3

While version 2 of MG-RAST was widely used, it was limited to datasets smaller than a few hundred megabases, and comparison of samples was limited to pairwise comparisons. Version 3 is not based on SEED technology; instead, it uses the SEED subsystems as a preferred data source. Starting with version 3, MG-RAST moved to GitHub.

Version 3.6

With version 3.6 MG-RAST was containerized, moving from a bare metal infrastructure to a set of Docker containers running in a Fleet/systemd/etcd environment.

Version 4

Version 4.0 brings a new web interface that relies fully on the API for data access, and moves the bulk of the stored data from Postgres to Cassandra. The new web interface moves the data visualization burden from the web server to the client's machine, using JavaScript and HTML5 heavily.

In version 4.0 we have changed the backend store for profiles. While previous versions stored a pre-computed mapping of observed abundances to functional or taxonomic categories, this mapping is now computed on the fly. The number of profiles stored is reduced to the MD5 and LCA profiles. The API has been augmented to allow dynamic mapping to categories; to provide the required bandwidth we have migrated the profile store from Postgres to Cassandra.

The web interface of the previous version predated the API; the user interface for version 4.0 now uses the API. The web interface has been rewritten in JavaScript/HTML5. Unlike previous versions, the web interface is now executed on the client (inside the browser) and supports any recent browser.

With version 4.04 we are switching the main web site to be and are also turning on https by default. For a limited time, the unencrypted access protocols will remain available. We encourage all users to upgrade their bookmarks and also install upgraded versions of the CRAN package and/or the Python tool suite. We also switched the similarity tool to DIAMOND (Buchfink, Xie, and Huson 2015).

Comparison of versions 2 and 3

Version 3 added the ability to analyze massive amounts of Illumina reads by introducing a significant number of changes to the pipeline and the underlying platform technology. In version 3 we introduced the notion of the API as the central component of the system.

In the 3.0 version, datasets of tens of gigabases can be annotated, and comparison of taxa or functions that differ between samples is now limited only by the available screen real estate. Figure 1.1 shows a comparison of the analytical and computational approaches used in MG-RAST v2 and v3. The major changes are the inclusion of a dedicated gene-calling stage using FragGeneScan (Rho, Tang, and Ye 2010), clustering of predicted proteins at 90% identity using uclust (Edgar 2010), and the use of BLAT (Kent 2002) for the computation of similarities. Together with changes in the underlying infrastructure, this version has allowed dramatic scaling of the analysis with the limited hardware available.

Similar to version 2.0, the new version of MG-RAST does not pretend to know the correct parameters for the transfer of annotations. Instead, users are empowered to choose the best parameters for their datasets.

Comparison of versions 3 and 4

The roadmap for version 4 has a number of key elements that will be implemented step by step. Currently the following features are implemented:

  • New JavaScript web interface using the API
  • Cassandra instead of Postgres as main data store for profiles
Overview of processing pipeline in (left) MG-RAST v2 and (right) MG-RAST v3. In the old pipeline, metadata was rudimentary, compute steps were performed on individual reads on a 40-node cluster that was tightly coupled to the system, and similarities were computed by BLAST to yield abundance profiles that could then be compared on a per sample or per pair basis. In the new pipeline, rich metadata can be uploaded, normalization and feature prediction are performed, faster methods such as BLAT are used to compute similarities, and the resulting abundance profiles are fed into downstream pipelines on the cloud to perform community and metabolic reconstruction and to allow queries according to rich sample and functional metadata.


The new version of MG-RAST represents a rethinking of core processes and data products, as well as new user interface metaphors and a redesigned computational infrastructure. MG-RAST supports a variety of user-driven analyses, including comparisons of many samples, previously too computationally intensive to support for an open user community.

Scaling to the new workload required changes in two areas: the underlying infrastructure needed to be rethought, and the analysis pipeline needed to be adapted to address the properties of the newest sequencing technologies.

The MG-RAST team

MG-RAST was started by Rob Edwards and Folker Meyer in 2007. The MG-RAST team has significantly expanded in the past few years. The team is listed below.

  • Andreas Wilke
  • Wolfgang Gerlach
  • Travis Harrison
  • William L. Trimble
  • Folker Meyer

MG-RAST alumni

The following people were associated with MG-RAST in the past:

  • Daniel Paarmann, 2007-2008
  • Rob Edwards, 2007-2008
  • Mike Kubal, 2007-2008
  • Alex Rodriguez, 2007-2008
  • Bob Olson, 2007-2009
  • Daniela Bartels, 2007-2011
  • Yekaterina Dribinsky, 2011
  • Jared Wilkening, 2007-2013
  • Mark D’Souza, 2007-2014
  • Hunter Matthews, 2009-2014
  • Narayan Desai, 2011-2014
  • Wei Tang, 2012-2015
  • Daniel Braithwaite, 2012-2015
  • Elizabeth M. Glass, 2008-2016
  • Jared Bischof, 2010-2016
  • Kevin Keegan, 2009-2016
  • Tobias Paczian, 2007-2018

Under the hood: The MG-RAST technology platform

The backend

While originally MG-RAST data was stored in a shared filesystem and a MySQL database, the backend store evolved with growing popularity and demand.

Currently a number of data stores are used to provide the underpinning for various parts of the MG-RAST API.

An approximate mapping of stores to functions in version 4.0 is provided in table [xtab:v4-stores-to-API].

Mapping of API functions to data stores

Function         Data store                       Comment
Search           Apache Solr and Elasticsearch
Profiles         Cassandra and SHOCK
M5NR             Cassandra
Authentication   MySQL
Project          MySQL
Access control   MySQL
Metadata         MySQL
Files            SHOCK

The backend infrastructure and the overall system layout is shown in figure 2.1.

Overview of the production system in mid 2016. Fleet is used to manage a number of containerized services (shown with dashed lines). Two services are provisioned outside the Fleet system: SHOCK (providing 0.7 petabyte of storage) and a Postgres cluster. We note the significant number of different databases used to serve data required for the API.


As of version 3.6 the majority of the services are provisioned as containers, managed as a set of Fleet units described in https://github.com/MG-RAST/MG-RAST-infrastructure/tree/master/fleet-units.

The supporting technologies: Skyport, AWE and SHOCK

One key aspect of scaling MG-RAST to large numbers of modern NGS datasets is the use of cloud computing [2], which decouples MG-RAST from its previous dedicated hardware resources.

We use AWE (Wilke et al. 2011), an efficient, open source resource manager, to execute the MG-RAST workflow. We expanded AWE to work with Linux containers, forming the Skyport system (Gerlach et al. 2014). AWE and Skyport use RESTful interfaces, thus allowing the addition of clients without the need to add firewall exceptions and/or massive system reconfiguration.

The main MG-RAST data store is the SHOCK data management system (Wilke et al. 2015), developed alongside AWE. SHOCK, like AWE, relies on a RESTful interface instead of a more traditional shared file system.

When we introduced the technologies described above to replace a shared file system (Sun NFS mounted on several hundred nodes), we saw a speedup by a factor of 750 on identical hardware.

Data model

The MG-RAST data model (see Figure 2.2) has changed dramatically in order to handle the size of modern next-generation sequencing datasets. In particular, we have made a number of choices that reduce the computational and storage burden.

We note that the size of the derived data products for a next-generation dataset in MG-RAST is typically about 10x the size of the actual dataset. Individual datasets now may be as large as a terabase [3], with the on-disk footprint significantly larger than the basepair count because FASTQ files store a quality score alongside every base, which roughly doubles the on-disk size.

  • Abundance profiles. Using abundance profiles, where we count the number of occurrences of function or taxon per metagenomic dataset, is one important factor that keeps the datasets manageable. Instead of growing the dataset sizes (often with several hundred million individual sequences per dataset), the data products now are more or less static in size.
  • Single similarity computing step per feature type. By running exactly one similarity computation for proteins and another one for rRNA features, we have limited the computational requirements.
  • Clustering of features. By clustering features at 90% identity, we reduce the number of times we compute on similar proteins. Abundant features will be even more efficiently clustered, leading to more compression for abundant species.
MG-RAST v3 data model.


As shown in Figure 2.2, MG-RAST relies on abundance profiles to capture information for each metagenome. The following abundance profiles are calculated for every metagenome.

  • MD5s – number of sequences (clusters) per database entry in the M5nr.
  • functions – summary of all the MD5s that match a given function.
  • ontologies – summary of all the MD5s that match a given hierarchy entry.
  • organisms – summary of all MD5s that match a given taxon entry.
  • lowest common ancestors
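
The text does not spell out how the lowest common ancestor (LCA) entry is derived. A common way to do it, sketched here as an assumption and not as the MG-RAST implementation, is to take the taxonomic lineages of all database sequences hit by a feature and keep the longest prefix they share. The function name `lowest_common_ancestor` and the list-of-ranks representation are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of several root-to-leaf lineages.

    Each lineage is assumed to be a list of taxon names ordered from the
    root (e.g. domain) down to the most specific rank available.
    """
    if not lineages:
        return []
    lca = []
    # zip stops at the shortest lineage, so ranks are compared level by level
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            lca.append(ranks[0])
        else:
            break
    return lca


hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hits))
```

With the three example lineages, the shared prefix ends at the phylum level, so a feature with these hits would be counted at that level instead of being assigned to any single genus.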

The static helper tables (shown in blue in Figure [fig:mgrast_analysis-schema]) help keep the main tables smaller by normalizing and providing integer representations for the entities in the abundance profiles.


The MG-RAST pipeline

Details of the analysis pipeline for MG-RAST version 3


MG-RAST provides automated processing of environmental DNA sequences via a pipeline. We restrict the pipeline annotations to protein coding genes and ribosomal RNA (rRNA) genes. The pipeline has multiple steps that can be grouped into five stages:

  • Data hygiene:
    Quality control and removal of artifacts.
  • Feature extraction:
    Identification of protein coding and rRNA features (aka “genes”)
  • Feature annotation:
    Identification of putative functions and taxonomic origins for each of the features
  • Profile generation:
    Creation of multiple on disk representations of the information obtained above.
  • Data loading:
    Loading the representations into the appropriate databases.

The pipeline shown in Figure 3.1 contains a significant number of improvements over previous versions and is optimized for accuracy and computational cost.

Using the M5nr (Wilke et al. 2012) (an MD5 nonredundant database), the new pipeline computes results against many reference databases instead of only SEED. Several key algorithmic improvements were needed to support the flood of user-generated data (see Figure [fig:mgrast-job-sizes]). Using dedicated software to perform gene prediction instead of using a similarity-based approach reduces runtime requirements. The additional clustering of proteins at 90% identity reduces data while preserving biological signals.

Below we describe each step of the pipeline in some detail. All datasets generated by the individual stages of the processing pipeline are made available as downloads. Appendix 11 lists the available files for each dataset.

Data hygiene

Preprocessing

After upload, data is preprocessed by using SolexaQA (Cox, Peterson, and Biggs 2010) to trim low-quality regions from FASTQ data. Platform-specific approaches are used for 454 data submitted in FASTA format: reads more than two standard deviations away from the mean read length are discarded, following Huse et al. (2007). All sequences submitted to the system remain available, but discarded reads will not be analyzed further.
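
The length-based filter for 454 data can be illustrated with a short sketch. It assumes that reads are held in memory as plain strings and that the population standard deviation is used; the function name `filter_by_length` and the `max_sd` parameter are hypothetical and not taken from the MG-RAST code base.

```python
from statistics import mean, pstdev


def filter_by_length(reads, max_sd=2.0):
    """Split reads into (kept, discarded) by distance from the mean length.

    A read is discarded when its length is more than `max_sd` standard
    deviations away from the mean read length of the whole dataset.
    """
    if not reads:
        return [], []
    lengths = [len(read) for read in reads]
    mu = mean(lengths)
    sd = pstdev(lengths)  # population standard deviation (an assumption)
    kept, discarded = [], []
    for read in reads:
        if abs(len(read) - mu) > max_sd * sd:
            discarded.append(read)
        else:
            kept.append(read)
    return kept, discarded


# Nine reads of 100 bp and one outlier of 400 bp
reads = ["A" * 100] * 9 + ["A" * 400]
kept, discarded = filter_by_length(reads)
print(len(kept), len(discarded))
```

In this example the mean length is 130 and the standard deviation is 90, so only the 400 bp read lies outside the two-standard-deviation window and is set aside.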

Dereplication

For shotgun metagenome and shotgun metatranscriptome datasets we perform a dereplication step. We use a simple k-mer approach to rapidly identify all sequences that share an identical 20-character prefix. This step is required in order to remove Artificial Duplicate Reads (ADRs) (Gomez-Alvarez, Teal, and Schmidt 2009). Instead of simply discarding the ADRs, we set them aside and use them later for error estimation.

We note that dereplication is not suitable for amplicon datasets that are likely to share common prefixes.
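
A minimal sketch of prefix-based dereplication is shown below. It is an illustration under stated assumptions, not the pipeline's code: reads are `(read_id, sequence)` tuples, the first read seen for each prefix is the one that is kept, and the name `dereplicate` is hypothetical. Only the 20-character prefix length comes from the description above.

```python
from collections import defaultdict


def dereplicate(reads, prefix_len=20):
    """Group reads by their first `prefix_len` characters.

    Returns (unique, duplicates). The first read seen for each prefix goes
    into `unique`; later reads with the same prefix are set aside in
    `duplicates`, keyed by prefix, so they remain available for error
    estimation.
    """
    seen = {}
    duplicates = defaultdict(list)
    for read_id, seq in reads:
        prefix = seq[:prefix_len]
        if prefix in seen:
            duplicates[prefix].append((read_id, seq))
        else:
            seen[prefix] = (read_id, seq)
    return list(seen.values()), dict(duplicates)


reads = [
    ("r1", "ACGT" * 5 + "AAAA"),
    ("r2", "ACGT" * 5 + "CCCC"),  # same 20-character prefix as r1
    ("r3", "TTTT" * 6),
]
unique, duplicates = dereplicate(reads)
print([read_id for read_id, _ in unique])
```

Because a dictionary lookup on the prefix replaces any pairwise comparison, the work grows linearly with the number of reads.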

DRISEE

MG-RAST v3 uses DRISEE (Duplicate Read Inferred Sequencing Error Estimation) (Keegan et al. 2012) to analyze the sets of Artificial Duplicate Reads (ADRs) (Gomez-Alvarez, Teal, and Schmidt 2009) and determine the degree of variation among prefix-identical sequences derived from the same template. See Section 4.2 for details.

Screening

The pipeline provides the option of removing reads that are near-exact matches to the genomes of a handful of model organisms, including fly, mouse, cow, and human. The screening stage uses Bowtie (Langmead et al. 2009) (a fast, memory-efficient, short read aligner), and only reads that do not match the model organisms pass into the next stage of the annotation pipeline.

Note that this option will remove all reads similar to the human genome and render them inaccessible. This decision was made in order to avoid storing any human DNA on MG-RAST.

Feature identification

Protein coding gene calling

The previous version of MG-RAST used similarity-based gene predictions, an approach that is significantly more expensive computationally than de novo gene prediction. After an in-depth investigation of tool performance (Trimble et al. 2012), we have moved to a machine learning approach: FragGeneScan (Rho, Tang, and Ye 2010). Using this approach, we can now predict coding regions in DNA sequences of 75 bp and longer. Our novel approach also enables the analysis of user-provided assembled contigs.

We note that FragGeneScan is trained for prokaryotes only. While it will identify proteins for eukaryotic sequences, the results should be viewed as more or less random.

rRNA detection

An initial search using vsearch (???) against a reduced RNA database efficiently identifies ribosomal RNA. The reduced database is a 90% identity clustered version of the SILVA, Greengenes and RDP databases and is used to rapidly identify sequences with similarities to ribosomal RNA.

Feature annotation

Protein filtering

We identify possible protein-coding regions that overlap ribosomal RNAs and exclude them from further processing.

AA clustering

MG-RAST builds clusters of proteins at the 90% identity level using cd-hit (???), preserving the relative abundances. These clusters greatly reduce the computational burden of comparing all pairs of short reads, while clustering at 90% identity preserves sufficient biological signal.

Protein identification

Once created, a representative (the longest sequence) for each cluster is subjected to similarity analysis.
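
Selecting the representative can be sketched as follows. The sketch does not perform the clustering itself; it assumes the clusters have already been parsed into lists of `(seq_id, sequence)` tuples, and the name `pick_representatives` is hypothetical. The cluster size is carried along so that relative abundances are preserved.

```python
def pick_representatives(clusters):
    """Return one (representative, abundance) pair per cluster.

    `clusters` is assumed to be an iterable of non-empty lists of
    (seq_id, sequence) tuples. The longest sequence is the representative;
    the abundance is the number of cluster members.
    """
    result = []
    for members in clusters:
        # max() returns the first longest member if several share the length
        representative = max(members, key=lambda member: len(member[1]))
        result.append((representative, len(members)))
    return result


clusters = [
    [("a", "MKV"), ("b", "MKVLA")],
    [("c", "MT")],
]
print(pick_representatives(clusters))
```

Only the representatives are passed to the similarity search; the stored member count allows each hit to be weighted by the number of sequences it stands for.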

For rRNA similarities, instead of BLAST we use sBLAT, an implementation of the BLAT algorithm (Kent 2002), which we parallelized using OpenMP (Board 2011) for this work.

As of version 4.04 we have migrated to DIAMOND (Buchfink, Xie, and Huson 2015) to compute protein similarities against M5nr (Wilke et al. 2012). During computation, protein and rRNA sequences are represented only via a sequence-derived identifier (an MD5 checksum). Once the computation completes, we generate a number of representations of the observed similarities for various purposes.
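
A sequence-derived identifier of this kind can be computed with Python's standard `hashlib` module. The normalization step (stripping whitespace and upper-casing) is an assumption made for this sketch so that identical sequences map to the same identifier; the normalization used by M5nr itself may differ, and the name `sequence_md5` is hypothetical.

```python
import hashlib


def sequence_md5(sequence):
    """Return the hexadecimal MD5 digest of a sequence string."""
    # Assumed normalization: ignore surrounding whitespace and letter case
    normalized = sequence.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Two spellings of the same sequence yield the same identifier
print(sequence_md5("mkv") == sequence_md5("MKV\n"))
```

Because the identifier depends only on the sequence, the same protein found in several source databases collapses to a single entry, which is what makes a nonredundant database possible.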

Once the similarities are computed, we present reconstructions of the species content of the sample based on the similarity results. We reconstruct the putative species composition of the sample by looking at the phylogenetic origin of the database sequences hit by the similarity searches.

Sequence similarity searches are computed against a protein database derived from the M5nr (Wilke et al. 2012), which provides nonredundant integration of many databases: GenBank (Benson et al. 2013), SEED (Overbeek et al. 2005), IMG (Markowitz et al. 2008), UniProt (Magrane and Consortium 2011), KEGG (Kanehisa 2002), and eggNOG (Jensen et al. 2008).

rRNA clustering

The rRNA-similar reads are then clustered at 97% identity using cd-hit, and the longest sequence is picked as the cluster representative.

rRNA identification

A BLAT similarity search for the longest cluster representative is performed against the M5rna database, which integrates SILVA (Pruesse et al. 2007), Greengenes (DeSantis et al. 2006), and RDP (Cole et al. 2003).

Profile generation

In the final stage, the data computed so far is integrated into a number of data products. The most important ones are the abundance profiles.

Abundance profiles represent a pivoted and aggregated version of the similarity files. We compute best hit, representative hit and LCA abundance profiles (see 4.5).

Database loading

In the final step the profiles are loaded into the respective databases.

MG-RAST data products

MG-RAST provides a number of data products in a variety of formats.

  • Fasta and FastQ
    Sequence data can be downloaded via the API and web interface as Fasta (or FastQ) files
  • JSON
    Metadata, tables, and other structured data can be downloaded via the API or the web site in JSON format.
  • Spreadsheet
    Metadata and tables can be downloaded as spreadsheets via the web interface.
  • SVG and PNG
    Images can be downloaded via the web site interface in SVG and PNG formats.
  • BIOM v1
    BIOM (McDonald et al. 2012) files can be downloaded via the web interface for use with e.g., QIIME (Caporaso et al. 2010).
  • Sequence data
    The originally submitted sequence data as well as the various subsets resulting from processing can be downloaded.
  • Metadata
    data describing data in GSC-compliant format.
  • Analysis results – results of running the MG-RAST pipeline. The list includes all intermediate data products and is intended to serve as a basis for further analysis outside the MG-RAST pipeline.

    Details on the individual files are in Appendix 11.

Abundance profiles

Abundance profiles are the primary data product that MG-RAST’s user interface uses to display information on the datasets.

Using the abundance profiles, the MG-RAST system defers making a decision on when to transfer annotations. Since there is no well-defined threshold that is acceptable for all use cases, the abundance profiles contain all similarities and require their users to set cut-off values.

The threshold for annotation transfer can be set by using the following parameters: e-value, percent identity, and minimal alignment length.
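
The effect of these cut-offs can be sketched with a small in-memory example. The `Hit` record, the function name `abundance_profile`, and the default values are assumptions made for illustration; only the three parameter types (e-value, percent identity, alignment length) come from the text above.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    md5: str          # identifier of the matched database sequence
    evalue: float
    identity: float   # percent identity, 0-100
    aln_length: int   # alignment length
    abundance: int    # number of sequences represented by the query


def abundance_profile(hits, max_evalue=1e-5, min_identity=60.0, min_length=15):
    """Sum abundances per MD5 over the hits that pass all three cut-offs."""
    profile = Counter()
    for hit in hits:
        if (hit.evalue <= max_evalue
                and hit.identity >= min_identity
                and hit.aln_length >= min_length):
            profile[hit.md5] += hit.abundance
    return dict(profile)


hits = [
    Hit("m1", 1e-10, 95.0, 50, 3),
    Hit("m1", 1e-3, 95.0, 50, 7),   # fails the e-value cut-off
    Hit("m2", 1e-8, 40.0, 50, 2),   # fails the identity cut-off
    Hit("m2", 1e-8, 80.0, 30, 5),
]
print(abundance_profile(hits))
print(abundance_profile(hits, min_length=40))
```

Since all similarities are stored, tightening or loosening a cut-off only changes which hits are summed; no similarity search has to be repeated.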

The taxonomic profiles use the NCBI taxonomy. All taxonomic information is projected against this data. The functional profiles are available for data sources that provide hierarchical information. These currently comprise the following.

  • SEED Subsystems

    The SEED subsystems (Overbeek et al. 2005) represent an independent reannotation effort that powers, for example, the RAST (Aziz et al. 2008) effort. Manual curation of subsystems makes them an extremely valuable data source.

    Subsystems represent a four-level hierarchy:

    1. Subsystem level 1 – highest level
    2. Subsystem level 2 –
    3. Subsystem level 3 – similar to a KEGG pathway
    4. Subsystem level 4 – actual functional assignment to the feature in question

    The page at http://pubseed./SubsysEditor.cgi allows browsing the subsystems.

  • KEGG Orthologs

    We use the KEGG (Kanehisa 2002) enzyme number hierarchy to implement a four-level hierarchy.

    1. KEGG level 1 – first digit of the EC number (EC:X.*.*.*)
    2. KEGG level 2 – first two digits of the EC number (EC:X.Y.*.*)
    3. KEGG level 3 – first three digits of the EC number (EC:X.Y.Z.*)
    4. KEGG level 4 – entire four digits EC number

    We note that KEGG data is no longer available for free download. We thus have to rely on using the latest freely downloadable version of the data.

    The high-level KEGG categories are as follows.

    1. Cellular Processes
    2. Environmental Information Processing
    3. Genetic Information Processing
    4. Human Diseases
    5. Metabolism
    6. Organismal Systems
  • COG and EGGNOG Categories

    The high-level COG and EGGNOG categories are as follows.

    1. Cellular Processes
    2. Information Storage and Processing
    3. Metabolism
    4. Poorly Characterized

    We note that for most metagenomes the coverage of each of the four namespaces is quite different. The “source hits distribution” (see Section [section:source-hits-distribution]) provides information on how many sequences per dataset were found for each database.
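
The four-level hierarchy described under KEGG Orthologs above can be derived mechanically from an EC number. The following sketch shows the idea; the function name `ec_levels` and the wildcard notation in the returned labels are illustrative assumptions that follow the pattern used in the list of KEGG levels.

```python
def ec_levels(ec_number):
    """Split an EC number such as 'EC:2.7.1.1' into its four level labels."""
    if ec_number.startswith("EC:"):
        ec_number = ec_number[3:]
    fields = ec_number.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields, got {ec_number!r}")
    # Level i keeps the first i fields and replaces the rest with wildcards
    return [
        "EC:" + ".".join(fields[:i] + ["*"] * (4 - i))
        for i in range(1, 5)
    ]


print(ec_levels("EC:2.7.1.1"))
```

Counting features per level-1 label gives the coarsest functional summary, while the level-4 label corresponds to the complete EC number.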

DRISEE profile

DRISEE (Keegan et al. 2012) is a method for measuring sequencing error in whole-genome shotgun metagenomic sequence data that is independent of sequencing technology and overcomes many of the shortcomings of Phred. It utilizes artificial duplicate reads (ADRs) to generate internal sequence standards from which an overall assessment of sequencing error in a sample is derived. The current implementation of DRISEE is not suitable for amplicon sequencing data or other samples that may contain natural duplicated sequences (e.g., eukaryotic DNA where gene duplication and other forms of highly repetitive sequences are common) in high abundance.   DRISEE results are presented on the Overview page for each MG-RAST sample for which a DRISEE profile can be determined. Total DRISEE error presents the overall DRISEE-based assessment of the sample as a percent error:
