GenoSIS: Genome Data Interpretation Using GIS

Mary E. Dolan, Constance Holden, M. Kate Beard, and Carol J. Bult

Abstract
Introduction

Background

Rationale for developing a Genome Spatial Information System (GenoSIS)

The primary motivation for developing GenoSIS is to support the use of sequence feature maps as tools for pattern discovery as well as graphical abstraction of genome content. It is our assumption that, in addition to having a “parts list” of an organism’s genome, researchers want to explore potential biological significance in how genome features are organized. The visualization component of GenoSIS is more dynamic than genome browsers that are “display only.” For example, GenoSIS allows users to create attribute choropleth maps on the fly in response to such simple queries as “Draw a sequence feature map in which all genome features that are annotated as being involved in DNA replication are displayed in blue.” By integrating pattern detection and pattern matching methods directly with genome visualization, GenoSIS can be used as a tool for generating hypotheses about the biological significance of genome feature organization.

The layered map, a concept fundamental to GIS, provides a useful approach for the integration of diverse biological data: DNA and protein sequence data, gene functionality data, biochemical pathways data, and even image data can be coordinated by location in the genome as different layers of a single genome map. One can mix and match queries, analysis, and visualization among and within layers.

The particular advantages of representing genome data in an already developed spatial information system like ArcView (ESRI, 2001) integrated with the Oracle Spatial database (Oracle Technology Network, 2001) include:

· Powerful standard built-in GIS visualization, query, and analysis tools;

The combination of these features makes for a very powerful analytical tool. Functionally, GenoSIS allows users to view sequence feature maps as graphical objects at user-defined scales of resolution.
Query generation and response are tightly integrated with the visualization component of the system. Multiple scales allow for detection of patterns that might be scale-dependent. If a significant pattern is detected, a user has the ability to test the statistical significance of the pattern and also to use the pattern as a query to search for it in another genome.

The problem of the data deluge

“It now is commonplace to describe molecular biology as being in the middle of an information explosion...this explosion of information is changing the way science is conducted in the field of molecular biology...” (Collado-Vides, 1996).

“An unprecedented wealth of data is being generated by genome-sequencing projects and other experimental efforts to determine the structure and function of biological molecules. The demands and opportunities for interpreting these data are expanding more than ever.” (Baldi, 1998)

An organism’s genome is the full complement of its DNA, which is organized as one or more chromosomes. Arranged along each chromosome in a linear order are thousands of genome features of biological significance that have been annotated by molecular geneticists. These features include genes and the regulatory elements that control expression of genes. In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of the genomes of many species and, currently, the whole genomes of some 800 organisms are available. GenBank (Benson, 2000) is a widely used repository of DNA sequence information. As of April 2002, there are approximately 19,073,000,000 bases in 16,770,000 sequence records in GenBank, and the resource is growing exponentially (http://www.ncbi.nlm./Genbank/GenbankOverview.html).
The figure below from the Oak Ridge National Laboratory Primer on Molecular Genetics (http://www./hgmis/publicat/primer/toc.html ) gives a sense of the amount of information in the human genome.
The biological significance of genome feature context

There are many cases where the spatial organization of genome features has been shown to have biological significance. Recent studies related to gene finding and genome annotation (Rogic et al., 2001) provide evidence of complex spatial interrelationships of genome features: genes are alternatively spliced, genes may be nested within other genes, and genes may overlap. Conservation of gene order in microbes has been the subject of a great deal of analysis (Tamames, 2001). The draft assembly of the mouse genome has just recently been made public (http://www./Info/Press/020506.shtml) and annotation of one particular mouse chromosome (Mural, 2002) seems to indicate that humans and mice share about 97.5 percent of their working DNA. The arrangement of a number of genome features is shown in the chromosome segment below. Color-coding of different non-spatial attributes of the features and side-by-side alignment are used to illustrate the annotation.
This level of human-mouse similarity is somewhat surprising since this is just one percent less than the amount shared by humans and chimpanzees. Previous estimates had been that humans and mice would differ by as much as 15 percent. In all likelihood this conservation is the result of preserving essential functions from the two species’ common ancestor 100 million years ago until today. But, in addition, the researchers speculate that the genes might actually all be identical and that the differences between the species may be due to differences in the regulatory elements that control the expression of the genes. It is well known, for example, that certain regulatory elements have spatial dependencies relative to transcriptional start sites of genes. Previous work has shown that the genetic similarity of the superficially dissimilar mouse and human species is such that the human chromosomes can be cut up into some 150 segments and reassembled to a close approximation of the mouse genome as shown below.
U.S. Department of Energy Human Genome Program: http://www./hgmis
The use of maps in genetics
The figures above are taken from an article (Yunis, 1982) comparing the chromosomes of human, gorilla, chimpanzee, and orangutan. The figure on the left shows a chromosome 5 image of the four species placed side by side. Depending on chromosome structure and biochemical composition, a chromosome shows distinct patterns of segments or bands of light and dark staining when treated with a dye and observed under a microscope. The corresponding figure on the right shows a cytogenetic map, a stylized map of the chromosomes indicating the characteristic bands. The map clearly shows regions of significant similarity or homology as well as regions of significant difference among species.

Another type of map used by geneticists, the genetic-linkage map, was developed in 1913, before scientists knew that genes were made of DNA, to study the spatial association of genes in fruit flies. Rather than actual location along the chromosome, the linkage map shows the relative position of genes based on the rate at which two different genes are inherited together or separately in genetic studies. With the advent of methods for sequencing DNA and manipulating cloned DNA, it is now possible, in principle, to produce a physical map, which associates a precise position on the chromosome with each gene. Additional methods allow genome cartographers to combine the landmarks of these various maps, as shown below, in order to take advantage of the best features of each type of map. The chromosomal and linkage maps show an entire chromosome with corresponding positions connected by lines; the physical maps, measured in kilobases (kb), show a detailed view of a portion of the chromosome, with the last map indicating a single gene with more detail of structure visible at this scale.
http://www.informatics./silver/frame1.3.shtml (Silver, 1995)

Implementation

Spatial genomics data model

The figure below (Casey, 1992) is a representation of the process we are concerned with in the current version of GenoSIS: in the nucleus of a cell, a segment of DNA that is a gene is copied (transcribed) to messenger RNA (mRNA), which is transported to another part of the cell, the ribosome, in which the mRNA is translated to a chain of amino acids that will become a protein. Each protein performs a certain role in the cell, usually interacting with other proteins in a series of complex biochemical pathways.

At the least detailed level of our system is the particular organism, which has one or more genomes, each of which has one or more chromosomes, each of which has numerous features. Both chromosomes and features are DNA sequences; at the most detailed level of resolution we provide the user access to the sequence information. Since features may be composite, as in organisms whose genes are made up of exons (parts that are translated) and introns (parts that are not translated), we indicate that a feature may itself be composed of a set of “subfeatures.”
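As an illustration only, the containment hierarchy described above (organism, genome, chromosome, feature, subfeature) can be sketched with Python dataclasses. This is not the GenoSIS implementation; the class names, the field names, the 1-based inclusive coordinates, and the toy sequence are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """A located stretch of a chromosome; may be composed of subfeatures."""
    name: str
    kind: str                      # e.g. "gene", "exon", "intron"
    start: int                     # assumed 1-based, inclusive
    end: int                       # assumed 1-based, inclusive
    subfeatures: List["Feature"] = field(default_factory=list)

    def sequence(self, chromosome_sequence: str) -> str:
        """Most detailed level: the bases covered by this feature."""
        return chromosome_sequence[self.start - 1:self.end]


@dataclass
class Chromosome:
    name: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


@dataclass
class Genome:
    name: str
    chromosomes: List[Chromosome] = field(default_factory=list)


@dataclass
class Organism:
    genus: str
    species: str
    genomes: List[Genome] = field(default_factory=list)


# Hypothetical toy data: one gene made of two exons and one intron.
gene = Feature("toy_gene", "gene", 1, 9, subfeatures=[
    Feature("exon_1", "exon", 1, 3),
    Feature("intron_1", "intron", 4, 6),
    Feature("exon_2", "exon", 7, 9),
])
chromosome = Chromosome("chr_toy", "ATGGCCTAAGGT", features=[gene])
organism = Organism("Genus", "species", genomes=[Genome("main", [chromosome])])

exon_sequences = [
    sub.sequence(chromosome.sequence)
    for sub in gene.subfeatures
    if sub.kind == "exon"
]
```

Because a subfeature is itself a `Feature`, the same structure covers genes that are not composite (an empty `subfeatures` list) and genes that are.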
The object-relational spatial genomics database schema

We began by describing a spatial genomics data model that is intuitive to the biologist querying genome feature information by attempting to make a comprehensive list of features and interactions in the part of the real world of interest for the biologist. This list is the starting point for constructing a list of entities and relationships in our conceptual model. We implemented the data model in an Oracle 8i object-relational database (ORDB). We used several ORDB features to represent complex structured data: user-defined object types, references, and nested tables. The figure below represents the implemented data model, showing how various aspects of the data model correspond to particular facilities of Oracle 8i.
An object-relational database allows for user-defined object types, which makes building the database more intuitive. Object types can be used to map an object data model directly to an object-relational database schema, rather than restructuring the data model into the flattened row-column format of relational tables in a purely relational database. An object-relational database allows for the use of user-defined object types in application programs that access the database, which makes using the database more intuitive. Application programs can retrieve and manipulate the data as objects and call procedures that use the methods of the object type to perform operations on the object. Since the methods can be stored in the database, data-intensive procedures can be more efficient. Objects can be reused, which makes application development faster and more efficient since the use of objects relieves developers of the need to write a mapping layer between application program objects and database objects. The use of objects, based on the underlying software engineering principle of data abstraction and encapsulation, also makes it easier to understand application program code and to maintain application programs. Each of the objects in our data model is made into an object type: organism, genome, chromosome, DNA sequence, feature, feature set, transcript, protein, and role. The role object points to a set of relational tables representing the current Gene Ontology, which is imported “as is” from the GO site (Gene Ontology, 2000). An object table is built from each of these object types, for example, a table of organisms. Each organism has attributes including an identification number, a genus, a species, and a set of genomes. Most of the organism object attributes are standard SQL data types. The organism type is created with attributes: identification number of number type; genus and species of text string type. 
But its “genomes” attribute is not a simple value; it is a set of genome objects. We represent the organism’s set of genomes as a nested table, a data structure that is part of the Oracle ORDB system. A nested table is an unordered set of data elements, all of the same data type. Nested tables are useful for representing a containment hierarchy or a one-to-many relationship. In our model an organism stands in a one-to-many relationship with its genomes, so we represent an organism’s genomes as a nested object table of genome objects. The same holds for a genome’s set of chromosomes, a chromosome’s set of features, and so forth.

Large objects (LOB) are designed to support unstructured data, which tend to be large and cannot be decomposed into standard components. Our model implementation uses the character large object (CLOB) type for the nucleotide sequence attribute of DNA sequence objects and for the amino acid sequence attribute of protein objects.

Spatial data objects (SDO) are implemented in Oracle Spatial (Oracle Technology Network, 2001), which is an integrated set of functions and procedures that enables spatial data to be stored, accessed, and analyzed quickly and efficiently in an Oracle 8i database. Spatial data represent the location characteristics of real or conceptual objects as those objects relate to the real or conceptual space in which they exist. Any spatial object will have a spatial attribute, which is the geometric representation of its shape in some coordinate space. This is referred to as its geometry and is a vector-based representation of the shape of the feature, for example, an ordered sequence of vertices that are connected by straight-line segments or circular arcs. Oracle Spatial supports three geometric primitive types (points and point clusters; line strings; n-point polygons) as well as geometries composed of collections of these types.
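To make the idea of a vector geometry concrete, the following minimal Python sketch represents a genome feature as a line string, that is, an ordered list of vertices, and a composite feature as a collection of line strings. It does not use or imitate the Oracle Spatial API; the function names, the choice of a second coordinate for the layer offset, and the example coordinates are assumptions for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (position along chromosome, layer offset)
LineString = List[Point]      # ordered vertices joined by straight segments


def feature_line(start: int, end: int, offset: float = 0.0) -> LineString:
    """Represent a feature spanning start..end as a two-vertex line string."""
    return [(float(start), offset), (float(end), offset)]


def extent(geometry: List[LineString]) -> Tuple[float, float]:
    """Smallest and largest chromosome coordinate covered by a collection."""
    xs = [x for line in geometry for x, _ in line]
    return min(xs), max(xs)


def in_window(geometry: List[LineString], lo: float, hi: float) -> bool:
    """Simple window test: does any line string reach into [lo, hi]?"""
    return any(
        min(x for x, _ in line) <= hi and max(x for x, _ in line) >= lo
        for line in geometry
    )


# Hypothetical gene with two exons, stored as a collection of line strings.
gene_geometry = [feature_line(100, 250), feature_line(400, 520)]
gene_extent = extent(gene_geometry)
hits_intron_window = in_window(gene_geometry, 260, 390)
hits_exon_window = in_window(gene_geometry, 240, 300)
```

Storing the exons as separate line strings, rather than one span from 100 to 520, is what lets the window test distinguish a region that falls in the gap between exons from one that touches an exon.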
Spatial objects can be queried through a set of operators and functions for performing area-of-interest and spatial join queries. These methods determine the spatial relationships between entities in the database based on geometric locations, topologies and distances.

Overall, there are three main implementation components to the current prototype Genome Spatial Information System architecture, as shown in the figure:

· Genome sequence and feature information are extracted from GenBank (and other source) flat text files using a Perl script,
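The extraction step mentioned above is done with a Perl script that is not reproduced here. As a rough stand-in, the Python sketch below pulls feature keys and simple locations out of a GenBank-style feature table. It assumes the usual flat-file layout (feature key indented by five spaces, followed by a location such as `190..255` or `complement(337..2799)`), skips compound locations such as `join(...)`, and uses a made-up sample record; the names `FeatureRecord` and `parse_features` are hypothetical.

```python
import re
from typing import List, NamedTuple


class FeatureRecord(NamedTuple):
    kind: str
    start: int
    end: int
    strand: str   # "+" for the given strand, "-" for complement(...)


# A feature key followed by a simple location; qualifier lines (which are
# indented further) and compound locations do not match and are skipped.
_FEATURE_LINE = re.compile(
    r"^ {5}(\S+)\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$"
)


def parse_features(flat_text: str) -> List[FeatureRecord]:
    """Collect (kind, start, end, strand) for each simple feature line."""
    records = []
    for line in flat_text.splitlines():
        match = _FEATURE_LINE.match(line)
        if match:
            kind, complement, start, end = match.groups()
            strand = "-" if complement else "+"
            records.append(FeatureRecord(kind, int(start), int(end), strand))
    return records


# Made-up sample in GenBank feature-table layout.
SAMPLE = """\
FEATURES             Location/Qualifiers
     source          1..5000
     gene            190..255
                     /gene="thrL"
     CDS             complement(337..2799)
                     /product="hypothetical protein"
"""

records = parse_features(SAMPLE)
```

A production loader would also need to handle compound locations and qualifiers; an established parser would be a safer choice than a hand-written regular expression for real GenBank files.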
Using GIS for thematic genome mapping

Maps communicate effectively because humans are very good at quickly extracting patterns and information from spatial depictions, whether the underlying space is a physical or conceptual space. We use the experience of cartographers in visualizing genome features on chromosome maps: adjustable scales and viewing perspectives; highlighting features of interest or filtering those with certain properties; the use of symbols and colors to aid in interpretation.

In a GIS, features are categorized separately and stored in different map layers, which share a common coordinate space. For example, streets and roads might be stored in one layer and buildings in another. Layers can be added that locate measured data, such as annual rainfall, or that locate the occurrence of events, such as disease incidents. This way of organizing data in the GIS makes maps much more flexible to use since these layers can be combined in any manner that is useful. Similarly, the chromosome defines a genome coordinate space. Separate layers can be defined to contain different types of features such as genes and regulatory elements.
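The layer idea can be sketched in a few lines of Python: features are grouped into layers that share the chromosome coordinate space, and a thematic map assigns a display color from a non-spatial attribute. This is a toy illustration rather than ArcView functionality; the feature names, coordinates, roles, and the helper names `build_layers` and `theme_colors` are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class MapFeature(NamedTuple):
    name: str
    layer: str    # e.g. "genes", "regulatory elements"
    start: int    # shared chromosome coordinate space
    end: int
    role: str     # non-spatial attribute used for theming


def build_layers(features: List[MapFeature]) -> Dict[str, List[MapFeature]]:
    """Group features by layer; all layers keep the same coordinates."""
    layers = defaultdict(list)
    for feature in features:
        layers[feature.layer].append(feature)
    return dict(layers)


def theme_colors(layer: List[MapFeature], role: str,
                 highlight: str = "blue", default: str = "gray") -> Dict[str, str]:
    """Color features with the requested role; everything else gets default."""
    return {f.name: (highlight if f.role == role else default) for f in layer}


# Hypothetical features on one chromosome.
features = [
    MapFeature("geneA", "genes", 100, 900, "DNA replication"),
    MapFeature("geneB", "genes", 1200, 2100, "transport"),
    MapFeature("siteA", "regulatory elements", 40, 60, "DNA replication"),
]
layers = build_layers(features)
gene_colors = theme_colors(layers["genes"], "DNA replication")
```

Here `gene_colors` answers a query of the form "display all genes annotated as being involved in DNA replication in blue" for the gene layer only, leaving the regulatory-element layer free to be themed independently.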
Spatial data analysis and visualization

By modeling genomes as spatial objects we can adapt and apply spatial analysis tools to genome features. As an example of the kinds of tools we use, we describe the genomic application of two particularly useful techniques that have been successfully used by spatial scientists: exploratory spatial data analysis (ESDA) and thematic mapping and visualization, as used in GIS. These techniques are based on linking numerical and graphical procedures with a map, that is, a symbolic spatial representation of the underlying space. ESDA and thematic mapping can be applied to genomics data by representing a linear chromosome as a one-dimensional linear map or by representing a circular genome as a circular map with biologically significant features located on the map. The analysis, queries, and visualization may deal with data globally, by processing cases for the whole map, or locally, by processing subsets of the data focused on a part or region of the map, which may involve a sweep through the data region by region.

ESDA is exploratory data analysis (EDA) of data that are identified according to their locations. The aim of non-spatial EDA is to identify data properties for purposes of pattern detection, hypothesis formulation, and model assessment. Extending this to spatial properties of data requires techniques beyond those found in EDA for detecting spatial patterns and anomalies in data, such as spatial autocorrelation, formulating hypotheses based on the location of the data, and assessing spatial models. Complementary to the data analysis methods are the visualization methods used in GIS, which tend to focus on the presentation of spatial properties of the data such as location, size, distance, pattern, and inter-object relations. The usefulness of this approach to genomic data becomes clear when one considers a set of questions that Frances Slater states should be in every geography inventory (Slater, 1982):

· Where is it?
· How much is there at that location?

In much the same way, a biologist might ask:

· Where do we find consensus sequence elements (CSEs)?
· How many elements are there in that genomic region?

A sequence feature map is a one-dimensional, graphical representation of all recognized sequence features. Sequence feature maps are understood to be over-simplifications of what genome space is actually like. Sequence features that appear to be spatially disjunct according to a linear representation of a genome may actually be close neighbors due to the folding of DNA into a multi-dimensional molecule. As the folding of genomes is better understood, we will incorporate this knowledge into a more complex graphical representation method. In the meantime the simpler one-dimensional sequence feature map representation method will be employed for this project for displaying genome sequence features.

Spatial queries in a genomics context require reasoning about the spatial organization of genome features represented on a sequence feature map. Spatial queries use spatial operators (before, after, contains, overlaps) in much the same way that keyword queries use Boolean operators (and, or, not). Our query formalism takes into account metrics, uncertainty, and strandedness when reasoning about sequence features and their attributes.

At their simplest, spatial dependencies are statements or tests about the co-distribution or co-variation of sequence features and/or sequence feature attributes along a genome. We have incorporated into GenoSIS statistical methods to test whether sequence features are clustered or regular in their distribution. We have also incorporated tests for detecting dependencies in attributes, such as gene expression: for example, is the expression of gene A correlated with the expression of gene B?

A genome neighborhood refers to the order and distance of a user-selected set of sequence features along a genome.
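The four spatial operators named above (before, after, contains, overlaps) have a natural reading on a one-dimensional map, sketched below in Python. This is one plausible interpretation, not the GenoSIS query formalism: the `Interval` type, the inclusive coordinates, the optional `max_gap` distance constraint, and the strand check are assumptions, and the treatment of uncertainty mentioned in the text is left out entirely.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    start: int          # inclusive chromosome coordinate
    end: int            # inclusive chromosome coordinate
    strand: str = "+"


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def contains(a: Interval, b: Interval) -> bool:
    """True if b lies entirely within a."""
    return a.start <= b.start and b.end <= a.end


def before(a: Interval, b: Interval, max_gap: Optional[int] = None) -> bool:
    """True if a ends at lower coordinates than b starts.

    If max_gap is given, the number of positions between them must not
    exceed it (a simple distance metric).
    """
    gap = b.start - a.end - 1
    return gap >= 0 and (max_gap is None or gap <= max_gap)


def after(a: Interval, b: Interval, max_gap: Optional[int] = None) -> bool:
    """True if a starts at higher coordinates than b ends."""
    return before(b, a, max_gap)


def same_strand(a: Interval, b: Interval) -> bool:
    return a.strand == b.strand


# Hypothetical features: a regulatory site just in front of a gene.
site = Interval(90, 99)
gene = Interval(100, 400)
antisense = Interval(350, 500, "-")

site_precedes_gene = before(site, gene, max_gap=50) and same_strand(site, gene)
gene_overlaps_antisense = overlaps(gene, antisense)
```

Operators defined this way compose like Boolean operators: a query such as "a site within 50 bases before a gene on the same strand" is just the conjunction shown in `site_precedes_gene`.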
By defining a genome neighborhood, biologists can ask comparative contextual queries across genomes. In other words, instead of asking "Is there a gene in organism X that is like the gene I observed in organism Y?" we can ask, "Does the gene in organism X occur in the same context or gene neighborhood as the gene I observed in organism Y?" The figure below shows a GenoSIS screenshot in which ArcView is used for the study of a particular microbial genome. The built-in query and panning-zooming functions of ArcView allow the user to focus on particular genome features. Non-spatial attributes of features can be used for selection and for altering display characteristics. Many built-in tools are available to the ArcView user. As shown, we have used the “Identify” tool, which will display all attributes of a feature selected by clicking; the “Measure” tool, which will calculate the distance in the conceptual space; and the “Label” tool, which can be used to label any feature by any or all of its non-spatial attributes. As an aid to distinguish closely spaced features, they can be individually colored.
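The contextual query described earlier in this section ("Does the gene in organism X occur in the same context or gene neighborhood as the gene I observed in organism Y?") can be sketched as a comparison of order and spacing. The Python below is a simplified illustration under stated assumptions: features are identified by shared labels across genomes, distance is measured between start coordinates, strand and reversed orientation are ignored, and the function names and example coordinates are hypothetical.

```python
from typing import List, Tuple

Located = Tuple[str, int]                    # (feature label, start coordinate)
Neighborhood = Tuple[List[str], List[int]]   # (order, gaps between neighbors)


def neighborhood(features: List[Located], selected: List[str]) -> Neighborhood:
    """Order and spacing of the selected features along one genome."""
    chosen = sorted((pos, label) for label, pos in features if label in selected)
    order = [label for _, label in chosen]
    gaps = [b[0] - a[0] for a, b in zip(chosen, chosen[1:])]
    return order, gaps


def same_context(n1: Neighborhood, n2: Neighborhood, tolerance: int) -> bool:
    """Same order, and each gap differs by no more than tolerance bases."""
    order1, gaps1 = n1
    order2, gaps2 = n2
    return order1 == order2 and all(
        abs(g1 - g2) <= tolerance for g1, g2 in zip(gaps1, gaps2)
    )


# Hypothetical coordinates for the same three labeled genes in two genomes.
genome_x = [("A", 100), ("B", 900), ("C", 1500), ("D", 4000)]
genome_y = [("C", 2600), ("A", 1200), ("B", 2050)]

context_x = neighborhood(genome_x, ["A", "B", "C"])
context_y = neighborhood(genome_y, ["A", "B", "C"])
loosely_conserved = same_context(context_x, context_y, tolerance=100)
strictly_conserved = same_context(context_x, context_y, tolerance=25)
```

The tolerance parameter makes the scale dependence explicit: the two toy neighborhoods match when spacing may differ by up to 100 bases but not when only 25 bases of difference are allowed.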
Conclusions.

We have developed a Genome Spatial Information System based on the idea of “spatial genomics,” applying concepts and tools of spatial analysis and GIS to the interpretation of genome data. We have used “off the shelf” GIS software to query and visualize genome data and applied customized spatial data analysis tools as a novel approach to several problems of current interest to molecular geneticists.

Acknowledgements.

We acknowledge support from NSF DBI-9723873 and DOE DE-FGO2-99ER62850. This project benefited from the efforts of two undergraduate research assistants, Suzannah Hall and Amber Bethell. We thank Dr. Tom Wheeler for valuable discussions.

References.

Baldi, Pierre and Soren Brunak. (1998) Bioinformatics: The Machine Learning Approach, MIT Press, Cambridge.

Mary E. Dolan (The Jackson Laboratory)
Department of Spatial Information Science and Engineering
5711 Boardman Hall Room 329
The University of Maine
Orono, ME 04469-5711, USA
Office: (207) 581-2143
Fax: (207) 581-2206
mary_dolan@umit.maine.edu

Constance Holden
Department of Spatial Information Science and Engineering
5711 Boardman Hall Room 125
The University of Maine
Orono, ME 04469-5711, USA

M. Kate Beard
5711 Boardman Hall Room 348
Orono, Maine 04469-5711, USA

Carol J. Bult
The Jackson Laboratory
600 Main Street
Bar Harbor, ME 04609, USA